CN105243159B - A kind of distributed network crawler system based on visualization script editing machine - Google Patents
- Publication number
- CN105243159B CN201510713985.7A
- Authority
- CN
- China
- Prior art keywords
- queue
- script
- module
- editing machine
- url link
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention provides a distributed web crawler system based on a visualization script editor, comprising: a visualization script editor, a distributed message queue, a task scheduling module, a webpage capture module, a content processing module and a result storage module. From the input a user makes through a visual interface, the system automatically generates a metadata extraction script that can recognize the structure of a target site and efficiently capture specific data. The task scheduling module creates and assigns tasks, the webpage capture module fetches pages, the content processing module invokes the corresponding script to convert each page into a metadata set, and the result storage module performs the final unified processing and stores the result. The invention can greatly improve the efficiency of crawling data from specific sites, reduce the user's workload and save system resources; it has good scalability and elasticity and is suitable for all types of Internet sites.
Description
Technical field
The present invention relates to the technical field of network communication, and in particular to a distributed web crawler system based on a visualization script editor.
Background art
Since the birth of the Internet at the end of the 20th century, the information on it has grown explosively, and it has long since become a huge, widely distributed, highly heterogeneous, semi-structured and highly dynamic information repository. Web crawlers were created to collect and extract the data people are interested in from this information. Crawler technology has developed rapidly ever since, and on its foundation search engine giants at home and abroad, such as Baidu and Google, have emerged and opened a window onto information for the general public.
Today, Internet information is mainly provided in the form of websites and web services. A website consists of many kinds of web pages, and the data it provides is mostly presented as unstructured, static Hypertext Markup Language (HTML). Since an information analysis system cannot use HTML directly, secondary processing is generally required to extract useful information. A web service, by contrast, is a relatively standardized data interface from which data can be obtained by accessing it with specific parameters; a web service can exist on its own or be combined with a website. How to obtain specific information efficiently and accurately from a large number of specific websites or web services is attracting more and more attention, and it poses a great challenge to the web crawler technology responsible for collecting network information.
Web crawlers have gone through several generations of development, and a variety of system models have taken shape. Mature crawler designs exist at home and abroad and are already in use, but most of them only provide a general-purpose service to public users; they cannot be customized for specific data on specific sites, nor can they take each user's varied needs into account.
In the Internet field, there are currently the following mainstream crawler designs:
1. Traditional crawler system
A traditional crawler system requires professional software programmers to analyze the page organization of the target site, its data interfaces and the JavaScript logic on its pages, and then to write corresponding program code or scripts that filter out specific data according to certain rules. The advantage of this method is obvious: the required data can be extracted accurately from the target site.
However, this method has a major drawback and is generally adopted only when the number of target sites is very limited. The reason is that the HTML used by Internet sites follows no fixed authoring convention, so a corresponding script must be written for every target site; in addition, more and more sites now load content dynamically, which greatly increases the difficulty of writing such scripts. When a monitored site is redesigned, the script must be adjusted promptly and the crawler redeployed. This greatly raises the labor cost of development and maintenance. Moreover, because of its complexity, this approach has poor scalability and elasticity and is ill-suited to large-scale distributed deployment.
2. General-purpose distributed crawler system
A general-purpose distributed crawler system consists of three basic parts: scheduling (control), crawling and content processing. Most current Internet search engines follow this model. For example, the prior art discloses "a topic-related distributed web crawler system" (see Chinese patent publication No. CN102646129A, published 2012-08-22). That system comprises: a topic link store, which stores the hyperlinks that the system has not yet finished crawling; a control node, which extracts hyperlinks from the topic link store, removes those already crawled by the system, distributes the uncrawled hyperlinks to the crawl nodes, and controls whether the system terminates; crawl nodes, which receive the hyperlinks distributed by the control node, download the web pages those hyperlinks identify and store the pages in a web page database; the web page database, which stores the pages fetched by the crawl nodes; and a page analyzer, which periodically reads the latest pages downloaded by the crawl nodes from the web page database, analyzes their content, computes the topic relevance of each page and of the hyperlinks it contains, stores relevant hyperlinks in the topic link store according to that relevance, and stores the topic relevance of each page in the web page database. That invention uses exactly this model. Such crawler systems focus mainly on URL filtering and on the analysis of page topics, and their content processing part essentially relies on a body-text extraction module.
Body-text extraction approaches fall roughly into four groups: (1) extraction algorithms based on tag usage; (2) extraction based on tag density; (3) body-text extraction based on machine learning; and (4) extraction based on vision-based page block analysis. Whichever algorithm is used, it only serves to extract main content such as the body text of a page, and it cannot guarantee the accuracy of the extracted data. These methods work reasonably well in a distributed crawler system, but they are limited by the algorithms they rely on: they suit only broad, horizontal, fuzzy crawling and are inherently deficient when specific data must be crawled. To achieve maximum generality they sacrifice customizability; they can only extract text from a page and cannot isolate specific types of metadata from that text, for example the price of a product on an e-commerce page or the specification of a drug on an online pharmacy page. Furthermore, most body-text analysis algorithms are relatively complex, and when used at scale they consume more system resources than customized scripts, which degrades the performance of the crawler system.
Summary of the invention
The technical problem to be solved by the present invention is to provide a distributed web crawler system based on a visualization script editor that can efficiently perform customized crawling of a large number of specific sites while remaining compatible with general-purpose site crawling, thereby overcoming the drawbacks of the prior art, reducing the user's workload and saving system resources.
The present invention is implemented as follows: a distributed web crawler system based on a visualization script editor, the system comprising: a visualization script editor, a distributed message queue, a task scheduling module, a webpage capture module, a content processing module and a result storage module;
The visualization script editor is used to view a target website and select the data capture regions of the target website; it converts the user's input into an execution chain, generates a corresponding script from the execution chain and stores the script in a database; the script is the script corresponding to the target website;
The distributed message queue is used to decouple the task scheduling module, the webpage capture module, the content processing module and the result storage module; the distributed message queue comprises a scheduling queue, a crawl queue, a processing queue and a result queue;
The task scheduling module is responsible for coordinating the operation of the whole system: it reads the start URL link of the target website and the information entered by the user, packs them into a task and passes the task to the scheduling queue; it also obtains task objects from the scheduling queue, filters out duplicate tasks and sends the remaining tasks to the crawl queue;
The webpage capture module is used to take a URL link from the crawl queue, automatically detect the website's encoding, convert the captured website content to UTF-8 encoding, pack the UTF-8 content together with information about the website, and forward the package to the processing queue;
The content processing module is used to take the web page content of a website from the processing queue and match the URL link of the page against the URL matching rules generated by the visualization script editor; if a match is found, it invokes the script corresponding to that URL matching rule to parse the page content; the result of parsing is passed to the result queue;
The result storage module is used to take result data out of the result queue, apply unified processing and filtering to the result data according to the system's predefined configuration, and then store it in the database.
Further, the system also comprises a monitoring module, which monitors in real time whether any of the four queues in the distributed message queue (the scheduling queue, the crawl queue, the processing queue and the result queue) has encountered an error; when an abnormality is found, it promptly pushes a message to the system's user interface, reminding the user to check the cause of the error and to decide whether to redo the script input.
Further, the system also comprises a text extraction module; when the domain name of a web page matches no script in the database, the text extraction module is invoked to perform extraction for that web page, the text extraction module using a body-text extraction method based on vision-based page block analysis.
Further, after the script is invoked to perform parsing: if the parsing produces new URL links, the new URL links are passed to the scheduling queue and handled again by the task scheduling module; if the parsing produces result data, the parsed result data is passed to the result queue.
Further, the task scheduling module comprises a URL filtering module and a rate manager. The URL filtering module uses a Bloom filter to deduplicate URL links and prevent the same URL link from being crawled repeatedly; a Bloom filter consists of a binary vector and a series of random mapping functions and is used to test whether an element is in a set. The rate manager uses the token bucket algorithm to prevent network congestion: it limits the traffic flowing out to the network so that traffic is sent at a uniform rate, which ensures the stability of the system.
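The rate manager described above can be illustrated with a minimal token bucket sketch. The patent does not give an implementation; the class name `TokenBucket`, its parameters and the injectable `clock` are assumptions made for illustration only.

```python
import time


class TokenBucket:
    """Hypothetical rate manager: tokens refill at `rate` per second up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum number of stored tokens
        self.clock = clock          # injectable time source (an assumption, eases testing)
        self.tokens = capacity
        self.last = clock()

    def try_acquire(self, n=1):
        """Return True and consume n tokens if available, otherwise return False."""
        now = self.clock()
        # Refill according to the elapsed time, never exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A fetch worker would call `try_acquire()` before each outgoing request and wait or requeue the task when it returns `False`. A token bucket allows bursts of up to `capacity` requests; setting `capacity` to 1 comes closest to the uniform sending rate the text describes.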
Further, the webpage capture module comprises a proxy access module and a browser simulation module. The proxy access module accesses the specified URL link through a preset IP proxy according to the user configuration information, which prevents the IP address of the server hosting the webpage capture module from being blocked by the target website because of excessive traffic. The browser simulation module uses the open-source WebKit browser engine to parse the target website; it can execute the JavaScript code on the page and generate the complete page of the target website.
Further, the execution chain comprises several sub-parameters, each of which can be one of several kinds: a selection rule for lower-level URL links, a metadata selection identifier, or script code that the system can execute.
Further, the visualization script editor works as follows:
Step 1: the URL link address of the target website is entered in the visualization script editor interface;
Step 2: the visualization script editor displays the web page content of the target website's URL link;
Step 3: if there is no need to enter a lower-level URL link of this page, go to step 5; if a lower-level URL link needs to be entered, go to step 4;
Step 4: the blocks containing lower-level URL links are selected; the visualization script editor records the positions of these blocks and stores them in an execution chain, all position information being expressed in CSS or XPath syntax; return to step 3;
Step 5: several blocks whose content needs to be captured are selected and stored in the execution chain;
Step 6: the user confirms that editing is complete;
Step 7: the visualization script editor passes the recorded execution chain to a script generator, which produces the capture script for the corresponding target website; for advanced users an additional interface is provided through which code compatible with the system can be written and embedded directly in the capture script;
Step 8: the script is stored in the database.
The present invention has the following advantages: the visualization script editor of the system lets non-professional users intuitively select the capture regions for the relevant data on a target site, and the user's operations are automatically converted into concrete processing scripts. Each distributed processing unit of the crawler system dynamically loads these processing scripts at run time and executes them preferentially, which greatly reduces the labor cost of customizing a crawler while improving the operating efficiency of the crawler system. The system captures data with high accuracy and has high scalability and elasticity.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the system of the present invention.
Fig. 2 is a workflow diagram of the system of the present invention.
Fig. 3 is a schematic diagram of the execution structure of the visualization script editor of the present invention.
Fig. 4 is a workflow schematic diagram of the visualization script editor of the present invention.
Fig. 5 is a flow chart of how the execution chain of the present invention operates.
Fig. 6 is a workflow diagram of the content processing module and scripts of the present invention.
Fig. 7 is a structural schematic diagram of one embodiment of the system of the present invention.
Specific embodiment
Referring to Figs. 1 to 7, a distributed web crawler system based on a visualization script editor according to the present invention comprises: a visualization script editor, a distributed message queue, a task scheduling module, a webpage capture module, a content processing module and a result storage module;
The visualization script editor is used to view the content of the target website visually and to select the data capture regions of the target website. It converts the user's input (everything the user does from entering the target site's URL link until editing is finally completed) together with other optional parameters (for example whether to use text extraction, whether to simulate a browser, and so on) into an execution chain, generates a corresponding script from the execution chain and stores the script in a database. With the visualization script editor the user can view the target website just as when browsing web pages normally, without needing any programming skills. The script is the script corresponding to the target website;
The configuration management module provides a web interface in which the user can configure the websites to be crawled and, for one website or a series of websites, configure a scheduling policy (for example priority, periodic crawling, re-crawl interval), a crawl policy (retry on error, enabling a proxy, enabling browser simulation, and so on) and other configuration parameters, which together form the user configuration information.
The distributed message queue is used to decouple the task scheduling module, the webpage capture module, the content processing module and the result storage module, which gives the system a high capability for distributed deployment. The distributed message queue comprises a scheduling queue, a crawl queue, a processing queue and a result queue;
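The decoupling role of the four queues can be sketched as follows. This is only an in-process stand-in: Python's `queue.Queue` replaces a real distributed message queue, and the names `QUEUE_NAMES`, `make_queues` and `run_stage` are hypothetical and do not appear in the patent.

```python
import queue

# The four queues named in the text; an in-process stand-in for a distributed broker.
QUEUE_NAMES = ("scheduling", "crawl", "processing", "result")


def make_queues():
    """Create one FIFO queue per pipeline stage."""
    return {name: queue.Queue() for name in QUEUE_NAMES}


def run_stage(queues, source, target, handler):
    """Drain `source`, apply `handler` to each item and put the output on `target`.

    A module only knows the names of its input and output queues, never the
    other modules, which is the decoupling the text describes.
    Returns the number of items handled.
    """
    handled = 0
    while True:
        try:
            item = queues[source].get_nowait()
        except queue.Empty:
            return handled
        queues[target].put(handler(item))
        handled += 1
```

Because each stage depends only on queue names, any number of workers running `run_stage` for the same pair of queues could be started or stopped without changes to the other stages.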
The task scheduling module is responsible for coordinating the operation of the whole system: it reads the start URL link of the target website (that is, the website to be processed) and the information entered by the user, packs them into a task and passes the task to the scheduling queue; it also obtains task objects from the scheduling queue, filters out duplicate tasks and sends the remaining tasks to the crawl queue. The task scheduling module comprises a URL filtering module and a rate manager. The URL filtering module uses a Bloom filter to deduplicate URL links and prevent the same URL link from being crawled repeatedly. A Bloom filter consists of a very long binary vector and a series of random mapping functions and is used to test whether an element is in a set. Its advantage is that its space efficiency and query time far exceed those of ordinary algorithms; its disadvantages are a certain false-positive rate and the difficulty of deleting elements. Using a Bloom filter can greatly improve system efficiency, and its disadvantages have no real effect on a crawler system, so it is well suited to this use. The rate manager uses the token bucket algorithm to prevent network congestion: it limits the traffic flowing out to the network so that traffic is sent at a uniform rate, which ensures the stability of the system.
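A minimal Bloom filter, written to match the description above (a bit vector plus several hash functions), might look like the sketch below. The sizes, the use of salted SHA-256 as the "random mapping functions" and the class name `BloomFilter` are illustrative assumptions, not details from the patent.

```python
import hashlib


class BloomFilter:
    """Hypothetical URL filter: a bit vector plus `num_hashes` salted hash functions."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function with an index.
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

For a crawler, a false positive only means that an unseen URL is occasionally skipped, and URLs are never removed from the filter, which is why the two disadvantages mentioned above matter little here.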
The webpage capture module is used to take a URL link from the crawl queue, automatically detect the website's encoding, convert the captured website content to UTF-8 encoding, pack the UTF-8 content together with information about the website, and forward the package to the processing queue. The webpage capture module comprises a proxy access module and a browser simulation module. With the development of network technology, more and more websites now use dynamic page techniques and generate page content with large amounts of JavaScript; conventional page capture can only obtain the source code of a page and cannot execute JavaScript, so it cannot obtain the complete page of the target site, which multiplies the difficulty of data extraction. The proxy access module of the invention accesses the specified URL link through a preset IP proxy according to the user configuration information, which prevents the IP address of the server hosting the webpage capture module from being blocked by the target website because of excessive traffic. The browser simulation module uses the open-source WebKit browser engine to parse the target website; it can execute the JavaScript code on the page and generate the complete page of the target website.
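The encoding step of the webpage capture module can be sketched with the standard library alone. The patent does not say how the encoding is detected; the approach below (try the HTTP header charset, then a `<meta>` charset declaration, then UTF-8) and the function name `to_utf8` are assumptions for illustration.

```python
import re

# Finds charset declarations such as <meta charset="gbk"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?\s*([A-Za-z0-9_-]+)""", re.IGNORECASE
)


def to_utf8(raw, header_charset=None):
    """Return (utf8_bytes, encoding_used) for a fetched page body.

    Candidates are tried in order: HTTP header charset (if the caller has one),
    charset declared in the markup, then UTF-8.
    """
    candidates = []
    if header_charset:
        candidates.append(header_charset)
    match = META_CHARSET.search(raw[:4096])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name).encode("utf-8"), name
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong guess, try the next candidate
    # Last resort: keep the page but replace undecodable bytes.
    return raw.decode("utf-8", errors="replace").encode("utf-8"), "utf-8 (lossy)"
```

A production crawler would probably add statistical charset detection for pages that declare nothing; that is outside this sketch.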
The content processing module is used to take the web page content of a website from the processing queue; if the URL link of the page matches a predefined URL matching rule (the visualization script editor intelligently generates a URL matching rule from the target site URL link that the user entered in the editor beforehand and from the conditions the user set), it invokes the script matching this URL link to parse the web page content of the website; the result of parsing is passed to the result queue. The result storage module is used to take result data out of the result queue, apply unified processing and filtering to the result data according to the system's predefined configuration, and then store it in the database.
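The matching of a page URL against stored URL matching rules can be sketched as a small registry. Representing rules as regular expressions and scripts as Python callables is an assumption; `ScriptRegistry` and `process_page` are hypothetical names.

```python
import re


class ScriptRegistry:
    """Hypothetical store of (URL matching rule, script) pairs."""

    def __init__(self):
        self._rules = []  # list of (compiled pattern, script callable)

    def register(self, url_pattern, script):
        self._rules.append((re.compile(url_pattern), script))

    def find(self, url):
        """Return the script of the first rule matching `url`, or None."""
        for pattern, script in self._rules:
            if pattern.match(url):
                return script
        return None


def process_page(registry, url, html, fallback):
    """Parse `html` with the matching script, or with `fallback` if no rule matches."""
    script = registry.find(url)
    return script(html) if script else fallback(html)
```

The `fallback` argument stands in for the text extraction module that the system calls when no script matches a page.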
The system further comprises a monitoring module and a text extraction module. The monitoring module monitors in real time whether any of the four queues in the distributed message queue (the scheduling queue, the crawl queue, the processing queue and the result queue) has encountered an error; when an abnormality is found, it promptly pushes a message to the system's user interface, reminding the user to check the cause of the error and to decide whether to redo the script input.
When the domain name of a web page matches no script in the database, the text extraction module is invoked to perform extraction for that web page; the text extraction module uses a body-text extraction method based on vision-based page block analysis.
In the present invention, after the script is invoked to perform parsing: if the parsing produces new URL links, the new URL links are passed to the scheduling queue and handled again by the task scheduling module; if the parsing produces result data, the parsed result data is passed to the result queue.
The execution chain comprises several sub-parameters, each of which can be one of several kinds: a selection rule for lower-level URL links, a metadata selection identifier (in a format such as a CSS or XPath selector), or script code that the system can execute.
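One possible in-memory shape for such an execution chain is sketched below. The patent does not define a data format; all class and field names (`LinkRule`, `MetadataSelector`, `CustomCode`, `ExecutionChain`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class LinkRule:
    """Selection rule for lower-level URL links."""
    selector: str
    syntax: str = "css"      # "css" or "xpath"


@dataclass
class MetadataSelector:
    """Identifies one piece of metadata to capture."""
    name: str
    selector: str
    syntax: str = "css"


@dataclass
class CustomCode:
    """Script code written by an advanced user."""
    source: str


Step = Union[LinkRule, MetadataSelector, CustomCode]


@dataclass
class ExecutionChain:
    start_url: str
    steps: List[Step] = field(default_factory=list)
    use_text_extraction: bool = False   # optional parameters mentioned in the text
    simulate_browser: bool = False

    def add(self, step):
        """Append a step and return self so that calls can be chained."""
        self.steps.append(step)
        return self
```

Keeping the three kinds of sub-parameter as separate types lets a script generator decide what to emit for each step by checking its type.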
As shown in Figs. 3, 4 and 5, the visualization script editor works as follows:
Step 1: the URL link address of the target website is entered in the visualization script editor interface;
Step 2: the visualization script editor displays the web page content of the target website's URL link;
Step 3: if there is no need to enter a lower-level URL link of this page, go to step 5; if a lower-level URL link needs to be entered, go to step 4;
Step 4: the blocks containing lower-level URL links are selected; the visualization script editor records the positions of these blocks and stores them in an execution chain (see Fig. 5 for the details of this operation), all position information being expressed in CSS or XPath syntax; return to step 3;
Step 5: several blocks whose content needs to be captured are selected and stored in the execution chain;
Step 6: the user confirms that editing is complete;
Step 7: the visualization script editor passes the recorded execution chain to a script generator, which produces the capture script for the corresponding target website; for advanced users an additional interface is provided through which code compatible with the system can be written and embedded directly in the capture script;
Step 8: the script is stored in the database.
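Step 7 turns recorded selectors into an executable capture script. A much simplified sketch of that idea follows; it assumes the recorded positions are XPath expressions and the page is well-formed markup, and it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. The function name `generate_script` is hypothetical.

```python
import xml.etree.ElementTree as ET


def generate_script(field_xpaths, link_xpath=None):
    """Build a capture script (a callable) from recorded selectors.

    field_xpaths: mapping of metadata name to XPath of the block to capture.
    link_xpath:   optional XPath selecting lower-level link elements.
    """
    def script(markup):
        root = ET.fromstring(markup)  # requires well-formed markup
        data = {}
        for name, xpath in field_xpaths.items():
            node = root.find(xpath)
            # Join all text inside the selected block; None if nothing was selected.
            data[name] = "".join(node.itertext()).strip() if node is not None else None
        links = []
        if link_xpath:
            links = [a.get("href") for a in root.findall(link_xpath) if a.get("href")]
        return {"data": data, "links": links}

    return script
```

Real pages are rarely well-formed XML, so an actual implementation would need a tolerant HTML parser and full CSS or XPath support; the sketch only shows how an execution chain can become a reusable function that yields result data and new links.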
As shown in Fig. 2, the workflow of the system of the present invention is as follows:
(1) The task scheduling module accesses the configuration management module, reads the start URL link and the user configuration information, packs them into a task and passes it to the scheduling queue.
(2) The task scheduling module obtains a task object from the scheduling queue and queries the URL filtering module. If the URL link of this task has not been visited, the task is sent directly to the crawl queue. If it has been visited, the parameters set by the user (revisit time and so on) are checked: if revisiting is allowed, the task is likewise sent to the crawl queue; otherwise the task is discarded. In this way only deduplicated tasks are sent to the crawl queue.
(3) The webpage capture module takes a URL link from the crawl queue, performs the capture operation, automatically detects the website's encoding, converts the captured content to the common UTF-8 encoding, packs it together with information about the website and forwards it to the processing queue.
(4) The content processing module takes web page content from the processing queue. If information such as the domain name of the page matches a script predefined by the user (that is, a script in the database), that script is invoked to parse the content. If the processing produces new URL links, these links are passed to the scheduling queue and re-enter step (2); if it produces result data, the data is passed to the result queue.
(5) The result storage module takes results out of the result queue, performs the final unified processing according to the predefined configuration, and then stores them in the database.
(6) Steps (2) to (5) are repeated until the system receives a stop command.
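The decision made in step (2) can be written out as a short sketch. Because the revisit check needs the time of the last visit, this example keeps a plain dictionary of timestamps rather than a Bloom filter; that substitution, the `revisit_after` task field and the function names are assumptions for illustration.

```python
def should_crawl(url, last_visit, now, revisit_after=None):
    """Decide whether a task's URL goes to the crawl queue.

    last_visit:    mapping of URL to the time it was last sent for crawling.
    revisit_after: minimum number of seconds before a visited URL may be
                   crawled again; None means it is never revisited.
    """
    visited_at = last_visit.get(url)
    if visited_at is None:
        return True                       # never visited: crawl directly
    if revisit_after is None:
        return False                      # visited and revisits not allowed: discard
    return now - visited_at >= revisit_after


def schedule(tasks, last_visit, now):
    """Return the tasks to forward to the crawl queue and record their visit time."""
    accepted = []
    for task in tasks:
        url = task["url"]
        if should_crawl(url, last_visit, now, task.get("revisit_after")):
            last_visit[url] = now
            accepted.append(task)
    return accepted
```

In a deployment with several scheduler instances the visit record would have to live in shared storage; the sketch keeps it in memory for brevity.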
Fig. 7 is a structural schematic diagram of one embodiment of the system of the present invention. Each module of the invention can be deployed as multiple instances on a single machine, as a single instance on each of multiple machines, or as multiple instances on multiple machines; that is, the system of the invention can be deployed in a distributed manner.
In addition, the data objects that this system passes to the message queues are collectively called task objects. A task object contains: 1. content (a URL link, web page content, result data and so on, depending on the message queue); 2. configuration parameters; 3. a status identifier.
In practice, a task object is always taken out of a message queue first, and the relevant information is then taken out of the task object.
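A task object with these three parts could be modelled as below. The field names, the default status value and the JSON wire format are assumptions; the patent only lists the three kinds of information a task object carries.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class TaskObject:
    """Hypothetical message passed between modules through the queues."""
    content: str                                   # URL link, page content or result data
    config: dict = field(default_factory=dict)     # configuration parameters
    status: str = "pending"                        # status identifier

    def to_message(self):
        """Serialize for transport over a message queue."""
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def from_message(cls, message):
        """Rebuild the task object taken out of a message queue."""
        return cls(**json.loads(message))
```

A module would call `TaskObject.from_message` on each message it receives and then read `content` according to the queue the message came from.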
It should be noted that in the present invention the task scheduling module, the webpage capture module, the content processing module and the result storage module can each be started as multiple instances on multiple servers; because they are decoupled by the message queue, instances of any type can be stopped or added at any time. This design maximizes the scalability and elasticity of the system.
In short, from the input a user makes through a visual interface, the system of the present invention automatically generates a metadata extraction script that can recognize the structure of a target site and efficiently capture specific data. The task scheduling module creates and assigns tasks, the webpage capture module fetches pages, the content processing module invokes the corresponding script to convert each page into a metadata set, and the result storage module performs the final unified processing and stores the result. The invention can greatly improve the efficiency of crawling data from specific sites, reduce the user's workload and save system resources; it has good scalability and elasticity and is suitable for all types of Internet sites.
The foregoing is merely a preferred embodiment of the present invention; all equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the scope of the present invention.
Claims (6)
1. A distributed web crawler system based on a visualization script editor, characterized in that the system comprises: a visualization script editor, a distributed message queue, a task scheduling module, a webpage capture module, a content processing module and a result storage module;
The visualization script editor is used to view a target website and select the data capture regions of the target website; it converts the user's input into an execution chain, generates a corresponding script from the execution chain and stores the script in a database; the script is the script corresponding to the target website;
The distributed message queue is used to decouple the task scheduling module, the webpage capture module, the content processing module and the result storage module; the distributed message queue comprises a scheduling queue, a crawl queue, a processing queue and a result queue;
The task scheduling module is responsible for coordinating the operation of the whole system: it reads the start URL link of the target website and the information entered by the user, packs them into a task and passes the task to the scheduling queue; it also obtains task objects from the scheduling queue, filters out duplicate tasks and sends the remaining tasks to the crawl queue;
The webpage capture module is used to take a URL link from the crawl queue, automatically detect the website's encoding, convert the captured website content to UTF-8 encoding, pack the UTF-8 content together with information about the website, and forward the package to the processing queue;
The content processing module is used to take the web page content of a website from the processing queue; if the URL link of the page matches a predefined URL matching rule, it invokes the script corresponding to the matched URL matching rule to parse the web page content of the website; the result of parsing is passed to the result queue;
The result storage module is used to take result data out of the result queue, apply unified processing and filtering to the result data according to the system's predefined configuration, and then store it in the database;
The system further comprises a monitoring module, which monitors in real time whether any of the four queues in the distributed message queue (the scheduling queue, the crawl queue, the processing queue and the result queue) has encountered an error; when an abnormality is found, it promptly pushes a message to the system's user interface, reminding the user to check the cause of the error and to decide whether to redo the script input;
The system further comprises a text extraction module; when the domain name of a web page matches no script in the database, the text extraction module is invoked to perform extraction for that web page, the text extraction module using a body-text extraction method based on vision-based page block analysis.
2. a kind of distributed network crawler system based on visualization script editing machine according to claim 1, feature
It is: if what is generated after dissection process is new URL link, new URL link is passed to the scheduling queue, is held again
Row task scheduling modules;If after dissection process being result data, the result data after parsing is passed in result queue.
3. The distributed network crawler system based on a visualization script editor according to claim 1, characterized in that: the task scheduling module includes a URL filtering module and a rate manager; the URL filtering module uses a Bloom filter to deduplicate URL links, preventing the same URL link from being crawled repeatedly; a Bloom filter consists of a binary vector and a series of random mapping functions, and is used to test whether an element is in a set; the rate manager uses the token bucket algorithm to prevent network congestion, limiting the traffic flowing out to the network so that traffic is sent at a uniform rate, which ensures the stability of the system.
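The two mechanisms named in claim 3 can be sketched briefly. The patent does not specify hash functions, vector size, or bucket parameters, so the salted SHA-256 hashing, the default sizes, and the injectable `clock` argument below are assumptions chosen to keep the example small and testable.

```python
import hashlib
import time

class BloomFilter:
    """Bit vector plus k hash functions; may report false positives,
    never false negatives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-256 digests (assumption).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

class TokenBucket:
    """Tokens refill at `rate` per second up to `capacity`; each request
    spends one token or is refused."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A scheduler would check `url in seen` before enqueueing and call `try_acquire()` before each fetch. Setting `capacity` to 1 permits no bursts, which approximates the uniform sending rate the claim describes.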
4. The distributed network crawler system based on a visualization script editor according to claim 1, characterized in that: the webpage capture module includes a proxy access module and a browser simulation module; the proxy access module, according to user configuration information, accesses the specified URL link through a preset IP proxy, preventing the IP address of the server hosting the webpage capture module from being blocked by the target website because of excessive access; the browser simulation module uses the open-source WebKit browser engine to parse the target website and can execute the JavaScript code on the page to generate the complete page of the target website.
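The proxy access part of claim 4 could look roughly like the sketch below, which uses only the Python standard library. The proxy addresses are placeholders, and rotating through them round-robin is an assumption: the claim only states that a preset IP proxy is chosen according to user configuration. The WebKit-based rendering is not covered here.

```python
import itertools
import urllib.request

# Hypothetical user-configured proxy list.
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]
_proxy_cycle = itertools.cycle(PROXIES)

def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Create an opener that sends HTTP and HTTPS requests via one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

def next_opener():
    """Pick the next proxy in round-robin order and return it with its opener."""
    proxy = next(_proxy_cycle)
    return proxy, build_proxy_opener(proxy)
```

A fetch would then call `opener.open(url, timeout=...)` on the returned opener; no request is made in the sketch itself.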
5. The distributed network crawler system based on a visualization script editor according to claim 1, characterized in that: the execution chain comprises several sub-parameters, and each sub-parameter can be one of several kinds, including: a lower-layer URL link selection rule, a metadata selection marker, or script code executable by the system.
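One way to picture the execution chain of claim 5 is as an ordered list of typed steps. The class and field names below (`LinkRule`, `MetadataField`, `CustomCode`, `syntax`) are hypothetical; the patent names only the three kinds of sub-parameter, not a data layout.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class LinkRule:
    """Lower-layer URL link selection rule."""
    selector: str
    syntax: str = "css"          # "css" or "xpath"

@dataclass
class MetadataField:
    """Metadata selection marker: a named field and where to find it."""
    name: str
    selector: str
    syntax: str = "css"

@dataclass
class CustomCode:
    """Script code executable by the system, supplied by an advanced user."""
    source: str

SubParameter = Union[LinkRule, MetadataField, CustomCode]

@dataclass
class ExecutionChain:
    start_url: str
    steps: List[SubParameter] = field(default_factory=list)

    def kinds(self) -> List[str]:
        """List the kind of each step, in order."""
        return [type(step).__name__ for step in self.steps]

chain = ExecutionChain(
    "http://example.com/news",
    [LinkRule("ul.news a"), MetadataField("title", "//h1", "xpath")],
)
```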
6. The distributed network crawler system based on a visualization script editor according to claim 1, characterized in that: the visualization script editor is implemented by the following flow:
Step 1: the target website's URL link address is entered in the visualization script editor interface;
Step 2: the visualization script editor presents the web page content of the target website's URL link;
Step 3: if there is no need to enter a lower-layer URL link of this web page, go to Step 5; if a lower-layer URL link needs to be entered, go to Step 4;
Step 4: the user selects the blocks containing lower-layer URL links; the visualization script editor records the positions of these blocks and stores them in an execution chain, with all position information expressed in CSS or XPath syntax; return to Step 3;
Step 5: the user selects the blocks whose content needs to be crawled, and these are stored in an execution chain;
Step 6: the user confirms that editing is complete;
Step 7: the visualization script editor passes the recorded execution chain to the script generator, which generates the corresponding crawl script for the target website; in addition, an extra interface is provided for advanced users, who can write code compatible with the system and embed it directly into the crawl script;
Step 8: the script is stored in the database.
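Steps 4 to 8 above can be sketched as recording selectors, serializing them into a script, and storing that script keyed by domain. Everything concrete here is an assumption for illustration: the JSON script format, the dictionary keys, the `scripts` table, and the use of an in-memory SQLite database in place of the system's real database.

```python
import json
import sqlite3

def generate_script(start_url: str, chain: list) -> str:
    """Serialize the recorded execution chain into a declarative crawl
    script (JSON in this sketch)."""
    return json.dumps({"start_url": start_url, "steps": chain},
                      ensure_ascii=False, sort_keys=True)

def store_script(conn: sqlite3.Connection, domain: str, script: str) -> None:
    """Step 8: save the script under the target site's domain."""
    conn.execute("CREATE TABLE IF NOT EXISTS scripts "
                 "(domain TEXT PRIMARY KEY, body TEXT)")
    conn.execute("INSERT OR REPLACE INTO scripts (domain, body) VALUES (?, ?)",
                 (domain, script))
    conn.commit()

def load_script(conn: sqlite3.Connection, domain: str):
    """Look up the script for a domain; None means no script matches."""
    row = conn.execute("SELECT body FROM scripts WHERE domain = ?",
                       (domain,)).fetchone()
    return json.loads(row[0]) if row else None

# Steps 4 and 5: positions recorded by the editor in CSS or XPath syntax.
chain = [
    {"kind": "links", "syntax": "css", "selector": "ul.news a"},
    {"kind": "field", "name": "title", "syntax": "xpath", "selector": "//h1/text()"},
]
conn = sqlite3.connect(":memory:")
store_script(conn, "example.com",
             generate_script("http://example.com/news", chain))
```

Keying scripts by domain mirrors the lookup described in claim 1, where a page whose domain matches no stored script is handed to the text extraction module.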
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510713985.7A CN105243159B (en) | 2015-10-28 | 2015-10-28 | A kind of distributed network crawler system based on visualization script editing machine |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105243159A (en) | 2016-01-13 |
CN105243159B (en) | 2019-06-25 |
Family
ID=55040807
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510713985.7A Active CN105243159B (en) | 2015-10-28 | 2015-10-28 | A kind of distributed network crawler system based on visualization script editing machine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105243159B (en) |
Families Citing this family (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10701087B2 (en) * | 2015-11-02 | 2020-06-30 | Nippon Telegraph And Telephone Corporation | Analysis apparatus, analysis method, and analysis program |
CN107180050A (en) * | 2016-03-11 | 2017-09-19 | 精硕科技(北京)股份有限公司 | A kind of data grabber system and method |
CN106886547A (en) * | 2016-07-13 | 2017-06-23 | 阿里巴巴集团控股有限公司 | A kind of scenario generation method and device |
CN106168985A (en) * | 2016-08-26 | 2016-11-30 | 南京车易淘网络信息技术有限公司 | A kind of can the reptile method of fast distributed deployment |
CN108228614B (en) * | 2016-12-14 | 2022-03-18 | 北京国双科技有限公司 | Method and device for detecting webpage broken link |
CN106933973A (en) * | 2017-02-14 | 2017-07-07 | 广州优亿信息科技有限公司 | A kind of visual network reptile method |
CN106980687B (en) * | 2017-03-31 | 2020-05-22 | 北京奇艺世纪科技有限公司 | Resource downloading system, method and crawler downloading system |
CN107103242B (en) * | 2017-05-11 | 2020-07-17 | 北京安赛创想科技有限公司 | Data acquisition method and device |
CN107317724B (en) * | 2017-06-06 | 2020-12-11 | 中证信用增进股份有限公司 | Data acquisition system and method based on cloud computing technology |
CN110020066B (en) * | 2017-07-31 | 2021-09-07 | 北京国双科技有限公司 | Method and device for annotating tasks to crawler platform |
CN107870965A (en) * | 2017-08-11 | 2018-04-03 | 成都萌想科技有限责任公司 | One kind visualization data collecting system |
CN107577788B (en) * | 2017-09-15 | 2021-12-31 | 广东技术师范大学 | E-commerce website topic crawler method for automatically structuring data |
CN108108440A (en) * | 2017-12-21 | 2018-06-01 | 北京慧数科技有限公司 | The acquisition method of proxy server and internet data |
CN108549678B (en) * | 2018-04-02 | 2020-06-19 | 北京今朝在线科技有限公司 | Information acquisition system |
CN109285046A (en) * | 2018-08-10 | 2019-01-29 | 浙江工业大学 | A kind of electric business big data acquisition system based on business plug-in unit |
CN108875091B (en) * | 2018-08-14 | 2020-06-02 | 杭州费尔斯通科技有限公司 | Distributed web crawler system with unified management |
CN109101636A (en) * | 2018-08-16 | 2018-12-28 | 成都市映潮科技股份有限公司 | A kind of method, apparatus and system carrying out data acquisition in cloud by visual configuration |
CN109284430A (en) * | 2018-09-07 | 2019-01-29 | 杭州艾塔科技有限公司 | Visualization subject web page content based on distributed structure/architecture crawls system and method |
CN109522466B (en) * | 2018-10-20 | 2023-04-07 | 河南工程学院 | Distributed crawler system |
CN109783715A (en) * | 2019-01-08 | 2019-05-21 | 鑫涌算力信息科技(上海)有限公司 | Network crawler system and method |
CN109614539A (en) * | 2019-01-16 | 2019-04-12 | 重庆金融资产交易所有限责任公司 | Data grab method, device and computer readable storage medium |
CN109948026A (en) * | 2019-03-28 | 2019-06-28 | 深信服科技股份有限公司 | A kind of web data crawling method, device, equipment and medium |
CN110807137A (en) * | 2019-04-11 | 2020-02-18 | 上海丛云信息科技有限公司 | Distributed big data acquisition implementation method |
CN110020062B (en) * | 2019-04-12 | 2021-09-24 | 北京邮电大学 | Customizable web crawler method and system |
CN110297962B (en) * | 2019-06-28 | 2021-08-24 | 北京金山安全软件有限公司 | Website resource crawling method, device, system and computer equipment |
CN110457556B (en) * | 2019-07-04 | 2023-11-14 | 重庆金融资产交易所有限责任公司 | Distributed crawler system architecture, method for crawling data and computer equipment |
CN110413276B (en) * | 2019-07-31 | 2024-04-09 | 网易(杭州)网络有限公司 | Parameter editing method and device, electronic equipment and storage medium |
CN110851681B (en) * | 2019-10-12 | 2024-07-09 | 平安科技(深圳)有限公司 | Crawler processing method, crawler processing device, server and computer readable storage medium |
CN112783615B (en) * | 2019-11-08 | 2024-03-01 | 北京沃东天骏信息技术有限公司 | Data processing task cleaning method and device |
CN111045659A (en) * | 2019-11-11 | 2020-04-21 | 国家计算机网络与信息安全管理中心 | Method and system for collecting project list of Internet financial webpage |
CN111178057B (en) * | 2020-01-02 | 2024-01-30 | 大汉软件股份有限公司 | Content analysis and extraction system for government electronic documents |
CN111310002B (en) * | 2020-04-17 | 2023-04-07 | 西安热工研究院有限公司 | General crawler system based on distributor and configuration table combination |
CN111651656B (en) * | 2020-06-02 | 2023-02-24 | 重庆邮电大学 | Method and system for dynamic webpage crawler based on agent mode |
CN112100061A (en) * | 2020-08-28 | 2020-12-18 | 广州探迹科技有限公司 | Visual crawler code compiling and debugging method |
CN112256636A (en) * | 2020-11-10 | 2021-01-22 | 国网湖南省电力有限公司 | Data acquisition system for mobile application APP |
CN112364226A (en) * | 2020-11-12 | 2021-02-12 | 江苏易启策网络科技有限公司 | Interactive information acquisition method and system based on dynamic content analysis |
CN112434205A (en) * | 2020-11-30 | 2021-03-02 | 北京秒针人工智能科技有限公司 | Data integration capturing method and system based on data site and computer equipment |
CN112667873A (en) * | 2020-12-16 | 2021-04-16 | 北京华如慧云数据科技有限公司 | Crawler system and method suitable for general data acquisition of most websites |
CN112487269B (en) * | 2020-12-22 | 2023-10-24 | 安徽商信政通信息技术股份有限公司 | Method and device for detecting automation script of crawler |
CN112328238B (en) * | 2021-01-05 | 2021-03-30 | 深圳点猫科技有限公司 | Building block code execution control method, system and storage medium |
CN112818201A (en) * | 2021-02-07 | 2021-05-18 | 四川封面传媒有限责任公司 | Network data acquisition method and device, computer equipment and storage medium |
CN113742550B (en) * | 2021-08-20 | 2024-04-19 | 广州市易工品科技有限公司 | Browser-based data acquisition method, device and system |
CN113656674B (en) * | 2021-08-30 | 2023-06-27 | 山谷网安科技股份有限公司 | Automatic processing method and device for click type hyperlink in website crawler |
CN113934912A (en) * | 2021-11-11 | 2022-01-14 | 北京搜房科技发展有限公司 | Data crawling method and device, storage medium and electronic equipment |
CN113918793A (en) * | 2021-12-10 | 2022-01-11 | 江苏宝和数据股份有限公司 | Multi-source scientific and creative resource data acquisition method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001033345A1 (en) * | 1999-11-02 | 2001-05-10 | Alta Vista Company | System and method for enforcing politeness while scheduling downloads in a web crawler |
CN101089856A (en) * | 2007-07-20 | 2007-12-19 | 李沫南 | Method for abstracting network data and web reptile system |
CN102982161A (en) * | 2012-12-05 | 2013-03-20 | 北京奇虎科技有限公司 | Method and device for acquiring webpage information |
CN103970788A (en) * | 2013-02-01 | 2014-08-06 | 北京英富森信息技术有限公司 | Webpage-crawling-based crawler technology |
CN104765592A (en) * | 2014-01-03 | 2015-07-08 | 任子行网络技术股份有限公司 | Plugin management method and device facing web page acquisition task |
Non-Patent Citations (1)
Title |
---|
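Gao Long; "Research and Design of a General-Purpose Crawler System in Search Engines" (搜索引擎中通用爬虫系统的研究与设计); China Master's Theses Full-text Database, Information Science and Technology Series; 2013-09-15 (No. 09); pp. I138-518 |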
Also Published As
Publication number | Publication date |
---|---|
CN105243159A (en) | 2016-01-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105243159B (en) | A kind of distributed network crawler system based on visualization script editing machine | |
CN101222349B (en) | Method and system for collecting web user action and performance data | |
CN107317724B (en) | Data acquisition system and method based on cloud computing technology | |
CN101651707B (en) | Method for automatically acquiring user behavior log of network | |
Kung et al. | An object-oriented web test model for testing web applications | |
CN106021257B (en) | A kind of crawler capturing data method, apparatus and system for supporting online programming | |
CN107391775A (en) | A kind of general web crawlers model implementation method and system | |
US20120265824A1 (en) | Method and system for configuration-controlled instrumentation of application programs | |
CN104077402B (en) | Data processing method and data handling system | |
CN107071009A (en) | A kind of distributed big data crawler system of load balancing | |
CN106096056A (en) | A kind of based on distributed public sentiment data real-time collecting method and system | |
CN107885777A (en) | A kind of control method and system of the crawl web data based on collaborative reptile | |
CN105260388A (en) | Optimization method of distributed vertical crawler service system | |
CN107729564A (en) | A kind of distributed focused web crawler web page crawl method and system | |
CN102262635A (en) | Page crawler system and page crawler method | |
CN107766509A (en) | A kind of method and apparatus of webpage static backup | |
CN109359231A (en) | A kind of information crawler method, server and the storage medium of distributed network crawler | |
CN107239563A (en) | Public feelings information dynamic monitoring and controlling method | |
CN109840298A (en) | The multi information source acquisition method and system of large scale network data | |
US20130036108A1 (en) | Method and system for assisting users with operating network devices | |
CN112307292A (en) | Information processing method and system based on advanced persistent threat attack | |
CN112395485A (en) | Policy big data mining method and device, computer equipment and storage medium | |
CN103905434A (en) | Method and device for processing network data | |
Smith | Go Web Scraping Quick Start Guide: Implement the power of Go to scrape and crawl data from the web | |
CN108205548A (en) | A kind of Web Spider structure and its method of work based on agriculture webpage information acquisition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||