CN108536788A - A data capture method and system based on a distributed crawler - Google Patents
- Publication number: CN108536788A
- Application number: CN201810274062.XA
- Authority: CN (China)
- Prior art keywords: webpage, data, information, module, parsing
- Legal status: Withdrawn (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The present invention discloses a data capture method based on a distributed crawler, comprising the following steps: dividing webpages according to the attributes of different webpages; determining at least one crawler capture parameter for each division of webpages; capturing current webpage information according to the determined crawler capture parameter, and parsing the captured webpage information to obtain parsed data; compiling the obtained parsed data; summarizing the compiled webpage data information and feeding it back to a data storage module; and storing the summarized webpage data information. By dividing webpages and determining the crawler capture parameters according to that division, the method and system capture webpage information through the crawler capture parameters, and a jump link connects to other webpages. This enables fast capture of webpage information, ensures that data acquisition completes in a relatively short time, improves acquisition efficiency, and meets time requirements.
Description
Technical field
The invention belongs to the technical field of network data collection and relates to a data capture method and system based on a distributed crawler.
Background
With the rapid development of the network, the World Wide Web has become the carrier of a vast amount of information, and extracting and using that information efficiently has become a major challenge. Search engines, such as the traditional general-purpose search engines AltaVista, Yahoo, and Google, are tools that help people retrieve information and have become the entry point and guide for users accessing the World Wide Web. These general-purpose search engines have certain limitations, however, and web crawlers emerged in response. A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically captures World Wide Web information according to certain rules. Other, less common names include ant, automatic indexer, emulator, and worm. A web crawler is a program that automatically extracts webpages; it downloads webpages from the World Wide Web for a search engine and is an important component of the search engine. A traditional crawler starts from the URLs of one or several initial pages and obtains the URLs on those pages; while capturing webpages, it continually extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met.
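The traditional crawl loop described above can be sketched as a breadth-first traversal over a URL queue. This is a minimal illustration, not the method claimed in this document: the link graph, the `crawl` function, and the `max_pages` stop condition are all assumptions introduced here, and no network access is performed.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs found on that page.
LINK_GRAPH = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": ["http://example.com/c"],
    "http://example.com/c": [],
}


def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl; returns the URLs in the order they were visited."""
    seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    frontier = deque(seeds)                 # queue of URLs waiting to be captured
    seen = set(seeds)                       # every URL ever queued
    visited = []
    # Stop condition (an assumption): queue is empty or the page budget is used up.
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:            # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["http://example.com/"], lambda url: LINK_GRAPH.get(url, []))
print(order)
```

The `seen` set is what keeps the loop from revisiting pages that link back to each other, which is one simple guard against the endless crawl loop that the background goes on to mention.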
When a crawler captures data from a large number of websites, the webpages are usually of different types. This makes capture difficult and requires different crawler capture parameters for different webpages, which slows capture and can even trap the crawler in an endless crawl loop, adding to the computer's operating load. To make webpage capture faster and improve crawler efficiency, a data capture method and system based on a distributed crawler is now designed.
Summary of the invention
The purpose of the present invention is to provide a data capture method and system based on a distributed crawler. Webpages are divided, and crawler capture parameters are determined according to that division; webpage information is captured through the crawler capture parameters, and a jump link connects to other webpages. This enables fast capture of webpage information and solves the low capture efficiency of existing crawlers.
The purpose of the present invention can be achieved through the following technical solutions:
A data capture method based on a distributed crawler comprises the following steps:
S1, dividing webpages according to the attributes of different webpages;
S2, determining at least one crawler capture parameter for each division of webpages;
S3, capturing current webpage information according to the determined crawler capture parameter, and parsing the captured webpage information to obtain parsed data;
S4, compiling the obtained parsed data to obtain webpage data information that the user can recognize;
S5, summarizing the compiled webpage data information and feeding it back to the data storage module;
S6, storing the summarized webpage data information.
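Steps S1 to S6 can be read as a linear pipeline. The sketch below shows only the orchestration, with each stage passed in as a callable. Every name in it (`run_pipeline`, the stage arguments, the toy pages and field lists) is hypothetical, since the text does not specify how any individual step is implemented.

```python
def run_pipeline(pages, classify, choose_params, capture_and_parse, compile_record, store):
    """Run steps S1 to S6 over an iterable of pages and return the summary."""
    summary = {}
    for page in pages:
        category = classify(page)                        # S1: divide by attribute
        params = choose_params(category)                 # S2: parameters per division
        parsed = capture_and_parse(page, params)         # S3: capture and parse
        record = compile_record(parsed)                  # S4: make it user-readable
        summary.setdefault(category, []).append(record)  # S5: summarize
    store.update(summary)                                # S6: hand over to storage
    return summary


# Toy stand-ins (all hypothetical) so the sketch runs without a network.
pages = [
    {"attribute": "news", "raw": {"title": " Flood warning ", "views": 10}},
    {"attribute": "shop", "raw": {"title": "Blue kettle", "price": "19.90"}},
    {"attribute": "news", "raw": {"title": "Election recap", "views": 7}},
]
fields_by_category = {"news": ("title",), "shop": ("title", "price")}
storage = {}

summary = run_pipeline(
    pages,
    classify=lambda page: page["attribute"],
    choose_params=lambda category: fields_by_category[category],
    capture_and_parse=lambda page, fields: {f: page["raw"][f] for f in fields},
    compile_record=lambda parsed: {k: str(v).strip() for k, v in parsed.items()},
    store=storage,
)
print(storage)
```

Passing the stages in as callables keeps the step order visible while leaving each stage free to be replaced by a real classifier, fetcher, or database writer.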
A data acquisition system based on a distributed crawler comprises a webpage classification module, a strategy capture module, a parsing module, a compiling module, a data feedback module, and a data storage module;
The webpage classification module is used to divide webpages by attribute, grouping webpages with the same attribute into the same class;
The strategy capture module determines at least one crawler capture parameter according to the different divisions of webpages;
The parsing module captures webpage information from the current webpage according to the crawler capture parameter, and parses the captured webpage information to obtain parsed data;
The compiling module is used to compile the parsed data obtained by the parsing module according to different parsing rules, obtain webpage data information that the user can recognize, and send the compiled webpage data information to the data feedback module;
The data feedback module summarizes the received compiled webpage data information and feeds the summarized webpage data information back to the data storage module;
The data storage module is used to store the summarized webpage data information fed back by the data feedback module.
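The text does not say what a crawler capture parameter contains. As a rough illustration of how a strategy capture module might map each webpage class to its own parameters, the sketch below assumes three hypothetical parameters (request delay, maximum link depth, and whether to follow links) and three hypothetical class names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlParams:
    """Hypothetical capture parameters; the source does not enumerate them."""
    delay_seconds: float = 1.0   # pause between requests
    max_depth: int = 2           # how many link hops to follow
    follow_links: bool = True    # whether jump links are followed at all


DEFAULT_PARAMS = CrawlParams()

# One parameter set per webpage class (class names are illustrative only).
PARAMS_BY_CATEGORY = {
    "news": CrawlParams(delay_seconds=0.5, max_depth=3),
    "forum": CrawlParams(delay_seconds=2.0, max_depth=5),
    "static": CrawlParams(max_depth=0, follow_links=False),
}


def params_for(category):
    """Return the parameters for a class, falling back to the defaults."""
    return PARAMS_BY_CATEGORY.get(category, DEFAULT_PARAMS)


print(params_for("news"))
print(params_for("unknown"))
```

Keeping the parameters in an immutable dataclass means a class's settings cannot be changed by accident while a crawl is running.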
Further, the strategy capture module includes at least one jump link for connecting to other webpages.
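One plausible reading of a jump link is a hyperlink on the captured page that leads to another page to capture. Under that assumption, the sketch below collects such links with Python's standard `html.parser` and resolves them against the page URL with `urllib.parse.urljoin`. The `LinkExtractor` class and the sample HTML are illustrative only.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute, de-duplicated link targets from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links and drop the #fragment part.
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            if absolute not in self.links:
                self.links.append(absolute)


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


sample = (
    '<a href="/list?page=2">next</a> '
    '<a href="detail/42.html#top">item</a> '
    "<a>no href</a>"
)
print(extract_links("http://example.com/news/", sample))
```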
Further, the parsing module includes a type judging unit and a parsing unit. The type judging unit is used to judge the type of the acquired webpage; the parsing unit parses the data information on the webpage according to the webpage's type.
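The split between a type judging unit and a parsing unit amounts to a dispatch table: judge the page type first, then apply the rule registered for that type. The embodiment names three page types (home page, pagination page, and detail page) but does not say how the type is judged, so the URL heuristic and the regular-expression rules below are assumptions made purely for illustration.

```python
import re
from urllib.parse import parse_qs, urlparse


def _first(pattern, html):
    """Return the first regex capture in html, stripped, or None."""
    match = re.search(pattern, html, re.S)
    return match.group(1).strip() if match else None


def judge_page_type(url):
    """Type judging unit (hypothetical URL heuristic)."""
    parts = urlparse(url)
    if parts.path in ("", "/") and not parts.query:
        return "home"
    if "page" in parse_qs(parts.query):
        return "pagination"
    return "detail"


def parse_home(html):
    return {"type": "home", "title": _first(r"<title>(.*?)</title>", html)}


def parse_pagination(html):
    return {"type": "pagination", "items": re.findall(r"<li>(.*?)</li>", html)}


def parse_detail(html):
    return {
        "type": "detail",
        "heading": _first(r"<h1>(.*?)</h1>", html),
        "body": _first(r"<p>(.*?)</p>", html),
    }


# Parsing unit: one rule per page type.
PARSE_RULES = {"home": parse_home, "pagination": parse_pagination, "detail": parse_detail}


def parse_page(url, html):
    return PARSE_RULES[judge_page_type(url)](html)


print(parse_page("http://example.com/item/42.html", "<h1> Kettle </h1><p>Blue, 1.7 L</p>"))
```

Regular expressions are used only to keep the sketch short; a real parsing rule would more likely use a proper HTML parser.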
Beneficial effects of the present invention:
In the data capture method and system based on a distributed crawler of the present invention, webpages are divided, and crawler capture parameters are determined according to that division. Webpage information is captured through the crawler capture parameters, and the jump link connects to other webpages. This enables fast capture of webpage information, ensures that data acquisition completes in a relatively short time, improves acquisition efficiency, and meets time requirements.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the data capture method based on a distributed crawler of the present invention;
Fig. 2 is a schematic diagram of the data acquisition system based on a distributed crawler of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, the present invention is a data capture method based on a distributed crawler, comprising the following steps:
S1, dividing webpages according to the attributes of different webpages;
S2, determining at least one crawler capture parameter for each division of webpages;
S3, capturing current webpage information according to the determined crawler capture parameter, and parsing the captured webpage information to obtain parsed data;
S4, compiling the obtained parsed data to obtain webpage data information that the user can recognize;
S5, summarizing the compiled webpage data information and feeding it back to the data storage module;
S6, storing the summarized webpage data information.
As shown in Fig. 2, a data acquisition system based on a distributed crawler comprises a webpage classification module, a strategy capture module, a parsing module, a compiling module, a data feedback module, and a data storage module;
The webpage classification module is used to divide webpages by attribute, grouping webpages with the same attribute into the same class;
The strategy capture module determines at least one crawler capture parameter according to the different divisions of webpages, wherein the strategy capture module includes at least one jump link for connecting to other webpages, so that the crawler can jump between webpages;
The parsing module captures webpage information from the current webpage according to the crawler capture parameter, and parses the captured webpage information to obtain parsed data. The parsing module includes a type judging unit and a parsing unit, wherein the type judging unit is used to judge the type of the acquired webpage, the webpage types being home page, pagination page, and detail page; the parsing unit parses the data information on the webpage according to the webpage's type;
If the current webpage is a home page, the data information on the webpage is parsed according to the rule for home-page parsing to obtain parsed data; if the current webpage is a pagination page, the data information on the webpage is parsed according to the rule for pagination-page parsing to obtain parsed data; if the current webpage is a detail page, the data information in the four main columns of the webpage is parsed according to the rule for detail-page parsing to obtain parsed data;
The compiling module is used to compile the parsed data obtained by the parsing module according to different parsing rules, obtain webpage data information that the user can recognize, and send the compiled webpage data information to the data feedback module;
The data feedback module summarizes the received compiled webpage data information and feeds the summarized webpage data information back to the data storage module;
The data storage module is used to store the summarized webpage data information fed back by the data feedback module.
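The hand-off from the compiling module through the data feedback module to the data storage module can be sketched with an in-memory SQLite database. The text does not name a storage backend, a schema, or a compiling rule, so the `pages` table, its columns, and the whitespace clean-up below are all assumptions chosen only to make the flow concrete.

```python
import sqlite3


def compile_record(parsed):
    """Compiling step (assumed): collapse whitespace and drop empty fields."""
    return {k: " ".join(str(v).split()) for k, v in parsed.items() if v not in (None, "")}


def aggregate(records):
    """Feedback step (assumed): summarize compiled records into storage rows."""
    return [(r.get("url", ""), r.get("title", ""), r.get("body", "")) for r in records]


def store(conn, rows):
    """Storage step: write the summarized rows, replacing rows with the same URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", rows)
    conn.commit()


# Hypothetical output of the parsing module.
parsed_batch = [
    {"url": "http://example.com/item/1", "title": "  First \n item ", "body": "Hello"},
    {"url": "http://example.com/item/2", "title": "Second item", "body": None},
]

conn = sqlite3.connect(":memory:")
store(conn, aggregate(compile_record(p) for p in parsed_batch))
stored = conn.execute("SELECT url, title, body FROM pages ORDER BY url").fetchall()
print(stored)
```

Using the URL as the primary key with `INSERT OR REPLACE` means a page captured twice overwrites its earlier row instead of producing a duplicate.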
In the data capture method and system based on a distributed crawler of the present invention, webpages are divided, and crawler capture parameters are determined according to that division. Webpage information is captured through the crawler capture parameters, and the jump link connects to other webpages. This enables fast capture of webpage information, ensures that data acquisition completes in a relatively short time, improves acquisition efficiency, and meets time requirements.
The above is merely an example and description of the concept of the present invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute them in a similar manner; provided these do not depart from the concept of the invention or exceed the scope defined by the claims, they fall within the protection scope of the present invention.
Claims (4)
1. A data capture method based on a distributed crawler, characterized by comprising the following steps:
S1, dividing webpages according to the attributes of different webpages;
S2, determining at least one crawler capture parameter for each division of webpages;
S3, capturing current webpage information according to the determined crawler capture parameter, and parsing the captured webpage information to obtain parsed data;
S4, compiling the obtained parsed data to obtain webpage data information that the user can recognize;
S5, summarizing the compiled webpage data information and feeding it back to the data storage module;
S6, storing the summarized webpage data information.
2. A data acquisition system based on a distributed crawler, characterized by comprising a webpage classification module, a strategy capture module, a parsing module, a compiling module, a data feedback module, and a data storage module;
The webpage classification module is used to divide webpages by attribute, grouping webpages with the same attribute into the same class;
The strategy capture module determines at least one crawler capture parameter according to the different divisions of webpages;
The parsing module captures webpage information from the current webpage according to the crawler capture parameter, and parses the captured webpage information to obtain parsed data;
The compiling module is used to compile the parsed data obtained by the parsing module according to different parsing rules, obtain webpage data information that the user can recognize, and send the compiled webpage data information to the data feedback module;
The data feedback module summarizes the received compiled webpage data information and feeds the summarized webpage data information back to the data storage module;
The data storage module is used to store the summarized webpage data information fed back by the data feedback module.
3. The data acquisition system based on a distributed crawler according to claim 2, characterized in that the strategy capture module includes at least one jump link for connecting to other webpages.
4. The data acquisition system based on a distributed crawler according to claim 2, characterized in that the parsing module includes a type judging unit and a parsing unit; the type judging unit is used to judge the type of the acquired webpage; the parsing unit parses the data information on the webpage according to the webpage's type.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810274062.XA CN108536788A (en) | 2018-03-29 | 2018-03-29 | A data capture method and system based on a distributed crawler |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108536788A (en) | 2018-09-14 |
Family ID: 63482598
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810274062.XA (Withdrawn) CN108536788A (en) | 2018-03-29 | 2018-03-29 | A data capture method and system based on a distributed crawler |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108536788A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020194161A1 (en) * | 2001-04-12 | 2002-12-19 | Mcnamee J. Paul | Directed web crawler with machine learning |
CN101937469A (en) * | 2010-09-15 | 2011-01-05 | 深圳市任子行网络技术股份有限公司 | Information capture method of video website |
CN104252530A (en) * | 2014-09-10 | 2014-12-31 | 北京京东尚科信息技术有限公司 | Single-computer crawler grabbing method and system |
CN104951539A (en) * | 2015-06-19 | 2015-09-30 | 成都艾尔普科技有限责任公司 | Internet data center harmful information monitoring system |
CN106126747A (en) * | 2016-07-14 | 2016-11-16 | 北京邮电大学 | Crawler-based data capture method and device |
Non-Patent Citations (1)
Title |
---|
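Yue Zonglin (岳宗林): "Public opinion early-warning analysis platform based on a vertical search engine", China Master's Theses Full-text Database, Information Science and Technology * |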
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102930059B (en) | Method for designing focused crawler | |
CN106021257B (en) | A kind of crawler capturing data method, apparatus and system for supporting online programming | |
AU2009276354B2 (en) | Providing posts to discussion threads in response to a search query | |
CN107066576B (en) | A kind of big data web crawlers paging selection method and system | |
CN104516982A (en) | Method and system for extracting Web information based on Nutch | |
CN105243159A (en) | Visual script editor-based distributed web crawler system | |
CN107943838B (en) | Method and system for automatically acquiring xpath generated crawler script | |
CN102567407B (en) | Method and system for collecting forum reply increment | |
CN101963965B (en) | Document indexing method, data query method and server based on search engine | |
CN102486799B (en) | World wide web (WWW) page processing method and device | |
CN104391978A (en) | Method and device for storing and processing web pages of browsers | |
CN104090976A (en) | Method and device for crawling webpages by search engine crawlers | |
CN103617213A (en) | Method and system for identifying newspage attributive characters | |
CN108733813A (en) | Information extracting method, system towards BBS forum Web pages contents and medium | |
CN104331438A (en) | Method and device for selectively extracting content of novel webpage | |
CN110377796A (en) | Text extracting method, device, equipment and storage medium based on dom tree | |
CN103970800A (en) | Method and system for extracting and processing webpage related keywords | |
CN104778232B (en) | Searching result optimizing method and device based on long query | |
CN104281629A (en) | Method and device for extracting picture from webpage and client equipment | |
CN106326236A (en) | Webpage content identification method and system | |
CN105930385A (en) | Data crawling method and system | |
CN107547944A (en) | Interface realizing method and device, set top box | |
CN106250456A (en) | Bid winning announcement extraction method and device | |
CN104331512A (en) | Automatic BBS (bulletin board system) page acquisition method | |
CN104361121B (en) | A kind of batch analytic method of WEB reporting systems formula |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20180914 |