CN108536788A - A data capture method and system based on a distributed crawler - Google Patents

A data capture method and system based on a distributed crawler

Info

Publication number
CN108536788A
Authority
CN
China
Prior art keywords
webpage
data
information
module
parsing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201810274062.XA
Other languages
Chinese (zh)
Inventor
王华伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Jun Machinery Technology Co Ltd
Original Assignee
Hefei Jun Machinery Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Jun Machinery Technology Co Ltd
Priority to CN201810274062.XA
Publication of CN108536788A
Legal status: Withdrawn

Abstract

The present invention discloses a data capture method based on a distributed crawler, comprising the following steps: dividing webpages according to the attributes of different webpages; determining at least one crawler capture parameter for each division of webpages; capturing current webpage information according to the determined crawler capture parameter, and parsing the captured webpage information to obtain parsed data; compiling the obtained parsed data; summarizing the compiled webpage data information and feeding it back to a data storage module; and storing the summarized webpage data information. In the data capture method and system based on a distributed crawler of the present invention, webpages are divided and crawler capture parameters are determined according to the division; webpage information can be captured using the crawler capture parameters, and a redirect link can connect to other webpages. This enables fast capture of webpage information, ensures that data acquisition is completed in a relatively short time, improves acquisition efficiency, and meets time requirements.

Description

A data capture method and system based on a distributed crawler
Technical field
The invention belongs to the technical field of network data collection and relates to a data capture method and system based on a distributed crawler.
Background art
With the rapid development of the network, the World Wide Web has become the carrier of a vast amount of information, and how to extract and use this information efficiently has become a major challenge. Search engines, such as the traditional general-purpose search engines AltaVista, Yahoo and Google, serve as tools that help people retrieve information and have become users' entrance and guide to the World Wide Web. However, these general-purpose search engines have certain limitations, and web crawlers arose in response. A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically captures World Wide Web information according to certain rules. Other, less commonly used names include ant, automatic indexer, emulator and worm. A web crawler is a program that automatically extracts webpages; it downloads webpages from the World Wide Web for a search engine and is an important component of a search engine. A traditional crawler starts from the URLs of one or several initial webpages, obtains the URLs on those pages, and, while capturing webpages, continuously extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met.
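The traditional crawl loop described above (seed URLs, a queue of newly found URLs, and a stop condition) can be sketched roughly as follows. This is a minimal illustration and not part of the patent: the `LINKS` table is a hypothetical in-memory stand-in for real webpages, and the stop condition is assumed to be a simple page budget.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs found on that page.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": ["http://example.com/c"],
    "http://example.com/c": [],
}

def crawl(seeds, max_pages):
    """Breadth-first crawl from the seed URLs; stop after max_pages pages."""
    queue = deque(seeds)
    seen = set(seeds)  # URLs already queued, so link cycles are not revisited
    order = []
    while queue and len(order) < max_pages:  # assumed stop condition: page budget
        url = queue.popleft()
        order.append(url)  # "capture" the current page
        for link in LINKS.get(url, []):  # extract new URLs from the current page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

The `seen` set is one common way to keep a crawler from revisiting pages that link back to each other.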
When a crawler captures data from a large number of websites, the webpages are usually of different types, which makes capture difficult, and different crawler capture parameters must be set for different webpages. This slows capture and can even trap the crawler in an endless capture loop, which adds to the computer's operating load. To capture webpages quickly and improve crawler efficiency, a data capture method and system based on a distributed crawler is now designed.
Summary of the invention
The purpose of the present invention is to provide a data capture method and system based on a distributed crawler. Webpages are divided, and crawler capture parameters are determined according to the division; webpage information can be captured using the crawler capture parameters, and a redirect link can connect to other webpages. This enables fast capture of webpage information and solves the problem of low capture efficiency in existing crawler capture processes.
The purpose of the present invention can be achieved through the following technical solutions:
A data capture method based on a distributed crawler comprises the following steps:
S1, dividing webpages according to the attributes of different webpages;
S2, determining at least one crawler capture parameter for each division of webpages;
S3, capturing current webpage information according to the determined crawler capture parameter, and parsing the captured webpage information to obtain parsed data;
S4, compiling the obtained parsed data to obtain webpage data information that the user can recognize;
S5, summarizing the compiled webpage data information and feeding it back to a data storage module;
S6, storing the summarized webpage data information.
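Steps S1 and S2 could look roughly like the sketch below. The patent does not say which webpage attribute is used for the division or what the capture parameters contain, so the host name as the attribute and the `delay_s` and `max_depth` parameters are assumptions made purely for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical capture parameters per page class (S2); the patent names none.
CAPTURE_PARAMS = {
    "news.example.com": {"delay_s": 1.0, "max_depth": 2},
    "shop.example.com": {"delay_s": 0.5, "max_depth": 3},
}
DEFAULT_PARAMS = {"delay_s": 2.0, "max_depth": 1}

def divide_pages(urls):
    """S1: group URLs by an attribute; the host name is the assumed attribute."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    return dict(groups)

def capture_params_for(page_class):
    """S2: look up the capture parameters for one class of pages."""
    return CAPTURE_PARAMS.get(page_class, DEFAULT_PARAMS)
```

Keeping the parameters in a table keyed by class means a new kind of page only needs a new table entry rather than new crawler code.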
A data acquisition system based on a distributed crawler comprises a webpage classification module, a strategy capture module, a parsing module, a collector, a data feedback module and a data storage module;
The webpage classification module is used to divide webpages by attribute, placing webpages with the same attribute in the same class;
The strategy capture module determines at least one crawler capture parameter according to the different divisions of webpages;
The parsing module captures webpage information from the current webpage according to the crawler capture parameter, and parses the captured webpage information to obtain parsed data;
The collector is used to compile the parsed data obtained by the parsing module according to different parsing rules, obtain webpage data information that the user can recognize, and send the compiled webpage data information to the data feedback module;
The data feedback module summarizes the received compiled webpage data information and feeds the summarized webpage data information back to the data storage module;
The data storage module is used to store the summarized webpage data information fed back by the data feedback module.
Further, the strategy capture module includes at least one redirect link for connecting to other webpages.
Further, the parsing module includes a type judging unit and a parsing unit; the type judging unit is used to judge the type of the acquired webpage, and the parsing unit parses the data information on the webpage according to the webpage's type.
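The split between a type judging unit and a parsing unit might be sketched as a classifier plus a dispatch table, using the three page types named in the embodiment (homepage, paging page, detail page). The URL conventions and the regular-expression rules below are hypothetical; the patent does not specify how a type is recognized or what each parsing rule extracts.

```python
import re
from urllib.parse import urlparse, parse_qs

def judge_type(url):
    """Type judging unit: classify a URL as 'home', 'paging' or 'detail'.
    The URL conventions used here are assumptions for illustration."""
    parsed = urlparse(url)
    if parsed.path in ("", "/") and not parsed.query:
        return "home"
    if "page" in parse_qs(parsed.query):
        return "paging"
    return "detail"

def parse_home(html):
    # Assumed homepage rule: collect outgoing links.
    return {"links": re.findall(r'href="([^"]+)"', html)}

def parse_paging(html):
    # Assumed paging rule: collect list items.
    return {"items": re.findall(r"<li>(.*?)</li>", html)}

def parse_detail(html):
    # Assumed detail-page rule: take the first <h1> as the title.
    match = re.search(r"<h1>(.*?)</h1>", html)
    return {"title": match.group(1) if match else None}

PARSERS = {"home": parse_home, "paging": parse_paging, "detail": parse_detail}

def parse_page(url, html):
    """Parsing unit: apply the rule that matches the judged page type."""
    page_type = judge_type(url)
    return page_type, PARSERS[page_type](html)
```

A production parser would more likely use an HTML parser than regular expressions; the regexes only keep the sketch short.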
Beneficial effects of the present invention:
In the data capture method and system based on a distributed crawler of the present invention, webpages are divided and crawler capture parameters are determined according to the division. Webpage information can be captured using the crawler capture parameters, and a redirect link can connect to other webpages. This enables fast capture of webpage information, ensures that data acquisition is completed in a relatively short time, improves acquisition efficiency, and meets time requirements.
Description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed to describe the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a data capture method based on a distributed crawler according to the present invention;
Fig. 2 is a schematic diagram of a data acquisition system based on a distributed crawler according to the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, the present invention is a data capture method based on a distributed crawler, comprising the following steps:
S1, dividing webpages according to the attributes of different webpages;
S2, determining at least one crawler capture parameter for each division of webpages;
S3, capturing current webpage information according to the determined crawler capture parameter, and parsing the captured webpage information to obtain parsed data;
S4, compiling the obtained parsed data to obtain webpage data information that the user can recognize;
S5, summarizing the compiled webpage data information and feeding it back to a data storage module;
S6, storing the summarized webpage data information.
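Read end to end, steps S1 to S6 form a small pipeline. The sketch below walks through all six steps on a hypothetical in-memory page table (`PAGES`) instead of real network requests; the page attributes, the `key=value` content format and the per-class `fields` parameter are assumptions for illustration, not details from the patent.

```python
import json

# Hypothetical page store standing in for the network: URL -> (attribute, raw content).
PAGES = {
    "http://example.com/n1": ("news", "title=Rain;body=Wet"),
    "http://example.com/n2": ("news", "title=Sun;body=Dry"),
    "http://example.com/p1": ("product", "title=Lamp;price=9"),
}

def run_pipeline(urls, storage):
    # S1: divide pages by attribute.
    classes = {}
    for url in urls:
        classes.setdefault(PAGES[url][0], []).append(url)
    summary = []
    for page_class, members in sorted(classes.items()):
        # S2: determine capture parameters for this class (assumed: fields to keep).
        params = {"fields": ["title"] if page_class == "news" else ["title", "price"]}
        for url in members:
            # S3: capture the page and parse it into key/value data.
            raw = PAGES[url][1]
            parsed = dict(pair.split("=", 1) for pair in raw.split(";"))
            # S4: compile the parsed data into a user-readable record.
            record = {"url": url, "class": page_class}
            record.update({field: parsed[field] for field in params["fields"]})
            summary.append(record)
    # S5: summarize and feed back; S6: store the summarized data.
    storage.append(json.dumps(summary, sort_keys=True))
    return summary
```

In a distributed deployment, each page class from S1 could plausibly be handed to a separate worker, with S5 merging the workers' results; the patent text does not describe that mechanism, so it is left out here.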
As shown in Fig. 2, a data acquisition system based on a distributed crawler comprises a webpage classification module, a strategy capture module, a parsing module, a collector, a data feedback module and a data storage module.
The webpage classification module is used to divide webpages by attribute, placing webpages with the same attribute in the same class.
The strategy capture module determines at least one crawler capture parameter according to the different divisions of webpages. The strategy capture module includes at least one redirect link for connecting to other webpages, so that the crawler can jump between webpages.
The parsing module captures webpage information from the current webpage according to the crawler capture parameter, and parses the captured webpage information to obtain parsed data. The parsing module includes a type judging unit and a parsing unit. The type judging unit is used to judge the type of the acquired webpage; the webpage types include homepage, paging page and detail page. The parsing unit parses the data information on the webpage according to the webpage's type.
If the current webpage type is homepage, the data information on the webpage is parsed according to the homepage parsing rules to obtain parsed data. If the current webpage type is paging page, the data information on the webpage is parsed according to the paging parsing rules to obtain parsed data. If the current webpage type is detail page, the data information of the four main rows of the webpage is parsed according to the detail page parsing rules to obtain parsed data.
The collector is used to compile the parsed data obtained by the parsing module according to different parsing rules, obtain webpage data information that the user can recognize, and send the compiled webpage data information to the data feedback module.
The data feedback module summarizes the received compiled webpage data information and feeds the summarized webpage data information back to the data storage module.
The data storage module is used to store the summarized webpage data information fed back by the data feedback module.
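The hand-off from the collector to the data feedback module and on to the data storage module could be sketched as three small functions. The row format, the de-duplication performed while summarizing, and the use of SQLite with a `pages` table are all assumptions; the patent does not name a record format or a storage backend.

```python
import sqlite3

def compile_record(page_type, parsed):
    """Collector: turn parsed data into a user-readable row (format assumed)."""
    text = ", ".join(f"{key}={value}" for key, value in sorted(parsed.items()))
    return (page_type, text)

def summarize(records):
    """Data feedback module: merge compiled records, dropping exact duplicates."""
    return sorted(set(records))

def store(conn, rows):
    """Data storage module: persist the summarized rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (page_type TEXT, content TEXT)")
    conn.executemany("INSERT INTO pages VALUES (?, ?)", rows)
    conn.commit()
```

A caller would open a connection with `sqlite3.connect(...)`, pass the compiled records through `summarize`, and hand the result to `store`.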
In the data capture method and system based on a distributed crawler of the present invention, webpages are divided and crawler capture parameters are determined according to the division. Webpage information can be captured using the crawler capture parameters, and a redirect link can connect to other webpages. This enables fast capture of webpage information, ensures that data acquisition is completed in a relatively short time, improves acquisition efficiency, and meets time requirements.
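One plausible reading of the redirect link is that the crawler extracts the links on a captured page and resolves them against the page's own URL, so it can move on to other webpages. The sketch below shows that reading using Python's standard `html.parser` and `urllib.parse.urljoin`; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without a target
                self.links.append(urljoin(self.base_url, href))

def redirect_links(base_url, html):
    """Return the absolute URLs that the page at base_url links to."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Resolving relative links with `urljoin` matters because pages commonly mix relative paths such as `item/1` with site-rooted paths such as `/about`.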
The above is merely an example and explanation of the concept of the present invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute them in a similar manner, and such changes fall within the protection scope of the present invention as long as they do not depart from the design of the invention or exceed the scope defined by the claims.

Claims (4)

1. A data capture method based on a distributed crawler, characterized by comprising the following steps:
S1, dividing webpages according to the attributes of different webpages;
S2, determining at least one crawler capture parameter for each division of webpages;
S3, capturing current webpage information according to the determined crawler capture parameter, and parsing the captured webpage information to obtain parsed data;
S4, compiling the obtained parsed data to obtain webpage data information that the user can recognize;
S5, summarizing the compiled webpage data information and feeding it back to a data storage module;
S6, storing the summarized webpage data information.
2. A data acquisition system based on a distributed crawler, characterized by comprising a webpage classification module, a strategy capture module, a parsing module, a collector, a data feedback module and a data storage module;
The webpage classification module is used to divide webpages by attribute, placing webpages with the same attribute in the same class;
The strategy capture module determines at least one crawler capture parameter according to the different divisions of webpages;
The parsing module captures webpage information from the current webpage according to the crawler capture parameter, and parses the captured webpage information to obtain parsed data;
The collector is used to compile the parsed data obtained by the parsing module according to different parsing rules, obtain webpage data information that the user can recognize, and send the compiled webpage data information to the data feedback module;
The data feedback module summarizes the received compiled webpage data information and feeds the summarized webpage data information back to the data storage module;
The data storage module is used to store the summarized webpage data information fed back by the data feedback module.
3. The data acquisition system based on a distributed crawler according to claim 2, characterized in that the strategy capture module includes at least one redirect link for connecting to other webpages.
4. The data acquisition system based on a distributed crawler according to claim 2, characterized in that the parsing module includes a type judging unit and a parsing unit; the type judging unit is used to judge the type of the acquired webpage, and the parsing unit parses the data information on the webpage according to the webpage's type.
CN201810274062.XA 2018-03-29 2018-03-29 A data capture method and system based on a distributed crawler Withdrawn CN108536788A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810274062.XA CN108536788A (en) 2018-03-29 2018-03-29 A data capture method and system based on a distributed crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810274062.XA CN108536788A (en) 2018-03-29 2018-03-29 A data capture method and system based on a distributed crawler

Publications (1)

Publication Number Publication Date
CN108536788A (en) 2018-09-14

Family

ID=63482598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810274062.XA Withdrawn CN108536788A (en) 2018-03-29 2018-03-29 A data capture method and system based on a distributed crawler

Country Status (1)

Country Link
CN (1) CN108536788A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020194161A1 (en) * 2001-04-12 2002-12-19 Mcnamee J. Paul Directed web crawler with machine learning
CN101937469A (en) * 2010-09-15 2011-01-05 深圳市任子行网络技术股份有限公司 Information capture method of video website
CN104252530A (en) * 2014-09-10 2014-12-31 北京京东尚科信息技术有限公司 Single-computer crawler grabbing method and system
CN104951539A (en) * 2015-06-19 2015-09-30 成都艾尔普科技有限责任公司 Internet data center harmful information monitoring system
CN106126747A (en) * 2016-07-14 2016-11-16 北京邮电大学 Data capture method based on reptile and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
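Yue Zonglin (岳宗林): "Public opinion early-warning analysis platform based on a vertical search engine" (基于垂直搜索引擎的舆情预警分析平台), China Excellent Master's Theses Full-text Database, Information Science and Technology Series *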

Similar Documents

Publication Publication Date Title
CN102930059B (en) Method for designing focused crawler
CN106021257B (en) A kind of crawler capturing data method, apparatus and system for supporting online programming
AU2009276354B2 (en) Providing posts to discussion threads in response to a search query
CN107066576B (en) A kind of big data web crawlers paging selection method and system
CN104516982A (en) Method and system for extracting Web information based on Nutch
CN105243159A (en) Visual script editor-based distributed web crawler system
CN107943838B (en) Method and system for automatically acquiring xpath generated crawler script
CN102567407B (en) Method and system for collecting forum reply increment
CN101963965B (en) Document indexing method, data query method and server based on search engine
CN102486799B (en) World wide web (WWW) page processing method and device
CN104391978A (en) Method and device for storing and processing web pages of browsers
CN104090976A (en) Method and device for crawling webpages by search engine crawlers
CN103617213A (en) Method and system for identifying newspage attributive characters
CN108733813A (en) Information extracting method, system towards BBS forum Web pages contents and medium
CN104331438A (en) Method and device for selectively extracting content of novel webpage
CN110377796A (en) Text extracting method, device, equipment and storage medium based on dom tree
CN103970800A (en) Method and system for extracting and processing webpage related keywords
CN104778232B (en) Searching result optimizing method and device based on long query
CN104281629A (en) Method and device for extracting picture from webpage and client equipment
CN106326236A (en) Webpage content identification method and system
CN105930385A (en) Data crawling method and system
CN107547944A (en) Interface realizing method and device, set top box
CN106250456A (en) Bid winning announcement extraction method and device
CN104331512A (en) Automatic BBS (bulletin board system) page acquisition method
CN104361121B (en) A kind of batch analytic method of WEB reporting systems formula

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20180914