CN108536788A

CN108536788A - Data capture method and system based on a distributed crawler

Info

Publication number: CN108536788A
Application number: CN201810274062.XA
Authority: CN
Inventors: 王华伟
Original assignee: Hefei Jun Machinery Technology Co Ltd
Current assignee: Hefei Jun Machinery Technology Co Ltd
Priority date: 2018-03-29
Filing date: 2018-03-29
Publication date: 2018-09-14

Abstract

The present invention discloses a data capture method based on a distributed crawler, comprising the following steps: dividing webpages according to the attributes of different webpages; determining at least one crawl parameter for each division of webpages; capturing the current webpage information according to the determined crawl parameter, and parsing the captured webpage information to obtain parsed data; compiling the obtained parsed data; summarizing the compiled webpage data and feeding it back to a data storage module; and storing the summarized webpage data. In the data capture method and system based on a distributed crawler of the present invention, webpages are divided and the crawl parameters are determined according to that division; webpage information can then be captured using the crawl parameters, and a jump link can connect to other webpages. This enables fast capture of webpage information, ensures that data collection is completed in a relatively short time, improves collection efficiency, and meets timing requirements.

Description

A data capture method and system based on a distributed crawler

Technical field

The invention belongs to the technical field of network data collection, and relates to a data capture method and system based on a distributed crawler.

Background art

With the rapid development of networks, the World Wide Web (WWW) has become the carrier of a vast amount of information, and how to extract and use this information efficiently has become a major challenge. Search engines, such as the traditional general-purpose search engines AltaVista, Yahoo and Google, serve as tools that help people retrieve information and have become users' entrance and guide to the WWW. However, these general-purpose search engines have certain limitations, and web crawlers came into being as a result. A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically captures web information according to certain rules. Other, less commonly used names include ant, automatic indexer, emulator and worm. A web crawler is a program that automatically extracts webpages; it downloads webpages from the WWW for a search engine and is an important component of search engines. A traditional crawler starts from the URLs of one or several initial webpages and obtains the URLs on those initial pages; while capturing webpages, it continually extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met.
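The traditional crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not part of the patent: the `crawl` function, the `get_links` callback, the `max_pages` stop condition and the in-memory `LINKS` graph are all hypothetical stand-ins for real page fetching. The `seen` set is an assumption added here so that pages linking back to each other are not queued twice.

```python
from collections import deque

def crawl(seed_urls, get_links, max_pages=100):
    """Breadth-first crawl: visit seeds, queue newly found URLs, stop at max_pages."""
    queue = deque(seed_urls)
    seen = set(seed_urls)      # URLs already queued, so cycles are not followed forever
    visited = []               # URLs in the order they were processed
    while queue and len(visited) < max_pages:   # stop condition
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):             # URLs extracted from the current page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical link graph standing in for real webpages.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["a"],
}

order = crawl(["a"], lambda url: LINKS.get(url, []))
print(order)
```

In a real crawler, `get_links` would download the page and extract its hyperlinks; here it only looks them up in a dictionary so the sketch runs without network access.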

When a crawler captures large amounts of website data, the webpages are usually of different types, which makes capture difficult. Different crawl parameters have to be set for different webpages, which slows capture and can even trap the crawler in an endless crawling loop that burdens the computer. To make fast capture of webpages convenient and to improve crawling efficiency, a data capture method and system based on a distributed crawler is now designed.

Summary of the invention

The purpose of the present invention is to provide a data capture method and system based on a distributed crawler. Webpages are divided, and crawl parameters are determined according to that division; webpage information can be captured using the crawl parameters, and a jump link can connect to other webpages. This enables fast capture of webpage information and solves the problem of low capture efficiency in existing crawling processes.

The purpose of the present invention can be achieved through the following technical solution:

A data capture method based on a distributed crawler comprises the following steps:

S1, dividing webpages according to the attributes of different webpages;

S2, determining at least one crawl parameter for each division of webpages;

S3, capturing the current webpage information according to the determined crawl parameter, and parsing the captured webpage information to obtain parsed data;

S4, compiling the obtained parsed data to obtain webpage data that a user can recognize;

S5, summarizing the compiled webpage data and feeding it back to the data storage module;

S6, storing the summarized webpage data.
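Steps S1 to S6 can be pictured as a small pipeline. The sketch below is only an illustration under stated assumptions: the patent does not define what a webpage "attribute" or a crawl parameter is, so the `attribute` field, the `field_sep` parameter, the sample `PAGES` and every function name are hypothetical, and an in-memory list stands in for the data storage module.

```python
from collections import defaultdict

# Hypothetical pages; "body" stands in for downloaded page content.
PAGES = [
    {"url": "https://example.com/news/1", "attribute": "news", "body": "title=Rain;views=10"},
    {"url": "https://example.com/news/2", "attribute": "news", "body": "title=Sun;views=25"},
    {"url": "https://example.com/shop/1", "attribute": "shop", "body": "title=Lamp|price=30"},
]

# S2: one set of crawl parameters per division (values are made up).
PARAMS = {
    "news": {"field_sep": ";"},
    "shop": {"field_sep": "|"},
}

def divide(pages):
    """S1: group pages that share the same attribute."""
    groups = defaultdict(list)
    for page in pages:
        groups[page["attribute"]].append(page)
    return groups

def capture_and_parse(page, params):
    """S3: split the captured body into key/value pairs using the crawl parameter."""
    pairs = page["body"].split(params["field_sep"])
    return [pair.split("=", 1) for pair in pairs]

def compile_record(url, parsed):
    """S4: turn parsed pairs into a record a user can read."""
    record = {"url": url}
    record.update({key: value for key, value in parsed})
    return record

def run(pages):
    store = []  # S6: stand-in for the data storage module
    for attribute, group in divide(pages).items():
        params = PARAMS[attribute]
        compiled = [compile_record(p["url"], capture_and_parse(p, params)) for p in group]
        summary = {"attribute": attribute, "count": len(compiled), "records": compiled}  # S5
        store.append(summary)
    return store

stored = run(PAGES)
print(stored)
```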

A data capture system based on a distributed crawler comprises a webpage classification module, a strategy crawling module, a parsing module, a collector, a data feedback module and a data storage module;

The webpage classification module is used to divide webpages by their attributes, placing webpages with the same attribute in the same class;

The strategy crawling module determines at least one crawl parameter according to the different divisions of webpages;

The parsing module captures webpage information from the current webpage according to the crawl parameter, and parses the captured webpage information to obtain parsed data;

The collector is used to compile the parsed data obtained by the parsing module under the different parsing rules into webpage data that a user can recognize, and to send the compiled webpage data to the data feedback module;

The data feedback module summarizes the compiled webpage data it receives, and feeds the summarized webpage data back to the data storage module;

The data storage module is used to store the summarized webpage data fed back by the data feedback module.

Further, the strategy crawling module includes at least one jump link, used to connect to other webpages.

Further, the parsing module includes a type judging unit and a parsing unit. The type judging unit is used to judge the type of the captured webpage; the parsing unit parses the data on the webpage according to the webpage's type.

Beneficial effects of the present invention:

In the data capture method and system based on a distributed crawler of the present invention, webpages are divided and the crawl parameters are determined according to that division. Webpage information can be captured using the crawl parameters, and a jump link can connect to other webpages. This enables fast capture of webpage information, ensures that data collection is completed in a relatively short time, improves collection efficiency, and meets timing requirements.

Description of the drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a data capture method based on a distributed crawler according to the present invention;

Fig. 2 is a schematic diagram of a data capture system based on a distributed crawler according to the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Referring to Fig. 1, the present invention is a data capture method based on a distributed crawler, comprising the following steps:

S1, dividing webpages according to the attributes of different webpages;

As shown in Fig. 2, a data capture system based on a distributed crawler comprises a webpage classification module, a strategy crawling module, a parsing module, a collector, a data feedback module and a data storage module;

The webpage classification module is used to divide webpages by their attributes, placing webpages with the same attribute in the same class;

The strategy crawling module determines at least one crawl parameter according to the different divisions of webpages. The strategy crawling module includes at least one jump link, used to connect to other webpages and so jump from one webpage to another;
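One way to picture the strategy crawling module is as a lookup from webpage class to crawl parameters, where the parameters carry the jump links (the links used to connect to other webpages). The sketch below is hypothetical throughout: the patent does not say what a crawl parameter contains, so `max_depth`, `delay_seconds`, the class names and the link values are assumptions chosen for illustration. Link resolution uses the standard-library `urllib.parse.urljoin`.

```python
from dataclasses import dataclass
from urllib.parse import urljoin

@dataclass(frozen=True)
class CrawlParams:
    max_depth: int        # assumed parameter: how many jumps to follow
    delay_seconds: float  # assumed parameter: pause between requests
    jump_links: tuple     # relative or absolute links leading to other webpages

# Hypothetical strategy table: one entry per webpage class.
STRATEGY = {
    "news": CrawlParams(max_depth=2, delay_seconds=1.0, jump_links=("archive/", "/topics")),
    "shop": CrawlParams(max_depth=1, delay_seconds=2.0, jump_links=("?page=2",)),
}

# Fallback for classes without an entry: do not follow any jump link.
DEFAULT = CrawlParams(max_depth=0, delay_seconds=1.0, jump_links=())

def params_for(page_class):
    """Return the crawl parameters for a webpage class."""
    return STRATEGY.get(page_class, DEFAULT)

def resolve_jump_links(page_url, params):
    """Turn the jump links of a class into absolute URLs relative to the current page."""
    return [urljoin(page_url, link) for link in params.jump_links]

targets = resolve_jump_links("https://example.com/news/today", params_for("news"))
print(targets)
```

Keeping the parameters in a table keyed by class means a new kind of webpage only needs a new table entry rather than new crawling code.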

The parsing module captures webpage information from the current webpage according to the crawl parameter, and parses the captured webpage information to obtain parsed data. The parsing module includes a type judging unit and a parsing unit. The type judging unit is used to judge the type of the captured webpage, where the webpage types include home page, pagination page and detail page; the parsing unit parses the data on the webpage according to the webpage's type;

If the current webpage is a home page, the data on the webpage is parsed according to the rules for home pages to obtain parsed data; if the current webpage is a pagination page, the data on the webpage is parsed according to the rules for pagination pages to obtain parsed data; if the current webpage is a detail page, the data on the webpage is parsed according to the rules for detail pages to obtain parsed data;
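The type judging unit and the parsing unit can be sketched as a classifier followed by a rule lookup. This is an illustration under assumptions: the patent does not say how the page type is judged or what each rule extracts, so judging by URL shape, the regular expressions and the HTML class names below are all hypothetical.

```python
import re
from urllib.parse import urlparse, parse_qs

def judge_type(url):
    """Type judging unit: classify a URL as 'home', 'paging' or 'detail' (assumed heuristic)."""
    parsed = urlparse(url)
    if parsed.path in ("", "/"):
        return "home"
    if "page" in parse_qs(parsed.query):
        return "paging"
    return "detail"

def parse_home(html):
    # Assumed rule: a home page lists links to sections.
    return {"sections": re.findall(r'<a class="section" href="([^"]+)"', html)}

def parse_paging(html):
    # Assumed rule: a pagination page lists links to items.
    return {"items": re.findall(r'<a class="item" href="([^"]+)"', html)}

def parse_detail(html):
    # Assumed rule: a detail page carries its title in the first <h1>.
    match = re.search(r"<h1>(.*?)</h1>", html)
    return {"title": match.group(1) if match else None}

# Parsing unit: one rule per webpage type.
RULES = {"home": parse_home, "paging": parse_paging, "detail": parse_detail}

def parse_page(url, html):
    """Judge the page type, then apply the matching rule."""
    page_type = judge_type(url)
    return page_type, RULES[page_type](html)

print(parse_page("https://example.com/news/7", "<h1>Rain</h1>"))
```

Regular expressions keep the sketch dependency-free; a production parser would more likely use a proper HTML parser.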

The collector is used to compile the parsed data obtained by the parsing module under the different parsing rules into webpage data that a user can recognize, and to send the compiled webpage data to the data feedback module;

The data feedback module summarizes the compiled webpage data it receives, and feeds the summarized webpage data back to the data storage module;

The data storage module is used to store the summarized webpage data fed back by the data feedback module.
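The hand-off from collector to data feedback module to data storage module might look like the sketch below. It is not the patent's implementation: the record layout, the per-type counts used as the "summary", and the choice of an in-memory SQLite database with JSON payloads are all assumptions made so the example is self-contained.

```python
import json
import sqlite3

def compile_record(page_type, parsed):
    """Collector: turn parser output into a flat, human-readable record."""
    summary = ", ".join(f"{key}={value}" for key, value in sorted(parsed.items()))
    return {"type": page_type, "summary": summary}

def summarize(records):
    """Data feedback module: gather compiled records into one batch with per-type counts."""
    counts = {}
    for record in records:
        counts[record["type"]] = counts.get(record["type"], 0) + 1
    return {"counts": counts, "records": records}

def store(conn, batch):
    """Data storage module: persist one summarized batch as a JSON row."""
    conn.execute("CREATE TABLE IF NOT EXISTS batches (id INTEGER PRIMARY KEY, payload TEXT)")
    cursor = conn.execute("INSERT INTO batches (payload) VALUES (?)", (json.dumps(batch),))
    conn.commit()
    return cursor.lastrowid

conn = sqlite3.connect(":memory:")  # in-memory database, used only for illustration
records = [
    compile_record("detail", {"title": "Rain", "views": 10}),
    compile_record("paging", {"items": 3}),
]
batch_id = store(conn, summarize(records))

# Read the batch back to show that the stored data survives a round trip.
row = conn.execute("SELECT payload FROM batches WHERE id = ?", (batch_id,)).fetchone()
restored = json.loads(row[0])
print(restored["counts"])
```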

The above is merely an example and description of the concept of the present invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute them in a similar manner; provided these do not depart from the concept of the invention or exceed the scope defined by the claims, they fall within the protection scope of the present invention.

Claims

1. A data capture method based on a distributed crawler, characterized by comprising the following steps:

S1, dividing webpages according to the attributes of different webpages;

S3, capturing the current webpage information according to the determined crawl parameter, and parsing the captured webpage information to obtain parsed data;

2. A data capture system based on a distributed crawler, characterized by comprising a webpage classification module, a strategy crawling module, a parsing module, a collector, a data feedback module and a data storage module;

The webpage classification module is used to divide webpages by their attributes, placing webpages with the same attribute in the same class;

The parsing module captures webpage information from the current webpage according to the crawl parameter, and parses the captured webpage information to obtain parsed data;

The collector is used to compile the parsed data obtained by the parsing module under the different parsing rules into webpage data that a user can recognize, and to send the compiled webpage data to the data feedback module;

The data feedback module summarizes the compiled webpage data it receives, and feeds the summarized webpage data back to the data storage module;

3. The data capture system based on a distributed crawler according to claim 2, characterized in that the strategy crawling module includes at least one jump link, used to connect to other webpages.

4. The data capture system based on a distributed crawler according to claim 2, characterized in that the parsing module includes a type judging unit and a parsing unit; the type judging unit is used to judge the type of the captured webpage; the parsing unit parses the data on the webpage according to the webpage's type.