CN104750804A - Plug-in type configurable vertical network spider implementation method - Google Patents

Info

Publication number
CN104750804A
Authority
CN
China
Prior art keywords
url
configuration
plug
capture
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510131253.7A
Other languages
Chinese (zh)
Inventor
孟硕培
程向飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tu Niu Science And Technology Ltd
Original Assignee
Nanjing Tu Niu Science And Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tu Niu Science And Technology Ltd

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a plug-in-based, configurable method for implementing a vertical-domain web crawler. The method comprises a crawling stage and an extraction stage: the crawling stage comprises a crawl configuration phase and a crawl program execution phase, and the extraction stage comprises an extraction configuration phase and an extraction program execution phase. Through configuration, the method can crawl web pages and extract information in multiple fields with high precision. It overcomes the unclear search intent and low precision of traditional search engines while still covering multiple fields.

Description

A plug-in-based, configurable method for implementing a vertical-domain web crawler
Technical field
The present invention relates to methods for implementing vertical-domain web crawlers, and in particular to a plug-in-based, configurable method for implementing a vertical-domain web crawler.
Background art
With the development of Internet technology, the amount of information on the network has become vast. Finding the required information quickly and accurately in this ocean of information has become a focus of attention. Faced with this volume of network information, workers in a given field only need the data relevant to that field and no other data, and they often have high requirements for data quality. Traditional search engines cannot determine the intent behind a search, return results of low precision, and deliver unstructured data, so they struggle to provide accurate, clean data.
A typical vertical-domain crawler uses an algorithm designed for one specific field. It can obtain information in that field accurately, but it cannot crawl web pages or extract information in any other field.
Summary of the invention
Object of the invention: the object of the present invention is to provide a plug-in-based, configurable, high-precision method for implementing a vertical-domain web crawler that can crawl web pages and extract information in multiple fields through configuration.
Technical solution: to achieve this object, the present invention adopts the following technical solution:
The plug-in-based, configurable vertical-domain web crawler implementation method of the present invention comprises a crawling stage and an extraction stage. The crawling stage comprises a crawl configuration phase and a crawl program execution phase, and the extraction stage comprises an extraction configuration phase and an extraction program execution phase:
Crawl configuration phase: set up the crawl configuration file;
Crawl program execution phase: read the crawl configuration file and carry out the crawl to obtain the crawled data;
Extraction configuration phase: configure the basic information, and configure the extraction pattern according to the crawled data;
Extraction program execution phase: read the extraction configuration file and extract from the crawled data to obtain the extracted data.
Further, the crawl configuration phase comprises the following steps:
Step (11): place all entry URLs in a separate configuration file and assign each its corresponding state;
Step (12): configure the storage path for the crawled data and the encoding of the crawled pages;
Step (13): configure the storage settings for the crawled URLs;
Step (14): configure the path of the entry URL file;
Step (15): configure a crawl state, which defines the crawl rule that produces the URL list of the next state;
Step (16): determine whether the crawl state is the final state; if not, return to step (15) and configure the next state; if so, go to step (17);
Step (17): save the crawler's configuration.
Further, the crawl rule in step (15) is implemented by a crawl plug-in, which uses XPath, regular expressions, or custom code.
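The plug-in idea can be sketched as a small registry that maps a plug-in type name to a function. This is only an illustration under stated assumptions: the registry `PLUGINS`, the decorator `plugin`, the function `apply_rule`, and the custom plug-in `detail_links` are hypothetical names, not taken from the patent, and only the regular-expression and custom-code variants are shown.

```python
import re
from typing import Callable, Dict, List

# A crawl plug-in takes the page text and one argument from the
# configuration, and returns the URLs it found (hypothetical signature).
Plugin = Callable[[str, str], List[str]]
PLUGINS: Dict[str, Plugin] = {}


def plugin(name: str) -> Callable[[Plugin], Plugin]:
    """Register a function under the type name used in the configuration."""
    def register(func: Plugin) -> Plugin:
        PLUGINS[name] = func
        return func
    return register


@plugin("regex")
def regex_plugin(page: str, argument: str) -> List[str]:
    # The argument is a regular expression with one capturing group.
    return re.findall(argument, page)


@plugin("detail_links")  # an example of a custom-code plug-in
def detail_links(page: str, argument: str) -> List[str]:
    # Keep only links whose path starts with the given prefix.
    hrefs = re.findall(r'href="([^"]+)"', page)
    return [u for u in hrefs if u.startswith(argument)]


def apply_rule(plugin_type: str, page: str, argument: str) -> List[str]:
    """Look up the plug-in named in the configuration and run it."""
    if plugin_type not in PLUGINS:
        raise ValueError(f"unknown crawl plug-in: {plugin_type}")
    return PLUGINS[plugin_type](page, argument)


sample = '<a href="/tours/1">x</a><a href="/ads/2">y</a>'
all_links = apply_rule("regex", sample, r'href="([^"]+)"')
tour_links = apply_rule("detail_links", sample, "/tours/")
```

With a registry of this kind, supporting a new site means registering another function rather than changing the crawl loop, which is the benefit the configurable design aims for.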
Further, the crawl program execution phase comprises the following steps:
Step (21): read the configuration file;
Step (22): obtain the entry URL list and add it to the to-crawl URL queue;
Step (23): determine whether the to-crawl URL queue is empty; if it is empty, the crawl ends; otherwise, go to step (24);
Step (24): take a URL from the to-crawl URL queue;
Step (25): determine whether this URL has already been crawled; if so, return to step (23); otherwise, go to step (26);
Step (26): crawl and store the web page that the URL points to;
Step (27): add the crawled URL to the URL pool;
Step (28): enter the crawl state configured in step (15) and obtain a new URL list according to its crawl rule;
Step (29): add the new URL list to the to-crawl URL queue and return to step (23).
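Steps (22) to (29) amount to a queue-driven loop with a set of already crawled URLs. The sketch below is a minimal illustration, not the patent's implementation: the function name `crawl`, the `(url, state)` pairing, the injected `fetch` and `next_tasks` callables, and the in-memory `SITE` dictionary that stands in for the network are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

# A queued item is (url, state); this pairing is an assumption.
Task = Tuple[str, int]


def crawl(entries: Iterable[Task],
          fetch: Callable[[str], str],
          next_tasks: Callable[[str, str, int], List[Task]]) -> Dict[str, str]:
    """Run steps (22)-(29) and return {url: page} for every crawled page."""
    queue = deque(entries)            # step (22): seed the to-crawl queue
    pool = set()                      # URL pool of already crawled URLs
    store: Dict[str, str] = {}        # stands in for on-disk page storage
    while queue:                      # step (23): stop when the queue is empty
        url, state = queue.popleft()  # step (24)
        if url in pool:               # step (25): skip URLs already crawled
            continue
        page = fetch(url)             # step (26): crawl the page ...
        store[url] = page             # ... and store it
        pool.add(url)                 # step (27)
        # steps (28)-(29): apply the crawl rule and queue the new URLs
        queue.extend(next_tasks(url, page, state))
    return store


# Hypothetical in-memory "site": each page lists the URLs it links to.
SITE = {
    "list/1": "list/2 detail/a detail/b",
    "list/2": "detail/b detail/c",
    "detail/a": "", "detail/b": "", "detail/c": "",
}


def links_as_tasks(url: str, page: str, state: int) -> List[Task]:
    # Toy crawl rule: every whitespace-separated token is a link.
    return [(link, state) for link in page.split()]


pages = crawl([("list/1", 1)], SITE.__getitem__, links_as_tasks)
```

Because `detail/b` is linked from both list pages, it is queued twice but fetched once, which is the purpose of the check in step (25).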
Further, the extraction configuration phase comprises the following steps:
Step (31): configure the basic information, which comprises the storage path of the crawled data, the storage path of the URLs to be extracted, and the storage path of the extracted data;
Step (32): configure the extraction pattern according to the crawled data, which comprises the information to extract from the crawled web pages and the extraction rules.
Further, the extraction rules in step (32) are implemented by extraction plug-ins, which use XPath, regular expressions, or custom code.
Further, the extraction program execution phase comprises the following steps:
Step (41): read the configuration file;
Step (42): read the list of URLs to be extracted and add it to the to-extract URL queue;
Step (43): read a URL from the to-extract URL queue;
Step (44): look up the web page corresponding to the URL under the crawled-data storage path described in step (31); if the page exists, go to step (45); otherwise, go to step (46);
Step (45): extract the corresponding content according to the extraction rules and store it;
Step (46): determine whether the to-extract URL queue is empty; if it is empty, finish; otherwise, return to step (43).
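Steps (42) to (46) can be sketched as a second queue loop over pages that were stored earlier. The names `run_extraction`, `stored_pages`, and `extract` are hypothetical, and a dictionary stands in for the storage path of step (31). One small deviation is worth noting: the loop tests the queue before reading from it, so an empty URL list simply ends the run.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def run_extraction(urls: Iterable[str],
                   stored_pages: Dict[str, str],
                   extract: Callable[[str], Dict[str, str]]) -> List[Dict[str, str]]:
    """Run steps (42)-(46) and return one record per page that was found."""
    queue = deque(urls)                 # step (42): fill the to-extract queue
    records: List[Dict[str, str]] = []
    while queue:                        # step (46): finish when the queue is empty
        url = queue.popleft()           # step (43)
        page = stored_pages.get(url)    # step (44): look the page up in storage
        if page is None:
            continue                    # page was never crawled: skip to step (46)
        records.append(extract(page))   # step (45): apply the rules and store
    return records


# Toy usage: one stored page, one URL that was never crawled.
stored = {"u1": "price=10"}
found = run_extraction(["u1", "missing"], stored, lambda page: {"raw": page})
```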
Beneficial effects: the present invention crawls web pages and extracts information in multiple fields through configuration, with high precision. It overcomes the unclear intent and low precision of traditional search engines while still covering multiple fields.
Brief description of the drawings
Fig. 1 is a block diagram of the method of the present invention;
Fig. 2 is a flowchart of the crawl program execution phase of the present invention;
Fig. 3 is a flowchart of the extraction program execution phase of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is further described below through an embodiment, with reference to the drawings.
Fig. 1 shows the block diagram of the plug-in-based, configurable vertical-domain web crawler implementation method of the present invention, which comprises a crawling stage and an extraction stage. The crawling stage comprises a crawl configuration phase and a crawl program execution phase, and the extraction stage comprises an extraction configuration phase and an extraction program execution phase.
Crawl configuration phase: set up the crawl configuration file;
Crawl program execution phase: read the crawl configuration file and carry out the crawl to obtain the crawled data;
Extraction configuration phase: configure the basic information, and configure the extraction pattern according to the crawled data;
Extraction program execution phase: read the extraction configuration file and extract from the crawled data to obtain the extracted data.
The technical solution is illustrated further with an embodiment. Suppose there is a requirement to analyse the price distribution of a website's Yunnan tour routes. To meet it, a developer needs information such as the price, trip length in days, and departure dates of each route on the site. Following the method of the present invention, the developer first runs the crawling stage to crawl the site's content, and then runs the extraction stage to extract the required information, such as route ID, route name, and price.
1. Crawl configuration phase
(1) Configure the entry URLs
Step (11): navigate to the list page of the website's Yunnan tour routes, for example http://www.tuniu.com/g3300/tours-sh-0/list-h0/1/. Because every route can be reached by following the "next page" link repeatedly, this page can serve as the entry page for the whole crawl. Place all entry URLs in a separate configuration file and assign each its corresponding state. The configuration is as follows:
The value in the state node indicates the state of this URL within the crawl steps.
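The entry file sample itself does not appear in this text. As an illustration only, the sketch below shows one way such a file could look and be loaded. Apart from the state node and the entry URL, which the description mentions, the element names (`entries`, `entry`, `url`) and the function `load_entries` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout of the entry URL file; only <state> is named in the text.
ENTRY_XML = """
<entries>
  <entry>
    <url>http://www.tuniu.com/g3300/tours-sh-0/list-h0/1/</url>
    <state>1</state>
  </entry>
</entries>
"""


def load_entries(xml_text):
    """Return a list of (url, state) pairs read from the entry file."""
    root = ET.fromstring(xml_text)
    return [(entry.findtext("url").strip(), int(entry.findtext("state")))
            for entry in root.findall("entry")]


entries = load_entries(ENTRY_XML)
```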
(2) Configure the storage path and encoding of the crawled pages, the storage settings for the crawled data, and the path of the entry URL file
Step (12): configure the storage path and the encoding of the crawled data;
Step (13): configure the storage settings for the crawled URLs;
Step (14): configure the path of the entry URL file;
Configuration sample for steps (12), (13), and (14): all configuration information is placed in a single XML file that holds every setting of the crawl configuration phase, including the storage path of the crawled data (index_dir), the encoding of the crawled pages (encoding), the path of the entry URL file (entry_urls), and the storage path of the crawled URLs (output_urls).
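A file holding these four settings could be read as sketched below. The four element names come from the description; the root element `crawler`, the example paths and encoding value, and the function `load_base_config` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical crawl configuration; the values are placeholders.
CRAWL_XML = """
<crawler>
  <index_dir>data/pages</index_dir>
  <encoding>utf-8</encoding>
  <entry_urls>conf/entry_urls.xml</entry_urls>
  <output_urls>data/urls.txt</output_urls>
</crawler>
"""

REQUIRED = ("index_dir", "encoding", "entry_urls", "output_urls")


def load_base_config(xml_text):
    """Read the four basic settings and fail early if one is missing."""
    root = ET.fromstring(xml_text)
    config = {}
    for key in REQUIRED:
        value = root.findtext(key)
        if value is None:
            raise ValueError(f"missing <{key}> in crawl configuration")
        config[key] = value.strip()
    return config


base_config = load_base_config(CRAWL_XML)
```

Checking for missing settings at load time means a configuration mistake surfaces before the crawl starts rather than midway through it.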
(3) Configure the crawl states
Step (15): once the basic information above is configured, the crawl states are configured. A crawl state defines the crawl rule that produces the URL list of the next state. Each state is written in an XML node named state, whose id attribute identifies the state. The action node inside each state indicates the rule used to obtain the target URL list. The type attribute of the action node names the crawl plug-in to use, which relies on XPath, regular expressions, or custom code; the state attribute of the action node names the next state that the obtained URLs move to.
The configuration of step (15) proceeds as follows:
First, configure state 1, which obtains the URLs of all list pages. State 1 traverses every list page of the website's Yunnan tour routes. Each list page shows only 20 routes, so the crawler must move on to the next page to cover all routes. The nextpage crawl plug-in therefore finds the link named "next page" in the web page and so obtains the URL of the next page.
Second, configure state 2, which obtains the URLs of all route detail pages. After all list page URLs have been obtained, the crawler enters the next state, state 2. Take the page http://www.tuniu.com/g3300/tours-sh-0/list-h0/1/ as an example: it lists 20 routes, and the URLs of these 20 routes can be found with the XPath expression //dt/p//a/@href.
Finally, after the route detail page URLs have been obtained, the crawler enters state 3. State 3 contains no action; it marks the stop condition, and the crawled route detail pages are saved to the server.
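The two rules used in states 1 and 2 can be sketched with the Python standard library. Several things here are assumptions: the simplified, well-formed `LIST_PAGE` markup, the link text 下一页 ("next page"), the URL of the second list page, and the function names. Real HTML is rarely well-formed XML and would need an HTML parser. The standard library's ElementTree also cannot evaluate the final `/@href` step of the patent's XPath, so the sketch selects the `a` elements with `.//dt/p//a` and reads the attribute in Python instead.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE = "http://www.tuniu.com/g3300/tours-sh-0/list-h0/1/"

# Hypothetical, simplified list page used only for this illustration.
LIST_PAGE = """
<html><body>
  <dl>
    <dt><p><span><a href="/tours/1">Route 1</a></span></p></dt>
    <dt><p><a href="/tours/2">Route 2</a></p></dt>
    <dd><p><a href="/ads/9">Not a route</a></p></dd>
  </dl>
  <a href="/g3300/tours-sh-0/list-h0/2/">下一页</a>
</body></html>
"""


def next_page_url(page, base_url, label="下一页"):
    """State 1 rule: find the link whose text is the 'next page' label."""
    root = ET.fromstring(page)
    for a in root.iter("a"):
        if "".join(a.itertext()).strip() == label and a.get("href"):
            return urljoin(base_url, a.get("href"))
    return None  # no such link: this is the last list page


def detail_urls(page, base_url):
    """State 2 rule: links matched by //dt/p//a, returned as absolute URLs."""
    root = ET.fromstring(page)
    return [urljoin(base_url, a.get("href"))
            for a in root.findall(".//dt/p//a") if a.get("href")]


next_url = next_page_url(LIST_PAGE, BASE)
route_urls = detail_urls(LIST_PAGE, BASE)
```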
The configuration sample of step (15):
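The patent's own sample does not appear in this text. As a purely illustrative stand-in, the sketch below writes the three states described above in a hypothetical layout and loads them into a table. The state and action nodes and the id, type, and state attributes come from the description; the root element `states`, the placement of the XPath expression as element text, and the functions `load_states` and `final_states` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical state configuration for the three states of the embodiment.
STATES_XML = """
<states>
  <state id="1">
    <action type="nextpage" state="2"/>
  </state>
  <state id="2">
    <action type="xpath" state="3">//dt/p//a/@href</action>
  </state>
  <state id="3"/>
</states>
"""


def load_states(xml_text):
    """Return {state_id: [(plugin_type, next_state, argument), ...]}."""
    table = {}
    for node in ET.fromstring(xml_text).findall("state"):
        actions = [(a.get("type"), int(a.get("state")), (a.text or "").strip())
                   for a in node.findall("action")]
        table[int(node.get("id"))] = actions
    return table


def final_states(table):
    """A state without any action marks the stop condition."""
    return sorted(state for state, actions in table.items() if not actions)


state_table = load_states(STATES_XML)
stop_states = final_states(state_table)
```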
In summary, this completes the crawl configuration phase. The crawl program execution phase can now run according to the configuration file produced in the crawl configuration phase.
2. Crawl program execution phase
Fig. 2 shows the flowchart of the crawl program execution phase, which comprises the following steps:
Step (21): read the configuration file;
Step (22): obtain the entry URL list and add it to the to-crawl URL queue;
Step (23): determine whether the to-crawl URL queue is empty; if it is empty, the crawl ends; otherwise, go to step (24);
Step (24): take a URL from the to-crawl URL queue;
Step (25): determine whether this URL has already been crawled; if so, return to step (23); otherwise, go to step (26);
Step (26): crawl and store the web page that the URL points to;
Step (27): add the crawled URL to the URL pool;
Step (28): enter the crawl state configured in step (15) and obtain a new URL list according to its crawl rule;
Step (29): add the new URL list to the to-crawl URL queue and return to step (23).
3. Extraction configuration phase
The data obtained when the crawl program finishes are whole web pages, which cannot be used directly for data analysis. The pages must be processed to extract the useful data.
The extraction configuration phase comprises the following two steps:
Step (31): configure the basic information, which comprises the storage path of the crawled data, the storage path of the URLs to be extracted, and the storage path of the extracted data;
Step (32): configure the extraction pattern according to the data obtained in the crawling stage (for example, web pages), which comprises the information to extract from the crawled web pages and the extraction rules.
For step (31), configure the basic information for extraction, including the storage path of the URLs to be extracted (url_list), the storage path of the crawled data (index_dir), and the storage path of the extracted data (output_dir). Taking a directory in the file system as the location for the extracted data, the configuration sample for step (31) is as follows:
For step (32), take the web page http://sh.tuniu.com/tours/22512110 as an example: the data to extract comprise three fields, the product ID (id), the name (name), and the promotional price (price). Each field is specified as an XML node that gives its extraction path and the processing to apply.
Each node element is one field to process, and its name attribute gives the field's name. The xpath attribute gives the XPath of the field within the web page. If the content obtained through the XPath does not meet requirements, an extraction plug-in (action) can process it further; for example, the replace extraction plug-in applies a regular-expression replacement to the obtained content. Extraction plug-ins use XPath or regular expressions; if the existing extraction plug-ins do not meet the need, developers can write their own extraction plug-in in custom code.
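The field configuration can be sketched as follows. The node element, its name and xpath attributes, and the replace plug-in come from the description. The rest is assumed for illustration: the root element `extract`, the `pattern` and `with` attributes, the simplified `PAGE` markup, the placeholder route name, and the function names. The XPath expressions are limited to the subset that Python's ElementTree supports.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical extraction configuration for the three fields.
EXTRACT_XML = """
<extract>
  <node name="id" xpath=".//span[@class='pid']"/>
  <node name="name" xpath=".//h1"/>
  <node name="price" xpath=".//span[@class='price']">
    <action type="replace" pattern="[^0-9]" with=""/>
  </node>
</extract>
"""

# Hypothetical, well-formed stand-in for a crawled detail page.
PAGE = """
<html><body>
  <span class="pid">22512110</span>
  <h1>Sample route name</h1>
  <span class="price">from CNY 1299</span>
</body></html>
"""


def load_fields(xml_text):
    """Return [(field_name, xpath, [(pattern, replacement), ...]), ...]."""
    fields = []
    for node in ET.fromstring(xml_text).findall("node"):
        actions = [(a.get("pattern"), a.get("with", ""))
                   for a in node.findall("action") if a.get("type") == "replace"]
        fields.append((node.get("name"), node.get("xpath"), actions))
    return fields


def extract(page, fields):
    """Apply each field's XPath, then its replace actions, to one page."""
    root = ET.fromstring(page)
    record = {}
    for name, xpath, actions in fields:
        value = (root.findtext(xpath) or "").strip()
        for pattern, replacement in actions:
            value = re.sub(pattern, replacement, value)  # replace plug-in
        record[name] = value
    return record


record = extract(PAGE, load_fields(EXTRACT_XML))
```

Here the replace action strips every non-digit character, so the raw price text is reduced to the number alone.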
4. Extraction program execution phase
Once extraction is configured, the extraction program can run on the data obtained in the crawling stage, according to the information set in the extraction configuration phase. Fig. 3 shows the flowchart of the extraction program execution phase, which comprises the following steps:
Step (41): read the configuration file;
Step (42): read the list of URLs to be extracted and add it to the to-extract URL queue;
Step (43): read a URL from the to-extract URL queue;
Step (44): look up the web page corresponding to the URL in the corresponding directory or database; if the page exists, go to step (45); otherwise, go to step (46);
Step (45): extract the corresponding content according to the extraction rules and store it;
Step (46): determine whether the to-extract URL queue is empty; if it is empty, finish; otherwise, return to step (43).
The data extracted in this embodiment are as follows:
id:22512110
name:<Kunming-Dali-Shuanglang-Lijiang 6-day round-trip flight tour> See the grand scenery of Dianchi Lake, savor the Shuanglang of your dreams
price:1299
The extracted data are well structured, so developers can readily use them for further analysis.

Claims (7)

1. A plug-in-based, configurable method for implementing a vertical-domain web crawler, characterized in that it comprises a crawling stage and an extraction stage, wherein the crawling stage comprises a crawl configuration phase and a crawl program execution phase, and the extraction stage comprises an extraction configuration phase and an extraction program execution phase;
Crawl configuration phase: set up the crawl configuration file;
Crawl program execution phase: read the crawl configuration file and carry out the crawl to obtain the crawled data;
Extraction configuration phase: configure the basic information, and configure the extraction pattern according to the crawled data;
Extraction program execution phase: read the extraction configuration file and extract from the crawled data to obtain the extracted data.
2. The plug-in-based, configurable vertical-domain web crawler implementation method according to claim 1, characterized in that the crawl configuration phase comprises the following steps:
Step (11): place all entry URLs in a separate configuration file and assign each its corresponding state;
Step (12): configure the storage path for the crawled data and the encoding of the crawled pages;
Step (13): configure the storage settings for the crawled URLs;
Step (14): configure the path of the entry URL file;
Step (15): configure a crawl state, which defines the crawl rule that produces the URL list of the next state;
Step (16): determine whether the crawl state is the final state; if not, return to step (15) and configure the next state; if so, go to step (17);
Step (17): save the crawler's configuration.
3. The plug-in-based, configurable vertical-domain web crawler implementation method according to claim 2, characterized in that the crawl rule in step (15) is implemented by a crawl plug-in, which uses XPath, regular expressions, or custom code.
4. The plug-in-based, configurable vertical-domain web crawler implementation method according to claim 2, characterized in that the crawl program execution phase comprises the following steps:
Step (21): read the configuration file;
Step (22): obtain the entry URL list and add it to the to-crawl URL queue;
Step (23): determine whether the to-crawl URL queue is empty; if it is empty, the crawl ends; otherwise, go to step (24);
Step (24): take a URL from the to-crawl URL queue;
Step (25): determine whether this URL has already been crawled; if so, return to step (23); otherwise, go to step (26);
Step (26): crawl and store the web page that the URL points to;
Step (27): add the crawled URL to the URL pool;
Step (28): enter the crawl state configured in step (15) and obtain a new URL list according to its crawl rule;
Step (29): add the new URL list to the to-crawl URL queue and return to step (23).
5. The plug-in-based, configurable vertical-domain web crawler implementation method according to claim 1, characterized in that the extraction configuration phase comprises the following steps:
Step (31): configure the basic information, which comprises the storage path of the crawled data, the storage path of the URLs to be extracted, and the storage path of the extracted data;
Step (32): configure the extraction pattern according to the crawled data, which comprises the information to extract from the crawled web pages and the extraction rules.
6. The plug-in-based, configurable vertical-domain web crawler implementation method according to claim 5, characterized in that the extraction rules in step (32) are implemented by extraction plug-ins, which use XPath, regular expressions, or custom code.
7. The plug-in-based, configurable vertical-domain web crawler implementation method according to claim 5, characterized in that the extraction program execution phase comprises the following steps:
Step (41): read the configuration file;
Step (42): read the list of URLs to be extracted and add it to the to-extract URL queue;
Step (43): read a URL from the to-extract URL queue;
Step (44): look up the web page corresponding to the URL under the crawled-data storage path described in step (31); if the page exists, go to step (45); otherwise, go to step (46);
Step (45): extract the corresponding content according to the extraction rules and store it;
Step (46): determine whether the to-extract URL queue is empty; if it is empty, finish; otherwise, return to step (43).
CN201510131253.7A 2015-03-24 2015-03-24 Plug-in type configurable vertical network spider implementation method Pending CN104750804A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510131253.7A CN104750804A (en) 2015-03-24 2015-03-24 Plug-in type configurable vertical network spider implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510131253.7A CN104750804A (en) 2015-03-24 2015-03-24 Plug-in type configurable vertical network spider implementation method

Publications (1)

Publication Number Publication Date
CN104750804A true CN104750804A (en) 2015-07-01

Family

ID=53590488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510131253.7A Pending CN104750804A (en) 2015-03-24 2015-03-24 Plug-in type configurable vertical network spider implementation method

Country Status (1)

Country Link
CN (1) CN104750804A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050071766A1 (en) * 2003-09-25 2005-03-31 Brill Eric D. Systems and methods for client-based web crawling
CN101520798A (en) * 2009-03-06 2009-09-02 苏州锐创通信有限责任公司 Webpage classification technology based on vertical search and focused crawler

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
陈建彪: "面向领域的Web信息自动化抽取系统设计与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
鲍亮等: "《实战大数据》", 31 March 2014, 清华大学出版社 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106557495A (en) * 2015-09-25 2017-04-05 北京国双科技有限公司 The extended method and device of reptile function
CN106557495B (en) * 2015-09-25 2020-05-22 北京国双科技有限公司 Crawler function expansion method and device
CN108205531A (en) * 2016-12-16 2018-06-26 北京京东尚科信息技术有限公司 Data pick-up method and data extraction system
CN108205531B (en) * 2016-12-16 2022-01-28 北京京东尚科信息技术有限公司 Data extraction method and data extraction system
CN109213824A (en) * 2017-06-29 2019-01-15 北京京东尚科信息技术有限公司 Data grabber system, method and apparatus
CN110020068A (en) * 2017-09-26 2019-07-16 北京国双科技有限公司 A kind of page crawls the configuration method and device of rule
CN110020068B (en) * 2017-09-26 2021-10-15 北京国双科技有限公司 Method and device for configuring page crawling rules
CN107729508A (en) * 2017-10-23 2018-02-23 北京京东金融科技控股有限公司 Information crawler method and apparatus
CN110765334A (en) * 2019-09-10 2020-02-07 北京字节跳动网络技术有限公司 Data capture method, system, medium and electronic device
CN112836106A (en) * 2019-11-25 2021-05-25 北京国双科技有限公司 Method and device for creating data crawling application, storage medium and equipment
CN112685619A (en) * 2020-12-29 2021-04-20 江苏乐道胡巴投资管理有限公司 Universal capture method for webpage table data

Similar Documents

Publication Publication Date Title
CN104750804A (en) Plug-in type configurable vertical network spider implementation method
US20170235767A1 (en) System and method for multiresolution and multitemporal image search
CN104077402B (en) Data processing method and data handling system
CN102054028B (en) Method for implementing web-rendering function by using web crawler system
CN104516982A (en) Method and system for extracting Web information based on Nutch
CN104866471B (en) A kind of example match method based on local sensitivity Hash strategy
CN104598536B (en) A kind of distributed network information structuring processing method
CN104021483B (en) Passenger demand recommends method
CN106202108B (en) Web crawlers grabs method for allocating tasks and device and data grab method and device
CN109242553A (en) A kind of user behavior data recommended method, server and computer-readable medium
CN105302876A (en) Regular expression based URL filtering method
CN104572934B (en) A kind of webpage key content abstracting method based on DOM
Kim et al. Event diffusion patterns in social media
CN101441629A (en) Automatic acquiring method of non-structured web page information
CN105095209B (en) Document clustering method and device, the network equipment
CN104077293A (en) Webpage acquisition method and device
CN102567521B (en) Webpage data capturing and filtering method
CN107832344A (en) A kind of food security Internet public opinion analysis method based on storm stream calculation frameworks
CN104536830A (en) KNN text classification method based on MapReduce
CN103902667A (en) Simple network information collector achieving method based on meta-search
CN107861974A (en) A kind of adaptive network crawler system and its data capture method
CN103559258A (en) Webpage ranking method based on cloud computation
CN103198078B (en) A kind of internet news event report trend analysis and system
Xie et al. Token embeddings alignment for cross-modal retrieval
Xie et al. Design and implementation of the topic-focused crawler based on Scrapy

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent of invention or patent application
CB03 Change of inventor or designer information

Inventor after: Meng Shuopei

Inventor after: Cheng Xiangfei

Inventor after: Shi Hefu

Inventor before: Meng Shuopei

Inventor before: Cheng Xiangfei

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: MENG SHUOPEI CHENG XIANGFEI TO: MENG SHUOPEI CHENG XIANGFEI SHI HEFU

RJ01 Rejection of invention patent application after publication

Application publication date: 20150701

RJ01 Rejection of invention patent application after publication