CN101625692A - Method for rapidly collecting dynamic script website data - Google Patents

Publication number
CN101625692A
Authority
CN
China
Prior art keywords
page
event
trigger
type
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200910090288A
Other languages
Chinese (zh)
Other versions
CN101625692B (en)
Inventor
夏冰
高军
王腾蛟
杨冬青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN2009100902885A
Publication of CN101625692A
Application granted
Publication of CN101625692B
Expired - Fee Related
Anticipated expiration

Abstract

The invention relates to a method for rapidly collecting dynamic script website data, characterized by comprising the following steps: (i) obtaining and saving an Index page and adding it to a pending queue; (ii) judging whether the pending queue is empty; if so, the crawl is complete and the crawling process exits; otherwise selecting an arbitrary page from the pending queue, obtaining the type of the current page by page similarity, and determining, according to the XPath features extracted in a training step, which events on which page elements need to be triggered for that type; (iii) judging whether the current page has any untriggered event; if not, going to step (ii); otherwise triggering the event and judging whether the current page has changed and whether the changed page is a new page; if not, going to step (v); otherwise continuing with step (iv); (iv) saving the new page and adding it to the pending queue of step (i); (v) returning to the page state before the event was triggered and going to step (iii).

Description

A method for rapidly collecting dynamic script website data
Technical field
The present invention relates to a network data collection method, and in particular to a method for rapidly collecting data from dynamic script websites.
Background
With the arrival of the Web 2.0 era, more and more Internet servers and clients interact by means of dynamic scripts. Web page content has shifted from static pages to pages generated dynamically from data held in a database. On the one hand, after the main page has been downloaded, several further exchanges with the server are needed before all of the data is available; for example, the read count and comment count of a Sina blog are obtained by a request sent to the server after the page has loaded. On the other hand, many links in web content are no longer traditional "<a>" tags but are implemented in JavaScript; for example, page turning in the Tencent forum and the NetEase forum is controlled by JavaScript.
The crawler is the first stage of a search engine and the foundation of Internet data analysis. Traditional crawlers are inherently ill-suited to such websites: mainstream search engines generally handle dynamic script websites in one of three ways, namely avoiding them, hard-coding site-specific logic, or relying on a search engine interface that some websites provide. Some research institutions have proposed crawling dynamic script websites by simulating user behavior and clicking every page element in turn, but this approach is very slow and unsuitable for practical use.
Summary of the invention
In view of the above problems, the object of the present invention is to provide a method for rapidly collecting dynamic script website data that triggers only specific events on specific page elements, thereby greatly reducing the number of events that must be triggered and increasing the speed at which dynamic script website data is collected.
To achieve this object, the present invention adopts the following technical scheme: a method for rapidly collecting dynamic script website data, characterized in that it comprises the following steps:
i) Obtain and save the Index page, and add the Index page to the pending queue.
ii) Judge whether the pending queue is empty. If it is empty, the crawl is complete and the crawling process exits. Otherwise, select an arbitrary page from the pending queue, use page similarity to obtain the type of the current page, and, according to the XPath path features extracted in the training step, determine which events this type needs to trigger on which page elements.
iii) Judge whether the current page has any event that has not yet been triggered. If not, go to step ii). Otherwise trigger the event and judge whether the current page has changed and whether the changed page is a new page. If the page has not changed or is not a new page, go to step v); otherwise continue with step iv).
iv) Save the new page and add it to the pending queue of step i).
v) Roll back to the page state before the event was triggered, and go to step iii).
In step ii), the training step comprises:
1) Manually supply pages of every type to form the training sample set.
2) Judge whether the training sample set is empty. If it is empty, go to step 5); otherwise select a page from the training sample set.
3) Judge whether the current page still has any event that has not been triggered. If not, go to step 2). Otherwise trigger the current event and record its event information.
4) Roll back to the page state before the event was triggered, and go to step 3).
5) From the recorded event information, extract the XPath path features and the event types for the events that each type of page needs to trigger.
In step 3), the event information of the current event comprises the event type, the page element on which the event was triggered, whether the page changed, and the type of the page after the change.
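The training loop and the recorded event information can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `EventRecord`, `train`, `list_events` and `trigger` are hypothetical, and the browser interaction is abstracted into two caller-supplied functions. Rollback (step 4) is assumed to be implicit because `trigger` is expected not to mutate the page it is given.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class EventRecord:
    """Event information recorded in training step 3 (field names are assumptions)."""
    page_type: str               # type of the sample page the event was fired on
    event_type: str              # e.g. "onclick"
    xpath: str                   # XPath of the element that received the event
    changed: bool                # whether the page changed after the event
    result_type: Optional[str]   # type of the page after the change, if any


def train(
    samples: Iterable[Tuple[str, object]],
    list_events: Callable[[object], Iterable[Tuple[str, str]]],
    trigger: Callable[[object, str, str], Optional[str]],
) -> List[EventRecord]:
    """Fire every event on every sample page and record what happened.

    samples     -- (page_type, page) pairs supplied manually (step 1)
    list_events -- hypothetical helper returning (event_type, xpath) pairs of a page
    trigger     -- hypothetical helper that fires one event and returns the type of
                   the resulting page, or None if the page did not change
    """
    records: List[EventRecord] = []
    pending = list(samples)
    while pending:                                    # step 2: stop when the set is empty
        page_type, page = pending.pop()
        for event_type, xpath in list_events(page):   # step 3: each untriggered event
            result_type = trigger(page, event_type, xpath)
            records.append(
                EventRecord(page_type, event_type, xpath,
                            result_type is not None, result_type)
            )
            # step 4: rollback is implicit, since `trigger` leaves `page` untouched
    return records
```

Step 5 would then keep only the records whose `result_type` is a needed page type and generalize their XPaths.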
In step 5), the XPath path features are extracted as follows: XPath paths may be reduced together only if the names of the page elements they pass through are identical, and the reduction is applied to the ordinal indices of the page elements.
By adopting the above technical scheme, the present invention has the following advantages: 1. Because the invention exploits the similarity of pages generated from the same template, it can predict what kind of page an event such as a click on a given page element will lead to. The page type obtained through page similarity therefore determines both the type of the current page and which events need to be triggered on which page elements. 2. The invention reduces the XPaths of the page elements that lead to needed pages and extracts their features and event types, so only specific events on specific page elements are triggered. This greatly reduces the number of simulated user operations and increases the speed at which dynamic script website data is collected. The invention can be applied to the web crawling work of various search engines.
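The text does not specify how page similarity is computed. As one possible illustration only, the sketch below treats two pages as similar when they share most of their tag paths (index-free paths such as `/html/body/ul/li/a`) and scores them with the Jaccard coefficient. The choice of Jaccard similarity and the names `tag_paths`, `jaccard` and `classify` are assumptions, not part of the patent.

```python
from html.parser import HTMLParser

# HTML void elements never get a closing tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of index-free tag paths that occur in a document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/" + "/".join(self.stack + [tag]))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed children.
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def classify(html, samples):
    """Return the type whose sample page is structurally closest to `html`.

    samples -- dict mapping a page type name to the HTML of one sample page
    """
    paths = tag_paths(html)
    return max(samples, key=lambda t: jaccard(paths, tag_paths(samples[t])))
```

A production crawler would likely compare against several samples per type and apply a minimum similarity threshold; both are omitted here for brevity.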
Brief description of the drawings
Fig. 1 is a flowchart of the training steps performed before data crawling in the present invention.
Fig. 2 is a flowchart of the steps of the breadth-first crawling method of the present invention.
Detailed description
The present invention is described in detail below with reference to the drawings and embodiments.
In contrast to the prior art, which either does not handle dynamic script websites or hard-codes them, the method of the present invention comprises two parts: the first is training and the second is crawling. Training based on page similarity establishes which events each type of page should trigger on which page elements. Crawling can begin once training is complete. The crawling process of the present invention can adopt a variety of crawling strategies. In the breadth-first crawling method of this embodiment, the crawler rolls back to the original page after each event is triggered, and moves on to other pages only after every event that needs to be triggered on the original page has been triggered.
As shown in Fig. 1, the training steps of the present invention are as follows:
1) Manually supply pages of every type to form the training sample set.
2) Judge whether the training sample set is empty. If it is empty, go to step 5); otherwise select a page from the training sample set.
3) Judge whether the current page still has any event that has not been triggered. If not, go to step 2). Otherwise trigger the current event and record its event information: the event type, the page element on which the event was triggered, whether the page changed, and the type of the page after the change.
4) Roll back to the page state before the event was triggered, and go to step 3).
5) From the recorded event information, extract the XML Path Language (XPath) features of the page elements and the event types for the events that each type of page needs to trigger. The XPath path features are extracted as follows: XPath paths may be reduced together only if the names of the page elements they pass through are identical, and the reduction is applied to the ordinal indices of the page elements. For example, the two XPath paths "/html/body/div[4]/li[1]/a[1]" and "/html/body/div[4]/li[2]/a[1]" are reduced to "/html/body/div[4]/li[*]/a[1]". If the XPath path "/html/body/div[3]/li[1]/a[1]" also exists, the three paths are reduced to "/html/body/div[*]/li[*]/a[1]". By contrast, no two of the three XPath paths "/html/body/div[1]", "/html/body/span[1]" and "/html/body/div[1]/span[1]" are reduced together.
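The XPath reduction rule of step 5 can be sketched as follows. This is an illustrative reading of the rule, assuming that paths are grouped by their sequence of element names and that an index is replaced by `*` wherever the paths in a group disagree. The function names `parse` and `reduce_xpaths` are hypothetical. Note that `[*]` is the patent's wildcard notation for "any index", not a standard XPath predicate.

```python
import re
from collections import defaultdict

# One path step: an element name with an optional numeric index, e.g. "div[4]".
STEP = re.compile(r"^([^\[\]]+)(?:\[(\d+)\])?$")


def parse(xpath):
    """Split "/html/body/div[4]" into [("html", None), ("body", None), ("div", "4")]."""
    steps = []
    for part in xpath.strip("/").split("/"):
        match = STEP.match(part)
        if match is None:
            raise ValueError(f"unsupported XPath step: {part!r}")
        steps.append((match.group(1), match.group(2)))
    return steps


def reduce_xpaths(xpaths):
    """Merge paths whose element names are identical, wildcarding differing indices."""
    groups = defaultdict(list)
    for xpath in xpaths:
        steps = parse(xpath)
        # Paths are only reduced together if they pass through the same element names.
        groups[tuple(name for name, _ in steps)].append(steps)

    reduced = []
    for names, members in groups.items():
        out = []
        for i, name in enumerate(names):
            indices = {steps[i][1] for steps in members}
            if indices == {None}:
                out.append(name)                      # step never carried an index
            elif len(indices) == 1:
                out.append(f"{name}[{indices.pop()}]")  # all paths agree on the index
            else:
                out.append(f"{name}[*]")              # indices differ: generalize
        reduced.append("/" + "/".join(out))
    return reduced
```

Applied to the examples in the text, the two `li` paths under `div[4]` collapse to one pattern, adding the `div[3]` path widens the pattern further, and the three paths with different element names are left untouched.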
As shown in Fig. 2, the crawling steps of the present invention are as follows:
i) Obtain the Index page from the supplied URL of the Index page, save it, and add it to the pending queue. The pending queue changes dynamically: processed pages are removed and new pages are added.
ii) Judge whether the pending queue is empty. If it is empty, the crawl is complete and the crawling process exits. Otherwise, select an arbitrary page from the pending queue, use page similarity to obtain the type of the current page, and, according to the XPath path features extracted in the training part described above, determine which events the current page's type needs to trigger on which page elements.
iii) Judge whether the current page has any event that has not yet been triggered. If not, go to step ii). Otherwise trigger the event and judge whether the current page has changed and whether the changed page is a new page. If the page has not changed or is not a new page, go to step v); otherwise continue with step iv).
iv) Save the new page and add it to the pending queue of step i).
v) Roll back to the page state before the event was triggered, and go to step iii).
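The crawling steps above can be sketched as a queue-driven loop. This is a simplified model under stated assumptions: pages are identified by hashable ids, the browser is abstracted into caller-supplied functions (`load_page`, `classify`, `events_for_type` and `trigger` are hypothetical names), and rollback is implicit because `trigger` does not mutate the page. A FIFO queue is used so that the traversal is breadth-first, as in the embodiment.

```python
from collections import deque


def crawl(index_id, load_page, classify, events_for_type, trigger):
    """Crawl from the Index page, firing only the events selected for each page type.

    load_page       -- returns the page object for a page id
    classify        -- returns the page type of a page (e.g. via page similarity)
    events_for_type -- returns the (event_type, xpath) pairs to fire on this page
    trigger         -- fires one event; returns the id of the resulting page,
                       or None if the page did not change
    Returns the set of saved page ids and the number of events fired.
    """
    saved = {index_id}                  # step i: save the Index page ...
    queue = deque([index_id])           # ... and add it to the pending queue
    fired = 0
    while queue:                        # step ii: an empty queue ends the crawl
        page = load_page(queue.popleft())
        page_type = classify(page)
        for event_type, xpath in events_for_type(page_type, page):   # step iii
            fired += 1
            new_id = trigger(page, event_type, xpath)
            if new_id is not None and new_id not in saved:
                saved.add(new_id)       # step iv: save the new page and enqueue it
                queue.append(new_id)
            # step v: rollback is implicit, since `trigger` leaves `page` untouched
    return saved, fired


# Toy site: three comment pages linked by "prev"/"next" elements (illustrative only).
SITE = {
    "p1": {"next": "p2"},
    "p2": {"prev": "p1", "next": "p3"},
    "p3": {"prev": "p2"},
}

pages, fired = crawl(
    "p1",
    load_page=SITE.__getitem__,
    classify=lambda page: "comment",
    events_for_type=lambda page_type, page: [("onclick", x) for x in page],
    trigger=lambda page, event_type, xpath: page[xpath],
)
```

In a real crawler, `trigger` would drive a scriptable browser and page identity would be decided by comparing page content rather than ids; both are outside the scope of this sketch.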
The following embodiment uses the comment pages of NetEase news to illustrate the use and the advantages of the method of the present invention.
After loading, a NetEase news comment page requests its comment data from the server, and page turning is controlled by the "onclick" event of "<a>" tags. Throughout this embodiment only the "onclick" event of "<a>" tags is considered, and every comment page is a needed page.
First, two comment pages are supplied as training samples. The "onclick" events on all "<a>" tags are triggered in turn, and it is found that only the page elements holding the page numbers, "previous page" and "next page" lead to the required comment pages when triggered. Training summarizes the features of these page elements.
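Once training has produced reduced XPath patterns for such elements, the crawler needs a way to decide which elements of a newly loaded page match them. A minimal sketch follows, assuming the `[*]` wildcard stands for any numeric index and that candidate elements are described by their absolute XPaths. The names `pattern_to_regex` and `select_elements` are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Turn a reduced pattern such as "/html/body/div[*]/li[*]/a[1]" into a regex."""
    escaped = re.escape(pattern)
    # After escaping, the wildcard "[*]" reads "\[\*\]"; let it match any index.
    return re.compile(escaped.replace(r"\[\*\]", r"\[\d+\]"))


def select_elements(element_xpaths, patterns):
    """Keep only the element XPaths that match at least one trained pattern."""
    regexes = [pattern_to_regex(p) for p in patterns]
    return [xp for xp in element_xpaths
            if any(rx.fullmatch(xp) for rx in regexes)]
```

Only the elements returned by `select_elements` would have their events fired during the crawl; every other "<a>" tag on the page is skipped.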
Second, during crawling, this embodiment triggers only the "onclick" events on these page elements. The comments of three NetEase news items were selected at random, and crawling with the method of the present invention was compared with crawling without it. The number of events that must be triggered for a complete crawl is shown in Table 1:
Table 1. Number of events triggered
News item   Comment pages   Events triggered without the method   Events triggered with the method
News 1      7               1114                                  78
News 2      8               1698                                  90
News 3      11              1609                                  156
This embodiment considers only the "onclick" event of "<a>" tags. If other tags and other events are also considered, the advantage of the method of the present invention over other methods becomes even more pronounced.
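The size of the saving can be read directly off Table 1. The short calculation below uses only the figures from the table; the helper name `reduction_factor` is introduced here for illustration.

```python
# Figures copied from Table 1:
# (comment pages, events without the method, events with the method)
TABLE_1 = {
    "News 1": (7, 1114, 78),
    "News 2": (8, 1698, 90),
    "News 3": (11, 1609, 156),
}


def reduction_factor(without, with_method):
    """How many times fewer events are fired when the method is used."""
    return without / with_method


for name, (pages, without, with_method) in TABLE_1.items():
    factor = reduction_factor(without, with_method)
    share = 100 * with_method / without
    print(f"{name}: {factor:.1f}x fewer events "
          f"({share:.1f}% of the exhaustive count, {pages} comment pages)")
```

For the three news items this works out to roughly 14.3, 18.9 and 10.3 times fewer events, so the method fires between about 5% and 10% of the events that exhaustive clicking would require.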

Claims (5)

1. A method for rapidly collecting dynamic script website data, characterized in that it comprises the following steps:
i) obtaining and saving the Index page, and adding the Index page to the pending queue;
ii) judging whether the pending queue is empty; if it is empty, the crawl is complete and the crawling process exits; otherwise selecting an arbitrary page from the pending queue, using page similarity to obtain the type of the current page, and, according to the XPath path features extracted in the training step, determining which events said type needs to trigger on which page elements;
iii) judging whether the current page has any event that has not yet been triggered; if not, going to step ii); otherwise triggering the event and judging whether the current page has changed and whether the changed page is a new page; if the page has not changed or is not a new page, going to step v); otherwise continuing with step iv);
iv) saving the new page and adding the new page to the pending queue of step i);
v) rolling back to the page state before the event was triggered, and going to step iii).
2. The method for rapidly collecting dynamic script website data according to claim 1, characterized in that in step ii) the training step comprises:
1) manually supplying pages of every type to form the training sample set;
2) judging whether the training sample set is empty; if it is empty, going to step 5); otherwise selecting a page from the training sample set;
3) judging whether the current page still has any event that has not been triggered; if not, going to step 2); otherwise triggering the current event and recording the event information of the current event;
4) rolling back to the page state before the event was triggered, and going to step 3);
5) extracting, from the event information, the XPath path features and the event types for the events that each type of page needs to trigger.
3. The method for rapidly collecting dynamic script website data according to claim 1, characterized in that in step 3) the event information of the current event comprises the event type, the page element on which the event was triggered, whether the page changed, and the type of the page after the change.
4. The method for rapidly collecting dynamic script website data according to claim 2, characterized in that in step 3) the event information of the current event comprises the event type, the page element on which the event was triggered, whether the page changed, and the type of the page after the change.
5. The method for rapidly collecting dynamic script website data according to claim 1, 2, 3 or 4, characterized in that in step 5) the XPath path features are extracted as follows: XPath paths may be reduced together only if the names of the page elements they pass through are identical, and the reduction is applied to the ordinal indices of the page elements.
CN2009100902885A 2009-08-04 2009-08-04 Method for rapidly collecting dynamic script website data Expired - Fee Related CN101625692B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100902885A CN101625692B (en) 2009-08-04 2009-08-04 Method for rapidly collecting dynamic script website data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100902885A CN101625692B (en) 2009-08-04 2009-08-04 Method for rapidly collecting dynamic script website data

Publications (2)

Publication Number Publication Date
CN101625692A true CN101625692A (en) 2010-01-13
CN101625692B CN101625692B (en) 2012-05-23

Family

ID=41521543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100902885A Expired - Fee Related CN101625692B (en) 2009-08-04 2009-08-04 Method for rapidly collecting dynamic script website data

Country Status (1)

Country Link
CN (1) CN101625692B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365919A (en) * 2012-04-09 2013-10-23 纽海信息技术(上海)有限公司 Webpage analysis container and method
CN106126747A (en) * 2016-07-14 2016-11-16 北京邮电大学 Data capture method based on reptile and device
CN106371842A (en) * 2016-08-31 2017-02-01 北京字节跳动科技有限公司 Webpage conversion tracking method and system
CN108460008A (en) * 2018-03-20 2018-08-28 深圳中兴网信科技有限公司 Document generation method, system, computer equipment and readable storage medium storing program for executing

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365919A (en) * 2012-04-09 2013-10-23 纽海信息技术(上海)有限公司 Webpage analysis container and method
CN103365919B (en) * 2012-04-09 2018-07-31 北京京东尚科信息技术有限公司 Web analysis container and method
CN106126747A (en) * 2016-07-14 2016-11-16 北京邮电大学 Data capture method based on reptile and device
CN106371842A (en) * 2016-08-31 2017-02-01 北京字节跳动科技有限公司 Webpage conversion tracking method and system
CN106371842B (en) * 2016-08-31 2018-10-23 北京字节跳动科技有限公司 A kind of webpage conversion tracking and system
CN108460008A (en) * 2018-03-20 2018-08-28 深圳中兴网信科技有限公司 Document generation method, system, computer equipment and readable storage medium storing program for executing
CN108460008B (en) * 2018-03-20 2022-05-06 深圳中兴网信科技有限公司 Bill generation method, system, computer equipment and readable storage medium

Also Published As

Publication number Publication date
CN101625692B (en) 2012-05-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120523

Termination date: 20140804

EXPY Termination of patent right or utility model