CN101625692A

CN101625692A - Method for rapidly collecting dynamic script website data

Info

Publication number: CN101625692A
Application number: CN200910090288A
Authority: CN
Inventors: 夏冰; 高军; 王腾蛟; 杨冬青
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2009-08-04
Filing date: 2009-08-04
Publication date: 2010-01-13
Anticipated expiration: 2029-08-04
Also published as: CN101625692B

Abstract

The invention relates to a method for rapidly collecting dynamic script website data, comprising the following steps: (i) obtaining and saving an Index page and adding the Index page to a pending queue; (ii) judging whether the pending queue is empty; if so, crawling is finished and the crawl process exits; otherwise, selecting any page from the pending queue, obtaining the type of the current page by page similarity, and determining, from the XPath features extracted in the training step, which events need to be triggered on which page elements for that type; (iii) judging whether the current page has an event that has not yet been triggered; if not, going to step (ii); otherwise triggering the event and judging whether the current page has changed and whether the changed page is a new page; if the page has not changed or is not a new page, going to step (v); otherwise continuing with step (iv); (iv) saving the new page and adding it to the pending queue of step (i); (v) rolling back to the page state before the event was triggered and going to step (iii).

Description

A method for rapidly collecting dynamic script website data

Technical field

The present invention relates to a network data collection method, and in particular to a method for rapidly collecting dynamic script website data.

Background technology

With the arrival of the Web 2.0 era, more and more servers and clients on the Internet interact by means of dynamic scripts. Web page content, formerly static, is now generated dynamically from data fetched from databases. On the one hand, after the main page of a web page has been downloaded locally, several further interactions with the server are needed before all the data is available; for example, the read count and comment count of a Sina blog are requested from the server only after the page has loaded. On the other hand, many links in web content are no longer traditional "<a>" tags but are implemented in JavaScript; for example, page turning in the Tencent forums and the NetEase forums is controlled by JavaScript.

Crawling is the first step of a search engine and the basis of Internet data analysis. Traditional crawlers have inherent deficiencies when dealing with such websites: mainstream search engines generally handle dynamic script websites in one of three ways, namely avoiding them, hard-coding site-specific handling, or relying on a search engine interface that some websites retain. Research institutions have proposed simulating user behavior by clicking every page element of a page in turn to crawl dynamic script websites, but this approach is very slow and unsuitable for practical use.

Summary of the invention

In view of the above problems, the object of the present invention is to provide a method for rapidly collecting dynamic script website data that triggers only specific events on specific page elements, thereby greatly reducing the number of events that need to be triggered and improving the collection speed for dynamic script website data.

To achieve the above object, the present invention adopts the following technical scheme: a method for rapidly collecting dynamic script website data, characterized in that it comprises the following steps: i) obtain and save the Index page, and add the Index page to the pending queue; ii) judge whether the pending queue is empty; if it is empty, crawling is finished and the crawl process exits; otherwise select any page from the pending queue, use page similarity to obtain the type of the current page, and determine, from the XPath path features extracted in the training step, which events need to be triggered on which page elements for that type; iii) judge whether the current page has an event that has not yet been triggered; if not, jump to step ii); otherwise trigger the event and judge whether the current page has changed and whether the changed page is a new page; if the page has not changed or is not a new page, jump to step v); otherwise continue with step iv); iv) save the new page and add the new page to the pending queue of step i); v) roll back to the page state before the event was triggered, and jump to step iii).

In step ii), the training step comprises: 1) pages of each type are provided manually to form the training sample set; 2) judge whether the training sample set is empty; if it is empty, jump to step 5); otherwise select a page from the training sample set; 3) judge whether the current page still has an event that has not been triggered; if not, jump to step 2); otherwise trigger the current event and record the event information of the current event; 4) roll back to the page state before the event was triggered, and jump to step 3); 5) from the recorded event information, extract the XPath path features and the event types of the events that each type of page needs to trigger.

In step 3), the event information comprises the event type, the page element on which the event is triggered, whether the page changes, and the type of the page after the change.
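The training loop and the recorded event information described above can be sketched as follows. This is a minimal simulation under stated assumptions, not the patent's implementation: the `EventRecord` class, the `train` function, the dictionary layout of a sample page, and the `simulated_trigger` lookup table are all hypothetical names introduced for illustration. In a real crawler, triggering and rollback would be performed by a browser engine. The `page_type` field is an addition so that records can later be grouped by the type of the page they came from.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class EventRecord:
    page_type: str               # type of the page the event was triggered on (added for grouping)
    event_type: str              # e.g. "onclick"
    xpath: str                   # page element on which the event was triggered
    changed: bool                # whether the page changed
    result_type: Optional[str]   # type of the page after the change, None if unchanged


def train(samples: list, trigger: Callable) -> list:
    """Steps 1) to 4): trigger every event on every sample page and record it."""
    records = []
    pending = list(samples)                        # step 1: manually provided samples
    while pending:                                 # step 2: stop when the set is empty
        page = pending.pop()
        for event_type, xpath in page["events"]:   # step 3: each untriggered event
            result_type = trigger(page, event_type, xpath)
            records.append(EventRecord(page["type"], event_type, xpath,
                                       result_type is not None, result_type))
            # step 4: rollback; the simulated trigger never mutates the page,
            # so nothing has to be restored here.
    return records


# Hypothetical outcomes of triggering "onclick" on three elements of a sample page.
OUTCOMES = {
    "/html/body/div[4]/li[1]/a[1]": "comment",
    "/html/body/div[4]/li[2]/a[1]": "comment",
    "/html/body/div[1]/a[1]": None,            # click changes nothing
}


def simulated_trigger(page, event_type, xpath):
    return OUTCOMES.get(xpath)


sample = {"type": "comment", "events": [("onclick", x) for x in OUTCOMES]}
records = train([sample], simulated_trigger)
# Elements whose events lead to a needed page type; these feed step 5).
useful = [r.xpath for r in records if r.result_type == "comment"]
```

The `useful` list is the input to the XPath feature extraction of step 5).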

In step 5), the XPath path features are extracted as follows: XPath paths are reduced together only if the names of the page elements along the paths are identical, and the reduction is applied to the sequence numbers (indices) of the page elements.

By adopting the above technical scheme, the present invention has the following advantages: 1. Because pages generated from the same template are similar, it is possible to predict what kind of page an event such as a click on a given page element will lead to; the page type obtained through page similarity therefore determines the type of the current page and which events need to be triggered on which page elements. 2. The invention reduces the XPaths of the page elements that lead to the needed pages and extracts their features and event types, so that only specific events on specific page elements are triggered. This greatly reduces the number of simulated user operations and improves the collection speed for dynamic script website data. The present invention can be applied to the web crawling work of various web search engines.
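As an illustration of how reduced XPath features restrict which elements are acted on, the sketch below filters a page's candidate elements against a reduced pattern. It assumes that a reduced feature is written with `[*]` in place of a generalized index, as in the examples of the embodiment; the function names `pattern_to_regex` and `select_elements` and the candidate paths are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Turn a reduced XPath such as /html/body/div[*]/a[1] into a regex.

    Every literal character is escaped; only the generalized index "[*]"
    is replaced by "any positive integer index".
    """
    escaped = re.escape(pattern)
    return re.compile(escaped.replace(re.escape("[*]"), r"\[\d+\]"))


def select_elements(candidates: list, patterns: list) -> list:
    """Keep only the element XPaths that fully match at least one pattern."""
    regexes = [pattern_to_regex(p) for p in patterns]
    return [x for x in candidates if any(r.fullmatch(x) for r in regexes)]


candidates = [
    "/html/body/div[4]/li[7]/a[1]",   # matches: only indices differ
    "/html/body/div[1]/a[1]",         # different element names along the path
    "/html/body/div[2]/li[1]/a[2]",   # the last index is fixed to 1 in the pattern
]
selected = select_elements(candidates, ["/html/body/div[*]/li[*]/a[1]"])
```

Only the elements in `selected` would have their events triggered during crawling; all other elements on the page are skipped.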

Description of drawings

Fig. 1 is a flow chart of the training steps performed before data crawling in the present invention.

Fig. 2 is a flow chart of the steps of the breadth-first crawl method of the present invention.

Embodiment

The present invention is described in detail below with reference to the drawings and an embodiment.

In contrast to the prior-art approaches of not handling dynamic script websites or hard-coding them, the implementation of the method of the present invention comprises two parts: the first is training and the second is crawling. Through training based on page similarity, it becomes known which events each type of page should trigger on which page elements. Crawling can begin once training is finished. The crawl process of the present invention may adopt various crawl strategies. In the breadth-first crawl method of this embodiment, the crawler rolls back to the original page after each event is triggered, and moves on to other pages only after all events that need to be triggered on the original page have been triggered.
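The text does not specify how page similarity is measured. Purely as an illustration, the sketch below assumes one possible measure: the Jaccard similarity between the sets of tag paths of two pages, with the current page assigned the type of the most similar sample. The class `TagPathCollector`, the functions `tag_paths`, `jaccard` and `classify`, the `VOID` tag list, and the sample HTML are all assumptions made for this example, not part of the described method.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; a small assumed subset.
VOID = {"br", "img", "input", "hr", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of tag paths (e.g. "html/body/div") found in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating unclosed children.
            while self.stack.pop() != tag:
                pass


def tag_paths(html: str) -> set:
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def classify(html: str, samples: dict) -> str:
    """Return the type whose sample page is most similar to the given page."""
    paths = tag_paths(html)
    return max(samples, key=lambda t: jaccard(paths, tag_paths(samples[t])))


samples = {
    "comment": "<html><body><div><ul><li><a>1</a></li></ul></div></body></html>",
    "article": "<html><body><h1>t</h1><p>x</p></body></html>",
}
page = ("<html><body><div><ul><li><a>2</a></li><li><a>3</a></li></ul>"
        "</div></body></html>")
page_type = classify(page, samples)
```

Two pages built from the same template share most of their tag paths even when the number of repeated items differs, which is why a set-based measure is a plausible stand-in here.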

As shown in Fig. 1, the training steps of the present invention are as follows:

1) Pages of each type are provided manually to form the training sample set.

2) Judge whether the training sample set is empty; if it is empty, jump to step 5); otherwise select a page from the training sample set.

3) Judge whether the current page still has an event that has not been triggered; if not, jump to step 2); otherwise trigger the current event and record event information such as the event type of the current event, the page element on which the event is triggered, whether the page changes, and the type of the page after the change.

4) Roll back to the page state before the event was triggered, and jump to step 3).

5) From the recorded event information, extract the XML Path Language (XPath) features of the page elements and the event types of the events that each type of page needs to trigger. The XPath path features are extracted as follows: XPath paths are reduced together only if the names of the page elements along the paths are identical, and the reduction is applied to the sequence numbers (indices) of the page elements. For example, the two XPath paths "/html/body/div[4]/li[1]/a[1]" and "/html/body/div[4]/li[2]/a[1]" are reduced to "/html/body/div[4]/li[*]/a[1]"; if the XPath path "/html/body/div[3]/li[1]/a[1]" also exists, the three XPath paths are reduced to "/html/body/div[*]/li[*]/a[1]". However, for the three XPath paths "/html/body/div[1]", "/html/body/span[1]" and "/html/body/div[1]/span[1]", no two of them are reduced together, because the element names along the paths differ.
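The reduction rule of step 5) can be sketched as below. This is one possible reading, stated as an assumption: paths are grouped by their sequence of element names, and within a group an index is kept if all paths agree on it and replaced by `*` otherwise. The function names `parse` and `generalize` are hypothetical.

```python
import re
from collections import defaultdict

# One XPath step: an element name with an optional numeric index, e.g. "li[2]".
STEP = re.compile(r"([^\[\]/]+)(?:\[(\d+)\])?")


def parse(xpath: str) -> list:
    """Split an absolute XPath into (name, index) pairs; index may be None."""
    steps = []
    for raw in xpath.strip("/").split("/"):
        match = STEP.fullmatch(raw)
        if match is None:
            raise ValueError(f"unsupported XPath step: {raw!r}")
        steps.append((match.group(1), match.group(2)))
    return steps


def generalize(xpaths: list) -> list:
    """Reduce paths that share the same element names; only indices are reduced."""
    groups = defaultdict(list)
    for xpath in xpaths:
        steps = parse(xpath)
        names = tuple(name for name, _ in steps)
        groups[names].append([index for _, index in steps])

    reduced = []
    for names, index_rows in groups.items():
        parts = []
        for position, name in enumerate(names):
            seen = {row[position] for row in index_rows}
            if seen == {None}:
                parts.append(name)                      # step had no index
            elif len(seen) == 1:
                parts.append(f"{name}[{seen.pop()}]")   # all paths agree
            else:
                parts.append(f"{name}[*]")              # indices differ
        reduced.append("/" + "/".join(parts))
    return reduced


two = generalize(["/html/body/div[4]/li[1]/a[1]",
                  "/html/body/div[4]/li[2]/a[1]"])
three = generalize(["/html/body/div[4]/li[1]/a[1]",
                    "/html/body/div[4]/li[2]/a[1]",
                    "/html/body/div[3]/li[1]/a[1]"])
unrelated = generalize(["/html/body/div[1]",
                        "/html/body/span[1]",
                        "/html/body/div[1]/span[1]"])
```

With the example paths from step 5), `two` and `three` each collapse to a single pattern, while `unrelated` comes back unchanged because no two of its paths share the same element names.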

As shown in Fig. 2, the crawl steps of the present invention are as follows:

i) Obtain and save the Index page from the supplied URL of the Index page, and add the Index page to the pending queue. In the present invention the pending queue changes dynamically: pages that have been processed are removed and new pages are added.

ii) Judge whether the pending queue is empty; if it is empty, crawling is finished and the crawl process exits; otherwise select any page from the pending queue, use page similarity to obtain the type of the current page, and determine, from the XPath path features extracted in the training part described above, which events need to be triggered on which page elements for the type of the current page.

iii) Judge whether the current page has an event that has not yet been triggered; if not, jump to step ii); otherwise trigger the event and judge whether the current page has changed and whether the changed page is a new page; if the page has not changed or is not a new page, jump to step v); otherwise continue with step iv).

iv) Save the new page and add the new page to the pending queue of step i).

v) Roll back to the page state before the event was triggered, and jump to step iii).
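Steps i) to v) can be sketched as a queue-driven loop over a simulated site. Everything here is an assumption made for illustration: the `site` dictionary stands in for a browser engine, each page stores its type directly instead of deriving it by page similarity, and the names `matches`, `crawl`, `site` and `rules` are hypothetical.

```python
import re
from collections import deque


def matches(pattern: str, xpath: str) -> bool:
    """True if a concrete XPath fits a reduced pattern that may contain [*]."""
    regex = re.escape(pattern).replace(re.escape("[*]"), r"\[\d+\]")
    return re.fullmatch(regex, xpath) is not None


def crawl(site: dict, index_id: str, rules: dict):
    saved = {index_id}              # step i: save the Index page ...
    pending = deque([index_id])     # ... and add it to the pending queue
    triggered = 0
    while pending:                  # step ii: finished when the queue is empty
        page = site[pending.popleft()]
        wanted = rules.get(page["type"], [])   # events required for this page type
        for (event_type, xpath), target in page["events"].items():   # step iii
            if not any(event_type == e and matches(p, xpath) for e, p in wanted):
                continue            # not a learned event: never triggered
            triggered += 1
            if target is not None and target not in saved:
                saved.add(target)       # step iv: save the new page ...
                pending.append(target)  # ... and queue it for processing
            # step v: rollback; the simulation never mutates a page,
            # so there is no state to restore.
    return saved, triggered


PAGER_1 = ("onclick", "/html/body/div[4]/li[1]/a[1]")
PAGER_2 = ("onclick", "/html/body/div[4]/li[2]/a[1]")
OTHER = ("onclick", "/html/body/div[1]/a[1]")

# Hypothetical three-page comment thread; values are the page an event leads to.
site = {
    "c1": {"type": "comment", "events": {PAGER_1: "c2", OTHER: None}},
    "c2": {"type": "comment", "events": {PAGER_1: "c1", PAGER_2: "c3", OTHER: None}},
    "c3": {"type": "comment", "events": {PAGER_1: "c2"}},
}
rules = {"comment": [("onclick", "/html/body/div[4]/li[*]/a[1]")]}

saved, triggered = crawl(site, "c1", rules)
total_events = sum(len(p["events"]) for p in site.values())
```

In this toy site all three pages are saved after triggering 4 of the 6 available events; the two events on the unrelated element are never triggered.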

The comment pages of NetEase news are taken below as an embodiment to illustrate the use and advantages of the method of the present invention.

After a NetEase news comment page has loaded, it requests the comment data from the server, and page turning is controlled by the "onclick" event of "<a>" tags. In this embodiment only the "onclick" event of "<a>" tags is considered throughout, and every comment page is a needed page.

First, two comment pages are provided as training samples, and the "onclick" events on all "<a>" tags are triggered in turn. It is found that only triggering the page elements for the page numbers, "previous page" and "next page" leads to the required comment pages. The features of these page elements are summarized through training.

Second, during crawling, this embodiment triggers only the "onclick" events on these page elements. The comments of three randomly selected NetEase news articles were crawled both with and without the method of the present invention; the number of events that must be triggered for a complete crawl is shown in Table 1:

Table 1: Number of events triggered

	Number of comment pages	Events triggered without the method of the invention	Events triggered with the method of the invention
News 1	7	1114	78
News 2	8	1698	90
News 3	11	1609	156
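The size of the saving can be made explicit with a short calculation on the three reported rows. The variable and function names are hypothetical; the figures are those of Table 1.

```python
# (comment pages, events without the method, events with the method)
TABLE_1 = {
    "News 1": (7, 1114, 78),
    "News 2": (8, 1698, 90),
    "News 3": (11, 1609, 156),
}


def reduction_percent(without: int, with_method: int) -> float:
    """Share of triggered events avoided by the method, in percent."""
    return round((1 - with_method / without) * 100, 1)


reductions = {name: reduction_percent(without, with_method)
              for name, (_, without, with_method) in TABLE_1.items()}
```

For these three articles the method avoids roughly 90% to 95% of the events that would otherwise be triggered.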

This embodiment considers only the "onclick" event of "<a>" tags. If other tags and other events are also considered, the advantage of the method of the present invention over other methods becomes even more pronounced.

Claims

1. A method for rapidly collecting dynamic script website data, characterized in that it comprises the following steps:

i) obtain and save the Index page, and add the Index page to the pending queue;

ii) judge whether the pending queue is empty; if it is empty, crawling is finished and the crawl process exits; otherwise select any page from the pending queue, use page similarity to obtain the type of the current page, and determine, from the XPath path features extracted in the training step, which events need to be triggered on which page elements for that type;

iii) judge whether the current page has an event that has not yet been triggered; if not, jump to step ii); otherwise trigger the event and judge whether the current page has changed and whether the changed page is a new page; if the page has not changed or is not a new page, jump to step v); otherwise continue with step iv);

iv) save the new page and add the new page to the pending queue of step i);

2. The method for rapidly collecting dynamic script website data as claimed in claim 1, characterized in that, in step ii), the training step comprises:

1) pages of each type are provided manually to form the training sample set;

2) judge whether the training sample set is empty; if it is empty, jump to step 5); otherwise select a page from the training sample set;

3) judge whether the current page still has an event that has not been triggered; if not, jump to step 2); otherwise trigger the current event and record the event information of the current event;

4) roll back to the page state before the event was triggered, and jump to step 3);

5) from the event information, extract the XPath path features and the event types of the events that each type of page needs to trigger.

3. The method for rapidly collecting dynamic script website data as claimed in claim 1, characterized in that, in step 3), the event information comprises the event type, the page element on which the event is triggered, whether the page changes, and the type of the page after the change.

4. The method for rapidly collecting dynamic script website data as claimed in claim 2, characterized in that, in step 3), the event information comprises the event type, the page element on which the event is triggered, whether the page changes, and the type of the page after the change.

5. The method for rapidly collecting dynamic script website data as claimed in claim 1, 2, 3 or 4, characterized in that, in step 5), the XPath path features are extracted as follows: XPath paths are reduced together only if the names of the page elements along the paths are identical, and the reduction is applied to the sequence numbers (indices) of the page elements.