CN101452463A - Method and apparatus for directionally grabbing page resource

Info

Publication number: CN101452463A
Application number: CNA200710196590XA
Authority: CN
Inventors: 郑小林; 陈德人; 周涛; 叶勤勇
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2007-12-05
Filing date: 2007-12-05
Publication date: 2009-06-10

Abstract

The invention discloses a method for directed crawling of web page resources. The method comprises the following steps: pre-crawling, according to seed site URLs, pages whose number satisfies a quantity threshold; determining feature pages among the pre-crawled pages; generating regular expressions that generalize the URLs of the feature pages; matching the seed site URLs against the regular expressions and retaining the seed site URLs that satisfy the matching condition as crawl target URLs; and crawling pages according to the crawl target URLs. The method can effectively improve the harvest rate and recall of web page crawling, thereby better helping people obtain the information they need from the Internet on a large scale, efficiently and accurately.

Description

Method and apparatus for directed crawling of page resources

Technical field

The present invention relates to the field of Internet resource collection, and in particular to a method for directed crawling of page resources and an apparatus for directed crawling of page resources.

Background technology

With the rapid development of networks, the World Wide Web (WWW) has become the carrier of a vast amount of information. To extract and use this information effectively, search engines, as tools that assist people in retrieving information, have become the entry point and guide for users accessing the WWW. A search engine's function of automatically fetching pages from the WWW is implemented by web crawlers.

Current web crawlers can be divided into general-purpose crawlers and focused crawlers. A general-purpose crawler is based on breadth-first search (BFS): starting from the URLs (Uniform Resource Locators) of one or several initial pages, it obtains the URLs on the initial pages and, while crawling, continually extracts new URLs from the current page and puts them into a queue until a stop condition of the system is satisfied. A focused crawler is a program that automatically downloads web pages for the directed crawling of relevant page resources. It selectively visits pages and related links on the WWW according to a given crawl target and obtains the required information. Unlike a general-purpose crawler, a focused crawler does not pursue broad coverage; instead, it aims to crawl pages related to a particular topic in order to prepare data resources for topic-oriented user queries. Focused crawlers are therefore the mainstream direction of current web crawler development.

Typically, a focused crawler works as follows: it filters out links irrelevant to the topic according to some web page analysis algorithm, retains the useful links and puts them into a queue of URLs waiting to be crawled; it then selects the next page URL to crawl from the queue according to some search strategy, and repeats this process until a stop condition of the system is reached. In addition, all crawled pages are stored by the system, analyzed and filtered, and indexed for later query and retrieval; for a focused crawler, the analysis results obtained in this process may also provide feedback and guidance for subsequent crawling. Clearly, how to guide a focused crawler to concentrate on pages relevant to the topic, while avoiding topic-irrelevant pages as far as possible, has become the key problem that focused crawler technology needs to solve.

One prior-art solution borrows the link analysis techniques of general search engines: it ranks URLs by importance and crawls the most important pages first. In their paper (Junghoo 1998), Junghoo Cho et al. rank URLs by the PageRank (Sergey 1998) values of the pages and preferentially crawl pages with high PageRank values. However, this method only computes the importance of the pages to be crawled and does not consider their relevance to the particular topic, so a focused crawler guided by this algorithm easily goes astray, and few of the crawled pages are relevant to the topic. If the harvest rate (yield rate) is defined as the proportion of topic-relevant pages among all pages crawled by the focused crawler, the harvest rate of this solution is low.

Another prior-art solution is based on the topical locality of Internet pages found by Davison (Davison 2000), namely that two pages connected by a hyperlink are more similar than two random pages. It uses this phenomenon to predict the relevance of a page to be crawled to the particular topic: if a page is relevant to the topic, the pages it links to are more likely to be relevant as well, so following the out-links of topic-relevant pages is more likely to reach other topic-relevant pages. However, predicting the relevance of a child page only from the relevance of its parent page, and guiding the focused crawler accordingly, inevitably misses many pages relevant to the topic. If recall is defined as the ratio of the topic-relevant pages crawled by the focused crawler to all topic-relevant pages on the Internet, the recall of this solution is low.

Therefore, how to improve the harvest rate and recall of page crawling by web crawlers, so as to help people obtain the information they need from the Internet on a large scale, efficiently and accurately, has become one of the technical problems that those skilled in the art urgently need to solve.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for directed crawling of page resources that improves the harvest rate and recall of page crawling, thereby helping people obtain the information they need from the Internet on a large scale, efficiently and accurately.

Another object of the present invention is to provide an apparatus for directed crawling of page resources, so as to ensure the implementation and practical application of the above method.

To solve the above technical problem, an embodiment of the present invention provides a method for directed crawling of page resources, comprising:

pre-crawling, according to seed site URLs, pages whose number satisfies a quantity threshold;

determining feature pages among the pre-crawled pages;

generating regular expressions that generalize the URLs of the feature pages;

matching the seed site URLs against the regular expressions, and retaining the seed site URLs that satisfy the matching condition as crawl target URLs;

crawling pages according to the crawl target URLs.

Preferably, the step of generating the regular expressions comprises:

dividing the feature page URLs into a plurality of URL subsets;

clustering the URL subsets into a plurality of URL classes;

extracting a regular expression for each URL class.

Preferably, each URL comprises a site parameter and a path parameter, and the sub-step of dividing into a plurality of URL subsets comprises:

assigning URLs that have the same site parameter to the same URL subset;

assigning URLs whose path parameters contain the same number of directories to the same URL subset.

Preferably, each URL further comprises a query parameter, and the sub-step of dividing into a plurality of URL subsets further comprises:

assigning URLs that have the same query parameter to the same URL subset.
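As a non-limiting illustration, the division into URL subsets can be sketched in Python as follows. The function names `subset_key` and `partition_urls` are hypothetical, and reading "the same query parameter" as "the same set of query keys" is an assumption of this sketch rather than something the text specifies.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def subset_key(url):
    """Return (site, number of directories, sorted query keys) for a URL."""
    parts = urlsplit(url)
    # The last path component (e.g. view.asp) is counted as a directory here.
    directories = [d for d in parts.path.split("/") if d]
    # Assumption: two URLs share a "query parameter" when their key sets agree.
    query_keys = tuple(sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True)))
    return parts.netloc.lower(), len(directories), query_keys


def partition_urls(urls):
    """Group URLs so that every subset shares site, directory count and query keys."""
    subsets = defaultdict(list)
    for url in urls:
        subsets[subset_key(url)].append(url)
    return dict(subsets)
```

Combining the three criteria into one key means that a URL lands in a subset only with URLs that agree on all of them, which keeps the later clustering step confined to URLs of the same shape.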

Preferably, the sub-step of clustering into a plurality of URL classes comprises:

presetting a clustering rule for the URL classes;

reading a URL from a URL subset and judging whether the URL satisfies the clustering rule of an existing URL class; if so, assigning the URL to that URL class; if not, creating a new URL class from the URL.
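The assign-or-create procedure above can be sketched in Python as follows. This passage does not state the clustering rule itself, so the rule used here (a URL joins a class when it differs from the class's first URL in at most `max_diff` path or query-value tokens) is purely an illustrative assumption, as are the names `tokens`, `fits` and `cluster`.

```python
from urllib.parse import urlsplit, parse_qsl


def tokens(url):
    """Split a URL into its directories followed by its query values."""
    parts = urlsplit(url)
    directories = [d for d in parts.path.split("/") if d]
    values = [v for _, v in sorted(parse_qsl(parts.query, keep_blank_values=True))]
    return directories + values


def fits(url, url_class, max_diff):
    """Assumed clustering rule: at most max_diff tokens differ from the class's first URL."""
    representative = tokens(url_class[0])
    candidate = tokens(url)
    if len(representative) != len(candidate):
        return False
    return sum(a != b for a, b in zip(representative, candidate)) <= max_diff


def cluster(subset, max_diff=1):
    """Assign each URL to the first class whose rule it satisfies, else open a new class."""
    classes = []
    for url in subset:
        for url_class in classes:
            if fits(url, url_class, max_diff):
                url_class.append(url)
                break
        else:
            classes.append([url])
    return classes
```

In this sketch `max_diff` plays the role of an adjustable clustering rule: loosening it merges more URLs into each class, tightening it produces more, smaller classes.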

Preferably, the sub-step of clustering into a plurality of URL classes further comprises:

counting the number of URL classes and the total number of URLs;

adjusting the clustering rule of the URL classes according to these statistics.

Preferably, before the pre-crawling step, the method further comprises:

filtering the seed site URLs, the seed site URLs comprising the URLs of the seed sites and the URLs of their links.

Preferably, the step of filtering the seed site URLs comprises:

reading the seed site URLs into an array and sorting the array;

extracting the site parameter of a URL and judging whether the array contains that site parameter; if so, keeping the URL; if not, removing the URL.

Preferably, the quantity threshold ranges from 1000 to 5000.

Preferably, the pre-crawling step comprises:

writing the seed site URLs into a database;

reading a URL from the database and extracting the site parameter of the URL;

updating the URL crawl count corresponding to the site parameter;

judging whether the URL crawl count exceeds the quantity threshold, and if not, adding the URL to a URL fetch list;

downloading the pages corresponding to the URLs in the URL fetch list and generating the corresponding page data segment;

updating the database according to the page data segment.

Preferably, the feature pages comprise topic-relevant pages, and the topic-relevant pages comprise topic-relevant content pages and topic-relevant directory pages.

Preferably, the matching step comprises:

if a seed site URL matches a regular expression of the topic-relevant pages, the seed site URL satisfies the matching condition.

Preferably, the feature pages further comprise topic-irrelevant pages.

Preferably, the matching step further comprises:

if a seed site URL matches a regular expression of the topic-irrelevant pages, the seed site URL does not satisfy the matching condition.
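By way of illustration only, the two matching rules can be combined as in the following Python sketch. The function name `select_targets` is hypothetical, and checking the topic-irrelevant expressions first, so that they take precedence when both kinds of expression match, is an assumption of this sketch.

```python
import re


def select_targets(urls, relevant_patterns, irrelevant_patterns=()):
    """Keep the URLs that match a relevant pattern and no irrelevant pattern."""
    relevant = [re.compile(p) for p in relevant_patterns]
    irrelevant = [re.compile(p) for p in irrelevant_patterns]
    targets = []
    for url in urls:
        # Assumption: a match against a topic-irrelevant expression always excludes the URL.
        if any(rx.fullmatch(url) for rx in irrelevant):
            continue
        if any(rx.fullmatch(url) for rx in relevant):
            targets.append(url)
    return targets
```

URLs that match neither kind of expression are dropped as well, since only URLs that satisfy the matching condition are retained as crawl target URLs.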

Preferably, the method further comprises:

counting the number of URLs matched by each regular expression;

deleting a regular expression if the number of URLs it matches is less than a preset filtering threshold.
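A minimal Python sketch of this screening step follows. The function name `prune_patterns` and the parameter name `min_matches` (standing in for the preset filtering threshold) are hypothetical.

```python
import re


def prune_patterns(patterns, urls, min_matches):
    """Drop every regular expression that matches fewer than min_matches URLs."""
    kept = []
    for pattern in patterns:
        compiled = re.compile(pattern)
        count = sum(1 for url in urls if compiled.fullmatch(url))
        if count >= min_matches:
            kept.append(pattern)
    return kept
```

Expressions supported by only a handful of URLs are likely to reflect accidental similarities rather than a page template, which is why removing them can make the remaining set more representative.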

An embodiment of the present invention also discloses an apparatus for directed crawling of page resources, comprising:

an experimental crawler module, configured to pre-crawl, according to seed site URLs, pages whose number satisfies a quantity threshold;

a classification module, configured to determine feature pages among the pre-crawled pages;

a regular expression learning module, configured to generate regular expressions that generalize the URLs of the feature pages;

a matching module, configured to match the seed site URLs against the regular expressions and to retain the seed site URLs that satisfy the matching condition as crawl target URLs;

a focused crawler module, configured to crawl pages according to the crawl target URLs.

Preferably, the regular expression learning module comprises:

a set division submodule, configured to divide the feature page URLs into a plurality of URL subsets;

a clustering submodule, configured to cluster the URL subsets into a plurality of URL classes;

an extraction submodule, configured to extract a regular expression for each URL class.

Preferably, each URL comprises a site parameter and a path parameter, and the set division submodule comprises:

a first division unit, configured to assign URLs that have the same site parameter to the same URL subset;

a second division unit, configured to assign URLs whose path parameters contain the same number of directories to the same URL subset.

Preferably, each URL further comprises a query parameter, and the set division submodule further comprises:

a third division unit, configured to assign URLs that have the same query parameter to the same URL subset.

Preferably, the clustering submodule comprises:

a rule setting unit, configured to preset a clustering rule for the URL classes;

a processing unit, configured to read a URL from a URL subset and judge whether the URL satisfies the clustering rule of an existing URL class; if so, to assign the URL to that URL class; if not, to create a new URL class from the URL.

Preferably, the clustering submodule further comprises:

a statistics unit, configured to count the number of URL classes and the total number of URLs;

a rule adjustment unit, configured to adjust the clustering rule of the URL classes according to these statistics.

Preferably, the apparatus further comprises:

a seed filtering module, configured to filter the seed site URLs, the seed site URLs comprising the URLs of the seed sites and the URLs of their links.

Preferably, the seed filtering module comprises:

an array generation submodule, configured to read the seed site URLs into an array and sort the array;

a site filtering submodule, configured to extract the site parameter of a URL and judge whether the array contains that site parameter; if so, to keep the URL; if not, to remove the URL.

Preferably, the apparatus further comprises:

a quantity statistics module, configured to count the number of URLs matched by each regular expression;

a regular expression screening module, configured to delete a regular expression when the number of URLs it matches is less than a preset filtering threshold.

Compared with the prior art, the present invention has the following advantages:

First, the present invention automatically learns from the feature pages URL regular expressions that represent topic-relevant content pages, topic-relevant directory pages and topic-irrelevant pages, and then uses these URL regular expressions to guide the focused crawler, thereby effectively improving the harvest rate and recall of page crawling and better helping people obtain the information they need from the Internet on a large scale, efficiently and accurately;

Second, the present invention filters the link URLs of the seed sites so that only URLs belonging to the seed sites are retained, which reduces the crawling workload and effectively improves crawl efficiency and crawl accuracy;

Furthermore, the present invention uses an experimental crawler to pre-crawl pages up to a quantity threshold as the basis for learning the regular expressions, thereby controlling the crawl depth; this not only improves the efficiency of the experimental crawler but also effectively ensures that the learned regular expressions are representative;

In addition, the present invention uses regular expressions for text matching, which is efficient; and because the regular expressions are obtained by dividing the URLs into subsets, clustering them and then extracting the expressions, they are suitably representative, which further ensures the accuracy of the crawl targets;

Finally, the present invention can also screen the regular expressions according to URL count statistics, which further improves the representativeness of the regular expressions and thus the accuracy and effectiveness of page crawling.

Description of drawings

Fig. 1 is a flowchart of the steps of an embodiment of the method for directed crawling of page resources of the present invention;

Fig. 2 is a trend comparison chart of the topic-relevant pages crawled during the performance test;

Fig. 3 is a trend comparison chart of the harvest rate during the performance test;

Fig. 4 is a structural block diagram of an embodiment of the apparatus for directed crawling of page resources of the present invention;

Fig. 5 is a flowchart of the steps of directed crawling of page resources using the embodiment shown in Fig. 4.

Embodiment

To make the above objects, features and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

The present invention can be used in numerous general-purpose or special-purpose computing environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, and distributed computing environments that include any of the above systems or devices.

The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

One of the core ideas of the embodiments of the present invention comes from analyzing page resources on the Internet: most Web pages on the Internet are dynamic pages, formed by querying the corresponding data records from a database and filling a page template, and the query parameters are generally contained in the URL. The URLs of pages produced from the same template often differ only in their query fields, and the topics of pages produced from the same template are often similar as well. In short, pages produced from the same page template usually belong to the same category, and their URLs are very similar.

Because the URLs of pages on the same site that belong to the same topic are similar, they can be generalized by one (or a few) regular expressions. For example, for the book e-commerce site www.china-pub.com, the computer book information pages can be generalized by the regular expression http://www.china-pub.com/computers/common/info.asp?id=*, where the asterisk (a metacharacter) stands for the various book index numbers. With this regular expression, one can judge whether the page represented by a given URL on this site is a computer book information page; that is, matching a URL against the regular expression reveals whether the URL is relevant to the topic.
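For illustration, the china-pub.com rule can be tried out with Python's `re` module. The helper name `url_rule_to_regex` is hypothetical, and interpreting the asterisk as "one or more arbitrary characters" is an assumption of this sketch; a stricter reading, such as digits only, would work equally well for book index numbers.

```python
import re


def url_rule_to_regex(rule):
    """Compile a URL rule in which '*' stands for a non-empty run of characters."""
    # Escape everything literally, then turn the escaped asterisk into a wildcard.
    escaped = re.escape(rule)
    return re.compile(escaped.replace(r"\*", ".+"))


rule = "http://www.china-pub.com/computers/common/info.asp?id=*"
book_page = url_rule_to_regex(rule)

# A URL produced by the same template matches the rule ...
is_book = book_page.fullmatch(
    "http://www.china-pub.com/computers/common/info.asp?id=203839") is not None
# ... while a URL produced by a different template on the same site does not.
is_other = book_page.fullmatch(
    "http://www.china-pub.com/member/buybook/view.asp?add=1&tid=203839") is not None
```

Escaping the rule first matters because characters such as `.` and `?` are themselves regular expression metacharacters and would otherwise match more than the literal URL text.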

For a focused crawler, the pages relevant to a topic of interest within one site are often generated by one or a few such templates. The present invention is therefore based on a two-crawler strategy framework consisting of an experimental crawler and a focused crawler, and proposes a focused crawler based on URL rules. This focused crawler learns, from each topic-relevant site, URL regular expressions that represent topic-relevant content pages, topic-relevant directory pages and topic-irrelevant pages, and then uses these URL regular expressions to guide its crawling.

Referring to Fig. 1, a flowchart of an embodiment of the method for directed crawling of page resources of the present invention is shown. The method may comprise the following steps:

Step 101: obtain seed site URLs;

This embodiment does not limit how the seed site URLs are obtained; for example, they may be retrieved from a preset seed resource library or found by keyword search, and those skilled in the art may use any method to obtain the seed site URLs they need. For example, a list of seed sites relevant to the hardware industry is:

cnsaw.com	www.fsonline.com.cn
www.beareyes.com.cn	www.gx.xinhuanet.com
www.beicha.com	www.hnwj.net
www.bxg.cn	www.ieicn.com
www.ce.cn	www.ldmetals.com
www.cenn.cn	www.sealing.cn
www.chemeinfo.com	www.sg001.cn
www.chinabmb.com	www.wjjw.cn
www.chinabtob.net	www.wjw.cn
www.cnpv.com	www.xmnn.cn
www.cutinfo.cn

In practice, a site may contain a large number of hyperlinked pages. The NetEase site (www.163.com), for example, contains many hyperlinks pointing to content such as sports, military affairs and finance. Preferably, therefore, the seed site URLs comprise not only the URLs of the seed sites but also the URLs of all hyperlinks in the pages of those seed sites. However, the site of some hyperlink URLs may not be a seed site: in the NetEase example, some hyperlinks, such as friendly links, point to other sites such as Sohu and Sina, that is, to non-seed sites. In this case, this step may further comprise the following sub-step:

filtering the seed site URLs.

It will be understood that the purpose of the filtering sub-step is to ensure that the crawler crawls in depth only within the seed sites. For example, when pages of the NetEase site are crawled, pages of sites such as Sohu and Sina are not fetched through the hyperlinks among the friendly links.

Analysis of the data structure of a URL shows that a URL may comprise a protocol parameter (such as http), a site parameter (host), a path parameter (path) and a query parameter (query), where the path parameter may comprise a series of directories and the query parameter may comprise a series of key-value pairs. For example, for the URL http://www.china-pub.com/member/buybook/view.asp?add=1&tid=203839, the site parameter is www.china-pub.com; the path parameter is /member/buybook/view.asp; the directories of the path parameter are member, buybook and view.asp; and the query parameter is add=1&tid=203839, that is, its key-value pairs are (add, 1) and (tid, 203839). The main purpose of the filtering step is to extract the site parameter of a URL and judge whether it is a seed site: if so, the URL is kept; if not, the URL is removed. Filtering down to the seed sites reduces the workload of subsequent page crawling and thus effectively improves crawl efficiency and crawl accuracy.
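The decomposition of the example URL can be reproduced with Python's standard `urllib.parse` module, as in the following sketch. The function name `decompose` and the dictionary keys are hypothetical labels chosen to mirror the terms used above.

```python
from urllib.parse import urlsplit, parse_qsl


def decompose(url):
    """Split a URL into protocol, site, path, directories and query key-value pairs."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "site": parts.netloc,
        "path": parts.path,
        # Following the text, the final component (view.asp) counts as a directory.
        "directories": [d for d in parts.path.split("/") if d],
        "query": parse_qsl(parts.query),
    }


info = decompose("http://www.china-pub.com/member/buybook/view.asp?add=1&tid=203839")
```

Here `info["site"]` holds the site parameter used by the filtering step, and `info["query"]` holds the key-value pairs as a list of tuples.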

In practice, the seed site URLs can be filtered by the following sub-steps:

Sub-step A1: read the seed site URLs into an array and sort the array;

Sub-step A2: extract the site parameter of a URL and judge whether the array contains that site parameter; if so, perform sub-step A3; if not, perform sub-step A4;

Sub-step A3: keep the URL;

Sub-step A4: remove the URL.

In practice, binary search can be used to search the array and judge whether it contains the site parameter. As is well known, the basic idea of binary search is to divide the n elements into two halves of roughly equal size and compare a[n/2] with the value x being sought; if x = a[n/2], x is found and the algorithm terminates. Assuming the array elements are in ascending order, if x < a[n/2], the search continues only in the left half of array a; if x > a[n/2], the search continues only in the right half of array a.
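Sub-steps A1 to A4 can be sketched in Python as follows, with the standard `bisect` module supplying the binary search. The function name `filter_by_seed_sites` is hypothetical, and comparing host names case-insensitively is an assumption of this sketch.

```python
from bisect import bisect_left
from urllib.parse import urlsplit


def filter_by_seed_sites(urls, seed_sites):
    """Keep only the URLs whose site parameter is one of the seed sites."""
    # A1: read the seed sites into an array and sort it.
    sites = sorted(site.lower() for site in seed_sites)
    kept = []
    for url in urls:
        # A2: extract the site parameter and binary-search the sorted array for it.
        host = urlsplit(url).netloc.lower()
        index = bisect_left(sites, host)
        if index < len(sites) and sites[index] == host:
            kept.append(url)  # A3: keep the URL
        # A4: otherwise the URL is dropped
    return kept
```

For example, with `www.163.com` as the only seed site, a link to a page on `www.163.com` is kept, whereas a friendly link to a different host is removed.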

Of course, the above filtering method is only an example; those skilled in the art may use any filtering method, for example setting site weights and filtering by weight, or extracting keywords from the URLs and filtering by keyword, and the present invention places no limitation on this.

Step 102: pre-crawl, according to the seed site URLs, pages whose number satisfies the quantity threshold;

This step is an experimental crawler strategy: the experimental crawler crawls pages according to a preset page crawl depth, and the quantity threshold is that preset crawl depth value. As the crawl depth increases, the number of pages to be crawled grows exponentially: assuming a depth of d and N pages crawled per layer, N^d pages will have been crawled by layer d. Consequently, if N is too large, page crawling becomes too inefficient; if N is too small, the crawled pages cover too narrow a range, that is, too little breadth to produce representative regular expressions.

To improve crawl efficiency while ensuring that the regular expressions are representative, the quantity threshold of the present invention was obtained through extensive experiments and repeated adjustment; its value ranges from 1000 to 5000. In practice, the quantity threshold can be set in a configuration file (such as data/config.xml) and read from that file when needed.

Specifically, this step can be implemented on the basis of the existing Nutch crawler. The Nutch Crawler is mainly used to fetch web pages from the network and build an index for them. Two aspects of the Crawler matter here: its workflow, and the format and meaning of the data files involved. The data files fall into three categories: the WebDB (Web database), a series of segments (page data segments), and the index. The WebDB stores the link structure among the pages fetched by the crawler and is used only while the Crawler is working. The WebDB stores information about two kinds of entities: pages and links. A Page entity represents an actual web page by describing its characteristic information; because there are many pages to describe, the WebDB indexes the Page entities in two ways, by the URL of the page and by the MD5 digest of the page content. The page characteristics described by a Page entity mainly include the number of links in the page, fetch-related information such as the time the page was fetched, and the importance score of the page. Similarly, a Link entity describes the link relationship between two Page entities. The WebDB thus forms a link graph of the fetched pages, in which the Page entities are the nodes and the Link entities are the edges.

A single crawl by the Crawler produces many segments; each segment stores the pages fetched by the Crawler in one fetch cycle together with the index of those pages. While crawling, the Crawler generates the fetchlist required for each fetch cycle from the link relationships in the WebDB according to a certain crawl strategy; the Fetchers (download threads) then fetch and index the pages at the URLs in the fetchlist and store them in a segment. Segments have a limited lifetime: once their pages have been fetched again by the Crawler, the segments produced by the earlier fetch become obsolete. Segment files are named by their generation time, which makes it easy to delete obsolete segments and save storage space. The index covers all pages fetched by the Crawler and is obtained by merging the indexes of all individual segments.

The Crawler works as follows. First, the Crawler generates from the WebDB a set of URLs of pages to be fetched, called the fetchlist; the download threads (Fetchers) then start fetching pages according to the fetchlist. If there are many download threads, correspondingly many fetchlists are generated, one fetchlist per Fetcher. The Crawler then updates the WebDB with the fetched pages and generates a new fetchlist from the updated WebDB, containing URLs that have not yet been fetched or are newly discovered, and the next fetch cycle begins. This cycle can be called the "generate/fetch/update" cycle. In addition, URLs pointing to Web resources on the same host are usually assigned to the same fetchlist, which prevents too many Fetchers from fetching from one host at the same time and overloading it.
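The "generate/fetch/update" cycle can be illustrated with a small in-memory simulation in Python. This is not Nutch code and does not use any Nutch API: the dictionary `web_db`, the list `segments` and the callback `fetch_links` (which returns the out-links of a page) are hypothetical stand-ins for the WebDB, the segments and the Fetcher.

```python
def crawl(seed_urls, fetch_links, max_depth):
    """Run generate/fetch/update rounds until max_depth rounds have completed."""
    web_db = {url: False for url in seed_urls}  # URL -> already fetched?
    segments = []
    for _ in range(max_depth):
        # Generate: the fetchlist holds every known URL that has not been fetched yet.
        fetchlist = [url for url, fetched in web_db.items() if not fetched]
        if not fetchlist:
            break
        # Fetch: download each page and record its out-links in a new segment.
        segment = {url: fetch_links(url) for url in fetchlist}
        segments.append(segment)
        # Update: mark fetched pages and add newly discovered URLs to the database.
        for url, links in segment.items():
            web_db[url] = True
            for link in links:
                web_db.setdefault(link, False)
    return web_db, segments
```

Each round produces exactly one segment, so the number of segments equals the number of completed rounds, and URLs discovered in the last round remain in the database marked as not yet fetched.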

In Nutch, the Crawler operation is implemented through a series of sub-operations:

1. Create a new WebDB;

2. Write the initial URLs into the WebDB;

3. Generate a fetchlist from the WebDB and write it into the corresponding segment;

4. Fetch pages according to the URLs in the fetchlist;

5. Update the WebDB with the fetched pages;

6. Repeat steps 3-5 until the preset crawl depth is reached;

7. Update the segments with the page scores and links obtained from the WebDB;

8. Index the fetched pages;

9. Discard pages with duplicate content and duplicate URLs from the index;

10. Merge the indexes of the segments to produce the final index used for retrieval (merge).

The detailed workflow of the Crawler is as follows. After a WebDB is created (step 1), the "generate/fetch/update" cycle (steps 3-6) starts from a number of seed URLs. When this cycle has completely finished, the Crawler creates the index from the segments generated during fetching (steps 7-10). Before duplicate URLs are removed (step 9), the index of each segment is independent (step 8). Finally, the independent segment indexes are merged into one final index (step 10).

Based on the working principle of the Nutch crawler described above, this embodiment modifies the way the fetchlist is generated. Specifically, pages can be pre-crawled by the following sub-steps:

Sub-step B1: write the seed site URLs into a database (WebDB);

Sub-step B2: read a URL from the database and extract its site parameter (host);

Sub-step B3: update the URL crawl count corresponding to the site parameter;

Sub-step B4: judge whether the URL crawl count exceeds the quantity threshold; if not, add the URL to the URL fetch list;

Sub-step B5: download the pages corresponding to the URLs in the URL fetch list and generate the corresponding page data segment (segment);

Sub-step B6: update the database according to the page data segment.
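The modified fetch list generation of sub-steps B2 to B4 can be sketched in Python as follows. The sketch is independent of Nutch; the function name `generate_fetchlist` and the use of a `Counter` to hold the per-site crawl counts are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def generate_fetchlist(candidate_urls, host_counts, threshold):
    """Build a fetch list in which no site exceeds the quantity threshold."""
    fetchlist = []
    for url in candidate_urls:
        # B2: extract the site parameter (host) of the URL.
        host = urlsplit(url).netloc.lower()
        # B3: update the crawl count of that site.
        host_counts[host] += 1
        # B4: add the URL only while the count does not exceed the threshold.
        if host_counts[host] <= threshold:
            fetchlist.append(url)
    return fetchlist


counts = Counter()
candidates = [
    "http://www.ce.cn/a.html",
    "http://www.ce.cn/b.html",
    "http://www.ce.cn/c.html",
    "http://cnsaw.com/index.html",
]
fetchlist = generate_fetchlist(candidates, counts, threshold=2)
```

Because `host_counts` is passed in and mutated, the same counter can be reused across successive rounds so that the threshold applies to the whole pre-crawl of a site rather than to a single round. The page paths in the usage example are invented for illustration; only the host names come from the seed site list.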

As another implementation of this embodiment, the step of filtering the seed site URLs can also be performed within this step. For example, pages can be pre-crawled by the following sub-steps:

Sub-step C1: write the seed site URLs into a database (WebDB);

Sub-step C2: read a URL from the database and extract its site parameter (host);

Sub-step C3: judge whether the site parameter matches a seed site; if so, proceed directly to sub-step C4; if not, remove the URL;

Sub-step C4: update the URL crawl count corresponding to the site parameter;

Sub-step C5: judge whether the URL crawl count exceeds the quantity threshold; if not, add the URL to the URL fetch list;

Sub-step C6: download the pages corresponding to the URLs in the URL fetch list and generate the corresponding page data segment (segment);

Sub-step C7: update the database according to the page data segment.

Of course, the above pre-crawling methods are only examples; those skilled in the art may use any web crawler to pre-crawl pages according to the actual situation, and the present invention places no limitation on this.

Step 103, determine the feature pages among the pre-fetched pages;

In this embodiment, the feature pages can include theme-related pages, and the theme-related pages may further include theme-related content pages and theme-related directory pages. A theme-related directory page is a link page that contains links to theme-related content pages.

To further improve the accuracy of page fetching, the feature pages can also include theme-irrelevant pages.

In the prior art, many page classification methods have been proposed. For example, one page classification method comprises the following steps:

11) preset a sample library, and compute the sample feature parameters for each sample;

12) collect network texts on the internet that meet a precondition, and compute the text feature parameters of each network text;

13) compare the text feature parameters in turn with the feature parameters of each sample in the sample library to complete the classification of the network text. In general, a network text is assigned to the sample class with which it has the highest similarity.

Alternatively, LingPipe (an open-source Java toolkit for natural language processing developed by Alias-i) can be used for page classification. Taking the classification of news pages as an example, the following steps can be included:

21) read the news text content and its categories from an existing database, then store each document containing news content in a different folder according to its category, for example sports news under folder 1 and entertainment news under folder 2; these data are referred to as the training data;

22) based on the open-source LingPipe package, first set a similarity threshold (it can be read from the configuration file data/config.xml; the larger the threshold, the fuzzier the similarity, i.e. the easier it is to misjudge), then extract the text content of the page and match it against the training data; if a match is found, obtain the training data closest to this text content, and take the category of that training data as the content category of the page;

23) extract keywords, assign weights to the keywords as required, and further confirm whether the page should be assigned to that category.

Clearly, the above steps yield the URLs of theme-related pages fairly accurately, which provides a good resource base for learning the regular expressions.

Step 104, generate regular expressions that summarize the feature page URLs;

As is well known, a regular expression is a tool for text matching, usually composed of ordinary characters and metacharacters. Ordinary characters include upper- and lower-case letters and digits, while metacharacters have special meanings. Matching a regular expression can be understood as looking, in a given string, for the parts that conform to the given regular expression. A string may contain more than one part that satisfies the expression, and each such part is called a match. "Match" is used in three senses in this text: as an adjective, e.g. a string matches an expression; as a verb, e.g. to match a regular expression in a string; and as a noun, namely the "part of a string that satisfies a given regular expression" just mentioned.

The construction rules of regular expressions are illustrated below by example.

Suppose we want to find hi; then the regular expression hi can be used. This regular expression matches exactly this string: two characters, the first h and the second i. In practice, regular expressions can ignore case. Many words also contain the two consecutive characters hi, such as him, history, high and so on, so a search with hi also finds the hi inside these words. To find the word hi exactly, \bhi\b should be used. Here \b is a metacharacter of regular expressions that represents the beginning or end of a word, i.e. a word boundary. Although English words are usually separated by spaces, punctuation marks or line breaks, \b does not match any of these separator characters; it matches only a position. To find hi followed, not far behind, by Lucy, \bhi\b.*\bLucy\b should be used. Here . is another metacharacter, matching any character except a newline. * is also a metacharacter, but it denotes a quantity: the content before * may be repeated any number of times for the whole expression to match. The meaning of \bhi\b.*\bLucy\b is now clear: first the word hi, then any number of arbitrary characters (but not line breaks), and finally the word Lucy.
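The word-boundary examples above can be tried out with the standard java.util.regex package. The following is only a small illustrative sketch; the class name RegexBoundaryDemo, its helper methods and the sample sentences are assumptions made for this illustration and are not part of the described method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexBoundaryDemo {
    /** Counts the non-overlapping matches of regex in text. */
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /** True when the word "hi" is later followed by the word "Lucy" on the same line. */
    static boolean hiThenLucy(String text) {
        // In Java source each backslash of \b must itself be escaped.
        return Pattern.compile("\\bhi\\b.*\\bLucy\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "hi, this is his history";
        // The plain pattern also finds "hi" inside this, his and history.
        System.out.println(countMatches("hi", text));
        // With word boundaries only the standalone word is found.
        System.out.println(countMatches("\\bhi\\b", text));
        System.out.println(hiThenLucy("hi there, Lucy"));
    }
}
```

Case-insensitive matching, mentioned in the text, is available in this package through the Pattern.CASE_INSENSITIVE flag.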

Preferably, this step can comprise the following substeps:

Substep D1, divide the feature page URLs into a plurality of URL subsets;

Substep D2, aggregate the URL subsets into a plurality of URL classes;

Substep D3, extract the regular expression of each URL class.

In general, the substep of dividing into a plurality of URL subsets can be realized by placing URLs that have the same site parameter in the same URL subset.

More specifically, the substep of dividing into a plurality of URL subsets can be realized by the following step:

place URLs that have the same site parameter in the same URL subset;

In some cases, the substep of dividing into a plurality of URL subsets can be realized by the following steps:

place URLs that have the same site parameter in the same URL subset;

place URLs whose path parameters have the same number of directories in the same URL subset;

As can be seen, URL division splits the URL set into several subsets according to some criterion. Its input is a URL set and its output is several URL subsets. The division is first based on the site parameter of the URL: URLs with the same site parameter are placed in the same subset, so the whole URL set is divided into several subsets, each containing URLs with the same site parameter. The subsets can then be divided further, for example by the number of directories in the path parameter of the URL (URLs with the same number go into the same subset), or further still by the query parameter of the URL (URLs with the same query parameter go into the same subset). In practice, however, if two URLs differ only in the query parameter, the pages they point to are almost always produced by the same template and generally belong to the same class of page. Division by query parameter is therefore used only in rather extreme cases. The main purpose of dividing the URL set is to facilitate the clustering in the next step and save clustering time.

In this case, the substep of aggregating into a plurality of URL classes can comprise:

preset the clustering rule of the URL classes;

After aggregation, a number of URL classes are obtained in the URL class queue. One URL class can contain several similar URLs, i.e. the URLs within a URL class have a certain similarity. Preferably, therefore, the clustering rule can be realized by defining a similarity function such that all URLs in the same class satisfy this similarity function with respect to one another.

To obtain a better clustering result, the substep of aggregating into a plurality of URL classes can also comprise the step:

count the number of URL classes and the total number of URLs;

Then, for each URL class, a URL regular expression that summarizes and represents the class is extracted. Specifically, following the data structure of the URL, each URL is decomposed into three parts, host (site parameter), path (path parameter) and query (query parameter); the path is decomposed into a series of directories and the query into a series of key-value pairs. Because the host part is necessarily identical within a class, the host is written into the regular expression directly. The directories of the path part are then aligned: if the directories at a given position are identical, that value is added to the regular expression; otherwise * is added. The query part is added to the regular expression in a similar way to the path part. This produces the corresponding regular expressions, of which there are generally several.

Step 105, match the seed site URLs against the regular expressions, and keep the seed site URLs that satisfy the matching condition as fetch target URLs;

When the feature pages are theme-related pages, the regular expressions are generated by summarizing the URLs of the theme-related pages. In this case, the matching step can comprise:

When the feature pages are theme-irrelevant pages, the regular expressions are generated by summarizing the URLs of the theme-irrelevant pages. In this case, the matching step may further comprise:

By matching a seed site URL against both the theme-related regular expressions and the theme-irrelevant regular expressions, it can be better determined whether to fetch the page the URL represents, which further improves the effectiveness and accuracy of page fetching.

Of course, those skilled in the art may set the matching condition according to the actual situation, and the present invention does not limit this.

In practice, a few theme-related pages may be classified as theme-irrelevant, or a few theme-irrelevant pages as theme-related. As a result, the theme-related URL regular expressions learned from a site may include some that in fact represent theme-irrelevant pages. To filter out such regular expressions, this embodiment can further comprise the following step:

count the number of URLs matched by each regular expression;

Usually, such URL regular expressions match relatively few URLs, so URL regular expressions that match very few URLs can be filtered out as noise.

Specifically, the URL quantity can be counted under the following matching strategy:

1, if a URL matches several theme-related URL regular expressions, the URL is matched to the regular expression that matches the largest number of URLs;

2, if a URL matches several theme-irrelevant URL regular expressions, the URL is matched to the regular expression that matches the largest number of URLs.

In this case, if the feature pages include both theme-related pages and theme-irrelevant pages, the matching step can comprise the substeps:

Substep S1, match the seed site URL against the regular expressions of the theme-related pages and against the regular expressions of the theme-irrelevant pages;

Substep S2, count the number of URLs corresponding to the theme-related regular expression matched by the seed site URL and the number of URLs corresponding to the matched theme-irrelevant regular expression, and compare the two;

Substep S3, if the comparison result meets the noise filtering threshold, the seed site URL does not satisfy the matching condition.

Step 106, fetch pages according to the fetch target URLs.

The page fetching in this step can be implemented with reference to the Nutch crawler described above or with other prior-art methods; the present invention does not limit this.

In practice, the learning may suffer from "over-learning" (the learned regular expression is too general and also matches URLs of other categories) or "under-learning" (the learned regular expression matches only part of a category, so other URLs of that category are not matched). To address this problem, the following strategy can also be applied when fetching pages:

1, judge whether the site parameter of the new URL belongs to a derived site; if not, do not fetch the page of this URL and end the process;

2, look in the URL regular expression list for a theme-related URL regular expression that matches the current URL; if there is none, do not fetch the page of this URL and end the process;

3, if several theme-related URL regular expressions match, select the optimal one, where optimal means that it summarizes the largest number of URLs;

4, look in the URL regular expression list for a theme-irrelevant URL regular expression that matches; if there is none, fetch the page of this URL and end the process;

5, if several theme-irrelevant URL regular expressions match, select the optimal one;

6, if P/N > f, fetch the page; otherwise do not fetch it.

Here P is the number of URLs summarized by the optimal theme-related URL regular expression, N is the number of URLs summarized by the optimal theme-irrelevant URL regular expression, and f is a filtering threshold greater than 0.
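The six-step strategy above can be sketched as a single decision function. This is a minimal illustration under stated assumptions, not the patented implementation: the names CrawlDecision, LearnedPattern, bestCount and shouldFetch are hypothetical, each learned expression is assumed to summarize at least one URL, and a URL is assumed to match an expression only when the whole string matches.

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

final class CrawlDecision {
    /** A learned URL regular expression plus the number of URLs it summarizes (assumed >= 1). */
    static final class LearnedPattern {
        final Pattern regex;
        final int urlCount;

        LearnedPattern(String regex, int urlCount) {
            this.regex = Pattern.compile(regex);
            this.urlCount = urlCount;
        }
    }

    /** Largest urlCount among the patterns that match the URL, or 0 when none matches. */
    static int bestCount(List<LearnedPattern> patterns, String url) {
        int best = 0;
        for (LearnedPattern p : patterns) {
            if (p.regex.matcher(url).matches()) {
                best = Math.max(best, p.urlCount);
            }
        }
        return best;
    }

    static boolean shouldFetch(String host, String url, Set<String> derivedHosts,
                               List<LearnedPattern> related,
                               List<LearnedPattern> irrelevant, double f) {
        if (!derivedHosts.contains(host)) {
            return false;                      // step 1: not a derived site
        }
        int p = bestCount(related, url);       // steps 2-3: optimal theme-related expression
        if (p == 0) {
            return false;
        }
        int n = bestCount(irrelevant, url);    // steps 4-5: optimal theme-irrelevant expression
        if (n == 0) {
            return true;
        }
        return (double) p / n > f;             // step 6: compare P/N with the threshold f
    }

    public static void main(String[] args) {
        List<LearnedPattern> related =
                java.util.Arrays.asList(new LearnedPattern("www\\.example\\.com/news/.*", 40));
        List<LearnedPattern> irrelevant =
                java.util.Arrays.asList(new LearnedPattern("www\\.example\\.com/news/ads/.*", 10));
        Set<String> hosts = java.util.Collections.singleton("www.example.com");
        System.out.println(shouldFetch("www.example.com", "www.example.com/news/ads/1.html",
                hosts, related, irrelevant, 1.0));
    }
}
```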

To help those skilled in the art understand the present invention better, an embodiment of the invention is described below through a concrete example.

(1) Obtain the seed site URLs, filter the hyperlink URLs in the pages that the seed site URLs represent, and obtain the theme-related "derived site URL list".

The detailed process is as follows. First read the seed site URLs from a seed site file into an array, then sort the array. For each URL to be fetched, first take its site parameter, then use binary search to look for this site parameter in the array. If the array contains the site parameter, return the URL: this means the URL is not filtered out and the page it represents should be fetched. If the array does not contain the site name, filter out the URL: the page it represents should not be fetched.
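The sort-then-binary-search filter described above can be sketched as follows. It is a minimal illustration only: the class SeedSiteFilter and its method accepts are hypothetical names, the seed hosts are assumed to be already extracted from the seed site file, and java.net.URI is used here simply as one convenient way to obtain the host of a URL.

```java
import java.net.URI;
import java.util.Arrays;

final class SeedSiteFilter {
    private final String[] sortedHosts;

    SeedSiteFilter(String[] seedHosts) {
        // Copy and sort once so that every later lookup can use binary search.
        sortedHosts = seedHosts.clone();
        Arrays.sort(sortedHosts);
    }

    /** Returns true when the host of the URL is one of the seed hosts. */
    boolean accepts(String url) {
        String host = URI.create(url).getHost();
        return host != null && Arrays.binarySearch(sortedHosts, host) >= 0;
    }

    public static void main(String[] args) {
        SeedSiteFilter filter =
                new SeedSiteFilter(new String[] {"www.b.example", "www.a.example"});
        System.out.println(filter.accepts("http://www.a.example/news/1.html"));
        System.out.println(filter.accepts("http://other.example/"));
    }
}
```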

This step reduces the workload of the experimental crawler and at the same time improves its efficiency and accuracy.

(2) Use the experimental crawler to fetch the pages to be learned from:

Using the filtered seed site URL list (i.e. the derived site URL list), an experimental crawler fetches web pages from these URLs. It uses a breadth-first search algorithm to fetch at most N pages from each seed site (N is 1000-5000). The experimental crawler is based on Nutch (Apache 2007), but in this example the fetch list generator of the Nutch crawler is modified. The flow of the modified fetch list generator is:

1, read the URL quantity threshold set in the configuration file (data/config.xml), and initialize MAP (site parameter, URL quantity);

2, extract the site parameter of the URL to be fetched and look it up in MAP; if it is present, go directly to 4; if not, go to 3;

3, add the site parameter to MAP and initialize its URL quantity to 0;

4, add 1 to the URL quantity of the site parameter;

5, judge whether the URL quantity of the site parameter exceeds the quantity threshold; if so, do not add the URL to the fetch list; if not, add the URL to the fetch list.
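The five steps of the modified fetch list generator amount to a per-host counter with a cap. The sketch below illustrates that idea outside of Nutch; the class HostQuota and the method buildFetchList are hypothetical names, the threshold is passed in directly rather than read from data/config.xml, and no Nutch API is used.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HostQuota {
    /** Keeps each URL only while its host has not exceeded the quantity threshold. */
    static List<String> buildFetchList(List<String> urls, int threshold) {
        Map<String, Integer> perHost = new HashMap<>();   // MAP (site parameter, URL quantity)
        List<String> fetchList = new ArrayList<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            // Steps 2-4: insert the host with 0 if absent, then add 1.
            int count = perHost.merge(host, 1, Integer::sum);
            // Step 5: add the URL only when the count does not exceed the threshold.
            if (count <= threshold) {
                fetchList.add(url);
            }
        }
        return fetchList;
    }

    public static void main(String[] args) {
        List<String> urls = java.util.Arrays.asList(
                "http://a.example/1", "http://a.example/2",
                "http://a.example/3", "http://b.example/1");
        System.out.println(buildFetchList(urls, 2));
    }
}
```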

(3) Classify the N fetched pages, determining whether each page is a theme-related content page or a theme-related directory page; the theme-related content pages and theme-related directory pages make up the set of theme-related pages;

(4) Learn the URL regular expressions from the set of theme-related pages:

The concrete learning process is:

1. URL distance:

(1) URL data structure

A URL is divided into three parts (leaving out the http protocol part): host, path and query. The path consists of a series of directories, and the query consists of a series of key-value pairs. Take the URL http://www.china-pub.com/member/buybook/view.asp?add=1&tid=203839 as an example: its host is www.china-pub.com; its path is /member/buybook/view.asp, and the directories that make up this path are member, buybook and view.asp; its query is add=1&tid=203839, and the key-value pairs that make up this query are (add, 1) and (tid, 203839). Expressed in Java, the URL data structure is as follows:

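import java.util.ArrayList;

public class URLStruct {
    private String host;                           // site parameter, e.g. www.china-pub.com
    private String[] path;                         // directories of the path, in order
    private ArrayList<Pair<String, String>> query; // key-value pairs; Pair is a two-field helper type, not part of the JDK
}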
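As an illustration of how a URL might be decomposed into this structure, the following sketch parses the example URL with java.net.URI. The class UrlParts and its fields are hypothetical stand-ins for the structure above (a two-element String array replaces the Pair type so that the sketch compiles on its own), and the parsing rules are assumptions rather than the inventors' code.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

final class UrlParts {
    final String host;
    final String[] path;          // directories of the path, in order
    final List<String[]> query;   // each element is {key, value}

    private UrlParts(String host, String[] path, List<String[]> query) {
        this.host = host;
        this.path = path;
        this.query = query;
    }

    static UrlParts parse(String url) {
        URI uri = URI.create(url);
        String rawPath = uri.getPath() == null ? "" : uri.getPath();
        String trimmed = rawPath.startsWith("/") ? rawPath.substring(1) : rawPath;
        String[] path = trimmed.isEmpty() ? new String[0] : trimmed.split("/");

        List<String[]> query = new ArrayList<>();
        String rawQuery = uri.getQuery();
        if (rawQuery != null && !rawQuery.isEmpty()) {
            for (String pair : rawQuery.split("&")) {
                int eq = pair.indexOf('=');
                String key = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                query.add(new String[] {key, value});
            }
        }
        return new UrlParts(uri.getHost(), path, query);
    }

    public static void main(String[] args) {
        UrlParts parts =
                parse("http://www.china-pub.com/member/buybook/view.asp?add=1&tid=203839");
        System.out.println(parts.host + " " + String.join(",", parts.path));
    }
}
```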

(2) Measuring the URL distance (similarity)

Once a URL has been decomposed into the URL data structure above, the distance between URLs can be computed from the distances between the corresponding parts of the structure. The distance $d_{URL}(i, j)$ between two URLs i and j is given by the following formula:

$d_{URL}(i, j) = (d_{Host}(i, j) + 1) \times (d_{Path}(i, j) + 1) \times (d_{Query}(i, j) + 1) - 1$

where $d_{Host}(i, j)$ is the distance between the host parts of URLs i and j, $d_{Path}(i, j)$ is the distance between their path parts, and $d_{Query}(i, j)$ is the distance between their query parts.

The distances of the three parts are computed as follows:

a, if the hosts of the two URLs differ, then $d_{Host}(i, j) = 32$; otherwise $d_{Host}(i, j) = 0$.

Setting $d_{Host}(i, j)$ in this way helps prevent URLs from different hosts from being gathered into the same class.

b, let the path parts of URLs i and j have m and n directories respectively, with $m \le n$. Let k be the number of positions among the first m at which the directories differ. Then

$d_{Path}(i, j) = k \times 2 + (n - m) \times 4$

c, $d_{Query}(i, j)$ is set simply: if the query parts are equal, then $d_{Query}(i, j) = 0$; otherwise $d_{Query}(i, j) = 1$.

This is because, for most sites, if all other parts are equal and only the query parts differ, the pages the two URLs point to are usually produced by the same template and generally belong to the same class of page.
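The distance rules a, b and c and the combining formula can be written down directly. The following sketch is illustrative only: the class UrlDistance and its methods are hypothetical names, java.net.URI is used to split the URL, and the query parts are assumed to be compared as whole strings, which is one possible reading of "equal".

```java
import java.net.URI;
import java.util.Objects;

final class UrlDistance {
    /** Splits a path such as /a/b/c into its directories. */
    static String[] directories(String path) {
        if (path == null) {
            return new String[0];
        }
        String trimmed = path.startsWith("/") ? path.substring(1) : path;
        return trimmed.isEmpty() ? new String[0] : trimmed.split("/");
    }

    static int hostDistance(String a, String b) {
        return Objects.equals(a, b) ? 0 : 32;               // rule a
    }

    static int pathDistance(String[] a, String[] b) {
        String[] shorter = a.length <= b.length ? a : b;    // m directories
        String[] longer = a.length <= b.length ? b : a;     // n directories
        int k = 0;
        for (int i = 0; i < shorter.length; i++) {
            if (!shorter[i].equals(longer[i])) {
                k++;
            }
        }
        return k * 2 + (longer.length - shorter.length) * 4; // rule b
    }

    static int queryDistance(String a, String b) {
        return Objects.equals(a, b) ? 0 : 1;                // rule c
    }

    static int distance(String urlA, String urlB) {
        URI a = URI.create(urlA);
        URI b = URI.create(urlB);
        int dHost = hostDistance(a.getHost(), b.getHost());
        int dPath = pathDistance(directories(a.getPath()), directories(b.getPath()));
        int dQuery = queryDistance(a.getQuery(), b.getQuery());
        return (dHost + 1) * (dPath + 1) * (dQuery + 1) - 1;
    }

    public static void main(String[] args) {
        System.out.println(distance("http://h.example/a/b?x=1", "http://h.example/a/b?x=2"));
        System.out.println(distance("http://h.example/a/b", "http://h.example/a/c"));
    }
}
```

Because each factor is offset by 1, the product stays small when only one part differs and grows quickly when several parts differ at once; two URLs that are identical in all three parts have distance 0.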

2. URL set division:

URL division splits the URL set into several subsets according to some criterion. Its input is a URL set and its output is several URL subsets. In this example, URLs can be divided by the host part: URLs with the same host are placed in the same subset, so the whole URL set is divided into several subsets in which the URLs of each subset share one host and the URLs of different subsets have different hosts. Alternatively, the subsets can be divided again by the number of directories in the path part of the URL, with URLs having the same number placed in the same subset.
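The division by host and then by directory count can be sketched as a simple grouping. The class UrlPartition, the method partition and the composite key of the form host#depth are assumptions made for this illustration; any key that combines the two criteria would serve.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class UrlPartition {
    /** Groups URLs by host and by the number of directories in the path. */
    static Map<String, List<String>> partition(List<String> urls) {
        Map<String, List<String>> subsets = new TreeMap<>();
        for (String url : urls) {
            URI uri = URI.create(url);
            String path = uri.getPath() == null ? "" : uri.getPath();
            int depth = 0;
            for (String segment : path.split("/")) {
                if (!segment.isEmpty()) {
                    depth++;
                }
            }
            // Hypothetical composite key: same host and same directory count.
            String key = uri.getHost() + "#" + depth;
            subsets.computeIfAbsent(key, k -> new ArrayList<>()).add(url);
        }
        return subsets;
    }

    public static void main(String[] args) {
        List<String> urls = java.util.Arrays.asList(
                "http://h1.example/a/b", "http://h1.example/c/d",
                "http://h1.example/e", "http://h2.example/a/b");
        System.out.println(partition(urls).keySet());
    }
}
```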

3. URL aggregation algorithm:

Dividing the URL set yields several URL subsets. Each such subset is a cluster of URLs, and the aggregation algorithm is applied to each cluster to aggregate it into several URL classes. The aggregation proceeds in the following steps:

(1) read the URL list of one URL cluster obtained from the URL set division;

(2) create a URL class list $C_{URL} = \{C_{URL}(1), C_{URL}(2), \ldots, C_{URL}(j), \ldots, C_{URL}(m)\}$ and initialize it so that $C_{URL}(1) = URL(1)$; set the distance threshold $h = 1$;

(3) continue reading URL(i) from the URL list of the cluster being aggregated, $i = 2, 3, \ldots, n$; if $i \ge n$, finish;

(4) look in the URL class list $C_{URL}$ for a URL class $C_{URL}(j)$ that matches URL(i). If the match succeeds, add URL(i) to this $C_{URL}(j)$. If no URL class matches, create a new URL class $C_{URL}(j+1)$, add URL(i) to the new class, and insert the new class into the URL class list $C_{URL}$. Then go to step (3).

A URL matches a URL class when the distance between this URL and every URL in the class is no greater than a distance threshold h, i.e.

$d_{URL}(i, j) \le h$

After aggregation, a number of URL classes are obtained in the URL class queue. The distance between any two URLs in the same class is no greater than the distance threshold h, so the URLs in the same class have a certain similarity.

Check whether the URL class list obtained by aggregation satisfies the following condition:

where m is the number of URL classes in the aggregation result produced by static aggregation, n is the total number of URLs, l is the maximum number of URLs contained in any $C_{URL}(j)$, and p is the quantity induction parameter, $0 < p < 1$. The larger p is, the fewer URL classes m the aggregation produces and the more URLs each class contains. A suitable value of p can be found by experimental comparison. If the condition is satisfied, finish; otherwise set $h = h + 1$ and go to step (3) to continue aggregating.

Through the above aggregation process, a URL class list $C_{URL} = \{C_{URL}(1), C_{URL}(2), \ldots, C_{URL}(j), \ldots, C_{URL}(m)\}$ is obtained, in which each class $C_{URL}(j)$ represents one or more URLs.
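One pass of steps (1) to (4) for a fixed threshold h can be sketched as follows. This is an illustration under assumptions: the class UrlClustering and the method cluster are hypothetical names, the distance is supplied by the caller as a function, and the outer loop that raises h until the stopping condition holds is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntBiFunction;

final class UrlClustering {
    /** Single aggregation pass: a URL joins the first class whose members are all within h. */
    static List<List<String>> cluster(List<String> urls,
                                      ToIntBiFunction<String, String> distance, int h) {
        List<List<String>> classes = new ArrayList<>();
        for (String url : urls) {
            List<String> target = null;
            for (List<String> candidate : classes) {
                boolean fits = true;
                for (String member : candidate) {
                    if (distance.applyAsInt(url, member) > h) {
                        fits = false;
                        break;
                    }
                }
                if (fits) {
                    target = candidate;
                    break;
                }
            }
            if (target == null) {
                // No matching class: create a new one and append it to the class list.
                target = new ArrayList<>();
                classes.add(target);
            }
            target.add(url);
        }
        return classes;
    }

    public static void main(String[] args) {
        // Toy distance (difference in length) used only to exercise the pass.
        List<List<String>> result = cluster(
                java.util.Arrays.asList("a", "bb", "ccccc"),
                (x, y) -> Math.abs(x.length() - y.length()), 1);
        System.out.println(result);
    }
}
```

The first URL always opens the first class, which corresponds to the initialization in step (2).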

4. URL regular expression extraction:

Aggregating a cluster of URLs produces several URL classes $C_{URL}(j)$. The extraction step that follows derives, for each URL class, a URL regular expression that summarizes and represents the class.

Following the URL data structure described above, every URL is decomposed into the three parts host, path and query; the path is decomposed into a series of directories and the query into a series of key-value pairs. Because the host part is necessarily identical, the host is written into the regular expression as is. The directories of the path part are aligned: if the directories at a given position are identical, that value is added to the regular expression; otherwise * is added. The query part is handled in a similar way to the path part. In the end, a number of regular expressions related to the hardware industry are obtained.
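The alignment idea can be sketched as follows. Several choices here are assumptions and not taken from the text: the class UrlPatternExtractor and its methods are hypothetical names, all URLs in a class are assumed to have the same number of directories and of query pairs, the generated expression covers host, path and query without the protocol, and the wildcard written as * in the text is replaced by the explicit character classes [^/?]+ and [^&]* so that the result is a valid java.util.regex pattern.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class UrlPatternExtractor {
    /** Splits s on sep, returning an empty array for null or empty input. */
    static String[] tokens(String s, String sep) {
        return (s == null || s.isEmpty()) ? new String[0] : s.split(sep);
    }

    /** Emits the literal token where all rows agree and the wildcard where they differ. */
    static String align(List<String[]> rows, String sep, String wildcard) {
        int width = rows.get(0).length;
        for (String[] row : rows) {
            if (row.length != width) {
                throw new IllegalArgumentException("URLs in a class must align");
            }
        }
        StringBuilder out = new StringBuilder();
        for (int col = 0; col < width; col++) {
            String first = rows.get(0)[col];
            boolean same = true;
            for (String[] row : rows) {
                if (!row[col].equals(first)) {
                    same = false;
                }
            }
            if (col > 0) {
                out.append(sep);
            }
            out.append(same ? Pattern.quote(first) : wildcard);
        }
        return out.toString();
    }

    static String extract(List<String> urlClass) {
        String host = URI.create(urlClass.get(0)).getHost();
        List<String[]> paths = new ArrayList<>();
        List<String[]> queries = new ArrayList<>();
        for (String url : urlClass) {
            URI uri = URI.create(url);
            String path = uri.getPath() == null ? "" : uri.getPath();
            paths.add(tokens(path.startsWith("/") ? path.substring(1) : path, "/"));
            queries.add(tokens(uri.getQuery(), "&"));
        }
        String regex = Pattern.quote(host) + "/" + align(paths, "/", "[^/?]+");
        String query = align(queries, "&", "[^&]*");
        return query.isEmpty() ? regex : regex + "\\?" + query;
    }

    public static void main(String[] args) {
        System.out.println(extract(java.util.Arrays.asList(
                "http://www.example.com/news/2007/a.html",
                "http://www.example.com/news/2007/b.html")));
    }
}
```

Quoting the literal parts with Pattern.quote keeps characters such as the dots in a host name from being interpreted as metacharacters.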

(5) Guided by the regular expressions, the focused crawler fetches Web pages according to the "derived site URL list". The main steps are:

1, read the derived site URL list;

2, read the URL regular expression list file obtained in the previous stage (containing the theme-related regular expressions and the theme-irrelevant regular expressions), then match each derived site URL against the regular expressions as positive and negative examples to decide whether to fetch the Web page at that URL, and generate the "fetch target URL list" from the derived URLs to be fetched;

3, fetch Web pages according to the "fetch target URL list".

The focused crawler in this example is also based on the Nutch crawler, and again only the fetch list generator is modified. The flow of the modified fetch list generator is:

1, initialization: initialize a site filter and a URL regular expression filter;

The rule of the site filter is to check whether the host of the URL to be fetched can be found in the derived site list. If it can, the URL passes the site filter and is not filtered out. This ensures that the crawler does not fetch pages from sites outside the theme-related site list, which improves the efficiency of vertical crawling.

The rule of the URL regular expression filter is to judge whether the URL to be fetched matches the regular expressions; if so, it is not filtered out; if not, it is filtered out.

2, read a URL to be fetched from the WebDB;

3, filter the URL with the site filter; if it is to be filtered out, go to 6;

4, filter the URL with the URL regular expression filter; if it is to be filtered out, go to 6;

5, add the URL to the fetch URL list;

6, if the WebDB has been read completely, finish; otherwise go to 2.

(6) Noise filtering

In practice, a few theme-related pages may be judged theme-irrelevant, or a few theme-irrelevant pages judged theme-related. As a result, the theme-related URL regular expressions learned from a site may include some that in fact represent theme-irrelevant pages, and these URL regular expressions need to be filtered out. Because such URL regular expressions summarize relatively few URLs, those that summarize very few URLs can be filtered out as noise. The filtering criterion is:

$n_{URL} \le N \times v$

where $n_{URL}$ is the number of URLs summarized by the regular expression and v is the "noise filtering threshold", usually $0 < v < 1$.

To better illustrate the technical effect of the present invention, the following compares the performance of the focused crawler of the present invention (UBFC), a breadth-first search crawler (BFSC) and a basic focused crawler (BLFC) in fetching hardware industry news pages.

Performance can be evaluated by the yield rate and the recall rate. Specifically, the yield rate is the proportion of theme-related pages among all web pages fetched by the crawler, and represents the accuracy of fetching. It can be calculated by the following formula:

$g = p / d$

where g is the yield rate, p is the number of theme-related pages fetched, and d is the total number of web pages fetched.

The recall rate is the ratio of the theme-related pages fetched by the crawler to all theme-related pages on the internet, and can be calculated by the following formula:

$r = p / p_s$

where r is the recall rate, p is the number of theme-related pages fetched, and $p_s$ is the total number of theme-related pages that actually exist.

It should be noted that, because the number of theme-related pages on the internet cannot be counted, the recall rate is very difficult to calculate in practice. The experiment is therefore based on a simulated data set: the simulated data set stands in for the internet, and the number of theme-related pages it contains is used as the total number of theme-related pages when calculating the recall rate.

UBFC, BFSC and BLFC were each used to crawl the real network starting from the seed site list for the hardware industry field mentioned above, with a crawl depth of 4 and 57 threads. The results for each crawler are shown in the fetch test results table below. In the table, S is the total number of pages fetched, P is the number of related pages, G is the yield rate and R is the recall rate. Note that BFSC has no recall rate, because the number of related pages fetched by BFSC is used as the reference value for calculating the recall rates of the other two crawlers.

The fetch test results of BFSC, BLFC and UBFC are:

As can be seen from the table above, sites such as beareyes, ce, chemeinfo, chinabmb, cnpv, fsonline, ieicn, sg001 and xmnn have almost no theme-related pages, yet BFSC still downloaded 51135 pages from these sites, of which 401 are theme-related. BLFC downloaded only 593 pages from these sites, of which 4 are theme-related. UBFC did better still: it downloaded 575 pages, of which 42 are theme-related. In this respect UBFC clearly performs best.

Overall, BFSC fetched 82946 pages in total, of which only 3558 are theme-related, a yield rate of 0.04. BLFC fetched 1629 pages in total, of which 95 are theme-related, a yield rate of 0.06, only slightly higher than that of BFSC, and its recall rate is only 0.03. On this point the practical value of BLFC is very small, because a large number of related pages are not fetched. The reason is that BLFC only fetches URLs linked from theme-related pages: not all theme-related pages are linked together, and present-day pages often contain a large number of links, the great majority of which may be theme-irrelevant. This is also why BLFC fetched so few pages. UBFC fetched 5670 pages in total, of which 1514 are theme-related; its yield rate is 0.27 and its recall rate is 0.43. The total number of pages UBFC fetched is only about one fifteenth of that of BFSC, yet its recall rate is close to one half, more than 10 times that of BLFC, and its yield rate is more than 6 times that of BFSC and more than 4 times that of BLFC.
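The yield and recall figures quoted above can be recomputed from the two formulas g = p/d and r = p/ps, using the page counts stated in the text and the 3558 related pages fetched by BFSC as the reference total. The class CrawlMetrics and its method names are hypothetical and serve only as a small arithmetic check.

```java
final class CrawlMetrics {
    /** Yield rate g = p / d: related pages fetched over all pages fetched. */
    static double yieldRate(int relatedFetched, int totalFetched) {
        return (double) relatedFetched / totalFetched;
    }

    /** Recall rate r = p / ps: related pages fetched over all related pages that exist. */
    static double recallRate(int relatedFetched, int relatedExisting) {
        return (double) relatedFetched / relatedExisting;
    }

    public static void main(String[] args) {
        int reference = 3558; // related pages fetched by BFSC, used as ps
        System.out.println(String.format("BFSC yield  %.2f", yieldRate(3558, 82946)));
        System.out.println(String.format("BLFC yield  %.2f", yieldRate(95, 1629)));
        System.out.println(String.format("BLFC recall %.2f", recallRate(95, reference)));
        System.out.println(String.format("UBFC yield  %.2f", yieldRate(1514, 5670)));
        System.out.println(String.format("UBFC recall %.2f", recallRate(1514, reference)));
    }
}
```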

Further can be with reference to figure 2, this is the trend comparison figure that grasps the theme related web page in the performance test process, this comparison diagram and time correlation.The related pages quantity of three crawler capturings is similar when just beginning, but As time goes on, the growth of UBFC is more and more faster, thereby proves that advantage of the present invention begins to embody.And BFSC and BLFC increases slowly, and be last, and the related pages that BLFC grasps is even less than BFSC, this be because, one, whether BLFC put into a unknown URL in decision and need judge whether its parent page is the theme related pages, and this needs the time when grasping tabulation; Two, BLFC is when whether decision will put into the extracting tabulation to a unknown URL, carried out more filtration with respect to BFSC, so the page that its each layer need grasp with respect to BLFC seldom, finished the extracting of one deck page so soon, making it that a large amount of time be arranged is not to grasp carrying out the actual page, but consume the self of data structure of Nutch safeguarded.

Fig. 3 further shows the relationship between yield rate and time for the three crawlers. As the figure shows, the yield rates of BFSC and BLFC stabilize quickly, whereas the yield rate of UBFC keeps rising for a longer time before gradually stabilizing, and it is far higher than that of the other crawlers.

From the above analysis it can be concluded that the present invention effectively improves the yield rate and the recall of page resource crawling, and better helps people obtain the required information from the Internet on a large scale, efficiently and accurately.

For simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.

Referring to Fig. 4, a structural block diagram of an embodiment of a device for directionally crawling page resources according to the present invention is shown. The device may comprise the following modules:

A seed filtering module 401, configured to filter seed site URLs, the seed site URLs comprising the URLs of the seed sites and the URLs of their links;

An experimental crawler module 402, configured to pre-crawl pages up to a quantity threshold according to the seed site URLs;

A classification module 403, configured to determine feature pages among the pre-crawled pages;

A regular expression learning module 404, configured to generate regular expressions that generalize the URLs of the feature pages;

A matching module 405, configured to match the seed site URLs against the regular expressions and to retain the seed site URLs that satisfy the matching condition as crawl target URLs;

A focused crawler module 406, configured to crawl pages according to the crawl target URLs.

Preferably, the regular expression learning module 404 may comprise the following submodules:

Preferably, the URL comprises a site parameter, and the set division submodule may comprise the following unit:

A first division unit, configured to place URLs having the same site parameter into the same URL subset.

More preferably, the URL may further comprise a path parameter, and the set division submodule may further comprise the following unit:

Further, the URL may also comprise a query parameter, and the set division submodule may further comprise the following unit:

Preferably, the clustering submodule may comprise the following units:

More preferably, the clustering submodule may further comprise the following units:

Preferably, the seed filtering module may comprise the following units:

Preferably, this embodiment may further comprise the following modules:

Referring to Fig. 5, a flowchart of a method for directionally crawling page resources using the embodiment shown in Fig. 4 is shown. The method may comprise the following steps:

Step 501: the seed filtering module filters the seed site URLs;

Preferably, the seed site URLs comprise the URLs of the seed sites and the URLs of their links. In this case, the processing performed by the seed filtering module comprises:

An array generation submodule reads the seed site URLs into an array and sorts the array;

A site filtering submodule extracts the site parameter of a given URL and determines whether the array contains that site parameter; if so, the URL is kept; if not, the URL is removed.
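The seed filtering of step 501 can be sketched as follows. This is only an illustration under stated assumptions: the site parameter is taken to be the host name of the URL, the sorted array is assumed to hold the host names of the seed sites, and the helper names (`site_of`, `build_seed_array`, `contains`, `filter_urls`) are hypothetical. Sorting the array is what makes the binary-search lookup possible.

```python
from bisect import bisect_left
from urllib.parse import urlsplit


def site_of(url):
    """Return the site parameter of a URL (assumed here to be its host name)."""
    return (urlsplit(url).hostname or "").lower()


def build_seed_array(seed_urls):
    """Read the seed sites into an array and sort it."""
    return sorted({site_of(u) for u in seed_urls})


def contains(sorted_sites, site):
    """Binary search in the sorted array."""
    i = bisect_left(sorted_sites, site)
    return i < len(sorted_sites) and sorted_sites[i] == site


def filter_urls(candidate_urls, sorted_sites):
    """Keep a URL only if its site parameter appears in the seed array."""
    return [u for u in candidate_urls if contains(sorted_sites, site_of(u))]


seeds = ["http://shop.example.com/", "http://news.example.org/index.html"]
seed_array = build_seed_array(seeds)
kept = filter_urls(
    ["http://shop.example.com/item?id=1", "http://ads.example.net/banner"],
    seed_array,
)
print(kept)
```

In this example the link to `ads.example.net` is removed because that host is not one of the seed sites.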

Step 502: the experimental crawler module pre-crawls pages up to the quantity threshold according to the seed site URLs;

Preferably, the quantity threshold ranges from 1000 to 5000, and the experimental crawler module may crawl pages through the following steps:

Writing the seed site URLs into a database;

Updating the URL crawl count corresponding to the site parameter;

Updating the database according to the page data segment.

Step 503: the classification module determines the feature pages among the pre-crawled pages;

Preferably, the feature pages may comprise topic-relevant pages and topic-irrelevant pages, and the topic-relevant pages may specifically comprise topic-relevant content pages and topic-relevant directory pages.

Step 504: the regular expression learning module generates regular expressions that generalize the URLs of the feature pages;

Preferably, the regular expression learning module may learn the regular expressions of the feature page URLs through the following steps:

A set division submodule divides the feature page URLs into a plurality of URL subsets;

A clustering submodule aggregates the URL subsets into a plurality of URL classes;

An extraction submodule extracts the regular expressions of the URL classes.

Specifically, a URL may comprise a site parameter, a path parameter and a query parameter, and the set division submodule may divide the URL subsets through the following steps:

A first division unit places URLs having the same site parameter into the same URL subset;

A second division unit places URLs whose path parameters contain the same number of directories into the same URL subset;

A third division unit places URLs having the same query parameter into the same URL subset.
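The three division rules above can be combined into a single grouping key, as in the following sketch. It rests on assumptions that the text does not spell out: the site parameter is taken to be the host name, the path parameter is split on "/", and "the same query parameter" is interpreted as the same set of query keys. The function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def subset_key(url):
    """Build the grouping key: (site, number of directories, query keys)."""
    parts = urlsplit(url)
    site = (parts.hostname or "").lower()
    directories = [d for d in parts.path.split("/") if d]
    query_keys = tuple(sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True)))
    return site, len(directories), query_keys


def partition(urls):
    """Divide the URLs into groups that share the same key."""
    subsets = defaultdict(list)
    for url in urls:
        subsets[subset_key(url)].append(url)
    return dict(subsets)


feature_urls = [
    "http://a.example.com/item/1?id=3",
    "http://a.example.com/item/2?id=9",
    "http://a.example.com/list",
    "http://b.example.com/item/1?id=3",
]
subsets = partition(feature_urls)
for key, members in subsets.items():
    print(key, members)
```

Here the two `a.example.com/item/...` URLs fall into one group, while the `/list` page and the `b.example.com` page each form a group of their own.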

In this case, the clustering submodule may aggregate URL classes through the following steps:

A rule setting unit presets the clustering rule of the URL classes;

A processing unit reads a URL from the URL subset and determines whether the URL conforms to the clustering rule of a URL class; if so, the URL is assigned to that URL class; if not, a new URL class is created from the URL;

A statistics unit counts the number of URL classes and the total number of URLs;

A rule adjustment unit adjusts the clustering rule of the URL classes according to the statistics.
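The preset, assign, count and adjust loop described above can be illustrated with the sketch below. The text does not state the concrete clustering rule or how it is adjusted, so both are assumptions made purely for illustration: a URL joins a class if its path differs from the class's first URL in at most `max_diff` directory positions, and the rule is loosened while the ratio of classes to URLs exceeds `max_ratio`. All names and default values are hypothetical.

```python
from urllib.parse import urlsplit


def directories(url):
    """Split the path of a URL into its directories."""
    return [d for d in urlsplit(url).path.split("/") if d]


def fits(url, representative, max_diff):
    """Assumed clustering rule: at most max_diff differing directory positions."""
    a, b = directories(url), directories(representative)
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= max_diff


def cluster(subset, max_diff):
    """Assign each URL to the first class it fits, else open a new class."""
    classes = []  # the first URL of each class acts as its representative
    for url in subset:
        for cls in classes:
            if fits(url, cls[0], max_diff):
                cls.append(url)
                break
        else:
            classes.append([url])
    return classes


def cluster_with_adjustment(subset, max_diff=0, max_ratio=0.5, limit=3):
    """Re-cluster with a looser rule while there are too many classes per URL."""
    if not subset:
        return [], max_diff
    classes = cluster(subset, max_diff)
    while max_diff < limit and len(classes) / len(subset) > max_ratio:
        max_diff += 1  # assumed adjustment: tolerate one more differing directory
        classes = cluster(subset, max_diff)
    return classes, max_diff


subset = [
    "http://a.example.com/item/1",
    "http://a.example.com/item/2",
    "http://a.example.com/item/3",
    "http://a.example.com/shop/9",
]
classes, final_diff = cluster_with_adjustment(subset)
print(final_diff, classes)
```

With the strictest rule every URL forms its own class; after one adjustment the three `/item/...` URLs merge into a single class and the `/shop/9` URL remains separate.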

The extraction submodule then extracts, for each URL class, a URL regular expression that generalizes and represents that class. Specifically, following the URL data structure described above, each URL is decomposed into three parts, host, path and query; the path is further decomposed into a series of directories, and the query into a series of key-value pairs. Because the host part is necessarily identical within a class, the host is written directly into the regular expression. The directories of the path part are then aligned: if the directories at a given position are identical, that value is added to the regular expression; otherwise * is added. The query part is handled in a way similar to the path part. In this way a plurality of corresponding regular expressions is finally obtained.
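The alignment procedure can be sketched as follows. This is a minimal illustration, not the patented implementation: the `*` placeholder of the text is rendered as the regular expression fragments `[^/]+` (path) and `[^&]*` (query values), the scheme is assumed to be http or https, and all URLs of a class are assumed to have the same number of directories and the same query keys in the same order. The function names are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlsplit


def generalize(values, wildcard):
    """Keep a value shared by every URL at this position, else use a wildcard."""
    return re.escape(values[0]) if len(set(values)) == 1 else wildcard


def extract_pattern(url_class):
    """Build one regular expression that generalizes all URLs of a class."""
    parts = [urlsplit(u) for u in url_class]
    host = re.escape(parts[0].hostname or "")  # identical within a class
    paths = [[d for d in p.path.split("/") if d] for p in parts]
    queries = [parse_qsl(p.query, keep_blank_values=True) for p in parts]

    pattern = "^https?://" + host
    # Align the directories position by position.
    for column in zip(*paths):
        pattern += "/" + generalize(column, "[^/]+")
    # Align the key-value pairs of the query in the same way.
    pairs = []
    for column in zip(*queries):
        key = column[0][0]
        values = [value for _, value in column]
        pairs.append(re.escape(key) + "=" + generalize(values, "[^&]*"))
    if pairs:
        pattern += r"\?" + "&".join(pairs)
    return pattern + "$"


url_class = [
    "http://a.example.com/item/1?id=3",
    "http://a.example.com/item/2?id=9",
]
pattern = extract_pattern(url_class)
print(pattern)
```

The resulting pattern keeps the shared `item` directory and the `id` key as literals and generalizes the positions that differ, so it also matches unseen URLs of the same shape, such as `http://a.example.com/item/77?id=5`.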

Step 505: the matching module matches the seed site URLs against the regular expressions and retains the seed site URLs that satisfy the matching condition as crawl target URLs;

In this embodiment, for topic-relevant pages, a seed site URL satisfies the matching condition if it matches a regular expression of the topic-relevant pages; for topic-irrelevant pages, a seed site URL does not satisfy the matching condition if it matches a regular expression of the topic-irrelevant pages.
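The matching condition of step 505 can be written as a small predicate. Two points are not specified in the text and are assumptions of this sketch: a URL that matches both kinds of regular expression is rejected, and a URL that matches neither is also rejected. The function name and the example patterns are hypothetical.

```python
import re


def is_crawl_target(url, relevant_patterns, irrelevant_patterns):
    """Return True if the URL should be kept as a crawl target URL."""
    # Assumption: a match against a topic-irrelevant pattern takes precedence.
    if any(re.match(p, url) for p in irrelevant_patterns):
        return False
    return any(re.match(p, url) for p in relevant_patterns)


relevant = [r"^https?://a\.example\.com/item/[^/]+$"]
irrelevant = [r"^https?://a\.example\.com/help/[^/]+$"]

candidates = [
    "http://a.example.com/item/42",
    "http://a.example.com/help/faq",
    "http://a.example.com/about",
]
targets = [u for u in candidates if is_crawl_target(u, relevant, irrelevant)]
print(targets)
```

Only the `/item/42` URL is kept in this example; the `/help/faq` URL matches an irrelevant pattern and the `/about` URL matches nothing.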

Step 506: the focused crawler module crawls pages according to the crawl target URLs.

Preferably, this embodiment may further comprise the following steps:

A quantity statistics module counts the number of URLs matched by each regular expression;

A regular expression screening module deletes a regular expression when the number of URLs it matches is less than a preset filtering threshold.
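The counting and screening steps can be sketched together. The text does not say which URLs the counts are taken over, so the sketch simply accepts any URL list; the function name `screen_patterns` and the threshold value used in the example are hypothetical.

```python
import re


def screen_patterns(patterns, urls, min_matches):
    """Count the URLs matched by each pattern and drop patterns below the threshold."""
    counts = {p: sum(1 for u in urls if re.match(p, u)) for p in patterns}
    kept = [p for p in patterns if counts[p] >= min_matches]
    return kept, counts


patterns = [
    r"^https?://a\.example\.com/item/[^/]+$",
    r"^https?://a\.example\.com/promo/[^/]+$",
]
urls = [
    "http://a.example.com/item/1",
    "http://a.example.com/item/2",
    "http://a.example.com/item/3",
    "http://a.example.com/promo/x",
]
kept, counts = screen_patterns(patterns, urls, min_matches=2)
print(kept, counts)
```

With a threshold of 2, the `/promo/` pattern is deleted because it matches only one URL, while the `/item/` pattern, which matches three, is kept.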

Because the device embodiment substantially corresponds to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding description of the method embodiment.

The method and the device for directionally crawling page resources provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, in accordance with the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for directionally crawling page resources, characterized by comprising:

Determining feature pages among the pre-crawled pages;

Generating regular expressions that generalize the URLs of the feature pages;

Crawling pages according to the crawl target URLs.

2. The method of claim 1, characterized in that the step of generating regular expressions comprises:

Dividing the feature page URLs into a plurality of URL subsets;

Aggregating the URL subsets into a plurality of URL classes;

Extracting the regular expressions of the URL classes.

3. The method of claim 2, characterized in that the URL comprises a site parameter and a path parameter, and the sub-step of dividing into a plurality of URL subsets comprises:

Placing URLs having the same site parameter into the same URL subset;

4. The method of claim 3, characterized in that the URL further comprises a query parameter, and the sub-step of dividing into a plurality of URL subsets further comprises:

5. The method of claim 2, 3 or 4, characterized in that the sub-step of aggregating into a plurality of URL classes comprises:

Presetting the clustering rule of the URL classes;

6. The method of claim 5, characterized in that the sub-step of aggregating into a plurality of URL classes further comprises:

Counting the number of URL classes and the total number of URLs;

7, the method for claim 1 is characterized in that, before grasping page step in advance, also comprises:

8, method as claimed in claim 7 is characterized in that, the step of described filter seed website URL comprises:

Read in described seed website URL to array, described array is sorted;

9, the method for claim 1 is characterized in that, the span of described amount threshold is 1000 to 5000.

As claim 1,7 or 9 described methods, it is characterized in that 10, the described step that grasps the page in advance comprises:

Described seed website URL is write database;

The URL that upgrades described site parameter correspondence grasps quantity;

Upgrade described database according to described page data section.

11, the method for claim 1 is characterized in that, the described feature page comprises the theme related pages, and described theme related pages comprises the theme associated content pages and the theme associative directory page.

12, method as claimed in claim 11 is characterized in that, described coupling step comprises:

13, method as claimed in claim 11 is characterized in that, the described feature page also comprises the theme page that has nothing to do.

14, method as claimed in claim 13 is characterized in that, described coupling step also comprises:

15, as claim 11 or 13 described methods, it is characterized in that, also comprise:

Add up the URL quantity of described regular expression coupling;

16. A device for directionally crawling page resources, characterized by comprising:

17. The device of claim 16, characterized in that the regular expression learning module comprises:

18. The device of claim 17, characterized in that the URL comprises a site parameter and a path parameter, and the set division submodule comprises:

19. The device of claim 18, characterized in that the URL further comprises a query parameter, and the set division submodule further comprises:

20. The device of claim 17, 18 or 19, characterized in that the clustering submodule comprises:

21. The device of claim 20, characterized in that the clustering submodule further comprises:

22. The device of claim 16, characterized by further comprising:

23. The device of claim 22, characterized in that the seed filtering module comprises:

24. The device of claim 16, characterized by further comprising: