CN103841173A - Vertical web spider

Info

Publication number: CN103841173A
Application number: CN201210495397.7A
Authority: CN
Inventors: 郑世超; 苏晓华
Original assignee: DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Current assignee: DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Priority date: 2012-11-27
Filing date: 2012-11-27
Publication date: 2014-06-04

Abstract

The invention relates to a vertical web spider, a concept defined in contrast to the web spider of a general-purpose search engine. A vertical search engine differs from a general-purpose one in that it serves a specific user group and covers only information in a specific field, so a vertical web spider does not need to traverse the whole Web when crawling; it only needs to visit pages relevant to that field. Its page acquisition technique is very different from that of a general-purpose web spider, and its algorithm and workflow are more complex. While crawling the Web, the vertical web spider judges the topic relevance of each page with a page analysis algorithm, predicts and identifies the topic of each discovered URL, and places the useful links into a queue of URLs waiting to be fetched. It then selects the next URL to fetch from the queue according to a search strategy, and repeats the process until a stop condition of the system is met.

Description

A vertical web spider

Technical field

The present invention relates to web spider technology, and in particular to a vertical web spider for a vertical search engine.

Background technology

The web spider is a basic component of a search engine and the starting point of its workflow; its performance directly affects the overall performance of the search engine. When gathering Web information, the web spider of a general-purpose search engine normally starts from a seed set, requests and downloads Web pages over the HTTP protocol, parses each page and extracts its links, and then visits the newly discovered links, traversing the Web by continuous outward diffusion. Viewed on the topology graph of the whole Internet, the web spider starts from several discrete nodes and, following the edges formed by links between pages, gradually reaches every node on the graph. This is the typical working mode of a general-purpose web spider. In terms of graph traversal, a general-purpose web spider can proceed depth-first, breadth-first, and so on. Its main shortcomings are low coverage of Web pages and poor freshness of the fetched pages.
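
As an illustration of the breadth-first mode mentioned above, the following minimal sketch traverses a link graph from a seed set. The function `breadth_first_crawl`, the callback `get_links`, and the in-memory graph `toy_links` are hypothetical names introduced here; a real spider would fetch pages over HTTP and parse links instead of reading a dictionary.

```python
from collections import deque


def breadth_first_crawl(seeds, get_links, max_pages=1000):
    """Visit pages level by level, starting from the seed URLs.

    get_links(url) is a caller-supplied function returning the out-links of a page.
    """
    queue = deque(seeds)   # FIFO queue gives breadth-first order
    seen = set(seeds)      # URLs already queued, to avoid revisiting
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical four-page link graph standing in for the Web.
toy_links = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
visit_order = breadth_first_crawl(["A"], lambda url: toy_links.get(url, []))
print(visit_order)
```

Replacing `popleft()` with `pop()` would turn the same loop into a depth-first traversal.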

A vertical web spider, also called a specialized web spider or a topical web crawler, is a concept defined in contrast to the web spider of a general-purpose search engine. Unlike a general-purpose search engine, a vertical search engine serves a specific group of users and is concerned with the information of one professional field. A vertical web spider therefore has no need to traverse the whole Web while crawling; it only needs to select and visit pages relevant to that field. Compared with a general-purpose web spider, its page acquisition technique is very different, and its algorithm and workflow are more complex. While crawling the Web, the vertical web spider must judge the topic relevance of each page with a page analysis algorithm, predict and identify the topic of each URL it finds, keep the useful links, and put them into the queue of URLs waiting to be fetched. It then selects the next URL to fetch from the queue according to a search strategy and repeats the process, stopping when a stop condition of the system is reached. In addition, every fetched page is stored by the system, analyzed, filtered, and indexed for later retrieval. For a vertical web spider, the results of this analysis may also feed back into and guide the subsequent crawl.

Summary of the invention

To solve the problems of the prior art, the present invention provides a vertical web spider whose operation comprises the following steps:

A, topic target description

A1, specify the initial seed URLs

According to the characteristics of the target web pages in the field, the initial seed URLs, which are the pages from which the web spider starts crawling, are given in advance;

A2, establish the topic feature keywords

Feature keywords are first extracted automatically from a collection of web pages, and then screened and adjusted manually;

After the topic features are established, the vertical spider can also learn dynamically as the crawl deepens, expanding the keyword set;

B, Webpage search:

B1, search strategy

A best-first search strategy is adopted: a queue of URLs to be crawled is built dynamically, the URLs in the queue are sorted according to an evaluation strategy, and the best URL is selected for crawling first each time;

B2, URL Evaluation Strategy

An evaluation method based on web page content is adopted; a topic discrimination method is used to calculate the topic relevance of each web page, and pages whose topic relevance falls below a threshold are discarded;

C, topic relevance judgment

A vector space model based on web page content and structure is adopted; the procedure is as follows;

C1, preprocessing

Before the web spider starts collecting, keywords are extracted from the seed pages that describe the topic and weighted, yielding the feature vector of the topic and the weights of its components;

C2, text processing

The body text of each page collected by the spider is segmented into words, stop words are removed, and keywords are retained; a weighted frequency is then calculated for each keyword according to the positions at which it occurs in the page, using the formula TF_{i} = a \times TF_{m} + b \times TF_{t} + c \times TF_{k} + d \times TF_{d} + e \times TF_{a};

C3, keyword expansion

The page keywords obtained are adjusted and expanded according to the feature vector set for the topic;

C4, calculate the similarity between the page and the topic

The similarity between the page and the topic is calculated according to the following formula;

\mathrm{Sim}(D) = \cos\theta = \frac{\sum_{i=1}^{n} D_{i} \times T_{i}}{\sqrt{\left(\sum_{i=1}^{n} D_{i}^{2}\right) \times \left(\sum_{i=1}^{n} T_{i}^{2}\right)}}

C5, judge whether the page is relevant to the topic

The similarity value is compared with a predefined threshold d; if the similarity is greater than or equal to d, the page is relevant to the topic and is downloaded and saved locally; otherwise it is judged irrelevant and discarded.

Compared with the prior art, the present invention has the following beneficial effects:

1, as many topic-relevant web pages as possible can be found without traversing the whole Web, which reduces network bandwidth consumption and also saves local storage space and computing time;

2, because far fewer pages need to be fetched, timely updating of the pages becomes possible;

3, more topic-relevant pages can be indexed at lower hardware cost.

Brief description of the drawings

The present invention has 2 drawings in total, wherein:

Fig. 1 is the system architecture diagram of the vertical web spider;

Fig. 2 is the workflow diagram of the vertical web spider.

Detailed description of the embodiments

The system architecture and the detailed workflow of the vertical web spider are shown in Fig. 1 and Fig. 2 respectively. Compared with the web spider of a general-purpose search engine, the vertical web spider must additionally solve three main problems: topic target description, the web page search strategy, and the topic relevance judgment algorithm. Each part is implemented as follows.

A, topic target description

A1, specify the initial seed URLs

According to the characteristics of the target web pages in the field, the initial seed URLs, which are the pages from which the web spider starts crawling, are given in advance. The choice of seed pages directly affects the quality of the vertical spider's search. The principle for choosing initial seed URLs is that a seed page should itself be highly relevant to the topic and should link widely to topic resources on other authoritative websites; it can be the home page of a website or one of its subpages. Starting from these addresses, the web spider not only obtains rich resources but also widens the topic search and covers as many topic resources as possible, ultimately maximizing the crawl target.

A2, establish the topic feature keywords

Feature keywords are first extracted automatically from a collection of web pages, and then screened and adjusted manually to achieve the best result. The web pages of the field collected initially should be broad in scope and sufficient in number; the keyword feature vector then has a wider distribution, the computed weights are more accurate, and the hit rate of later topic resource collection is higher. After the topic features are established, the vertical spider can also learn dynamically as the crawl deepens, expanding the keyword set so as to cover the field's information as fully as possible and judge topic relevance accurately.

B, Webpage search

B1, search strategy

A best-first search strategy is adopted. The basic idea of this algorithm is to build the queue of URLs to be crawled dynamically, sort the URLs in the queue according to an evaluation strategy, and each time select the best URL to crawl first.
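
One common way to realize such a best-first queue is a priority queue. The sketch below is only an illustration under that assumption: the class `BestFirstFrontier`, its methods, and the example URLs and scores are hypothetical and are not taken from the patent, which does not say how the queue is implemented.

```python
import heapq
import itertools


class BestFirstFrontier:
    """URL queue that always yields the highest-scoring URL not yet crawled."""

    def __init__(self):
        self._heap = []                    # entries: (-score, tie_breaker, url)
        self._counter = itertools.count()  # preserves insertion order among equal scores
        self._seen = set()                 # URLs that have ever been queued

    def push(self, url, score):
        """Queue a URL once; later duplicates are ignored."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best URL first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))
        return True

    def pop(self):
        """Return (url, score) for the best queued URL."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


frontier = BestFirstFrontier()
frontier.push("http://example.com/a", 0.2)
frontier.push("http://example.com/b", 0.9)
frontier.push("http://example.com/c", 0.5)
crawl_order = [frontier.pop()[0] for _ in range(len(frontier))]
print(crawl_order)
```

A heap keeps the queue ordered as URLs arrive, so the queue never has to be fully re-sorted after each insertion.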

B2, URL Evaluation Strategy

An evaluation method based on web page content is adopted. Web page content accurately expresses the topic of a page, and if two pages are connected by a hyperlink, they are very likely to belong to the same topic. The relevance of the URLs contained in a page can therefore be predicted from the relevance between the page's text and the topic. The more relevant a page is to the topic, the higher the priority of the URLs it contains, and this determines the order of URLs in the queue to be crawled. This prediction can of course be wrong in some cases, but such errors do not affect the quality of the pages the web spider collects: before the page corresponding to a URL is saved locally, a topic discrimination method is used to calculate the topic relevance of the page, and pages whose topic relevance falls below a threshold are discarded. In that situation only the performance of the web spider is affected to some extent.
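
The interplay between predicted link priority and the later relevance check can be sketched as a small crawl loop. Everything named here (`crawl`, `fetch`, `relevance`, the toy pages) is hypothetical. The sketch also assumes that the links of a discarded page are not followed, which the text does not state.

```python
import heapq


def crawl(seeds, fetch, relevance, threshold, max_pages=100):
    """Best-first crawl in which out-links inherit the relevance of their parent page.

    fetch(url) -> (text, links) and relevance(text) -> float are supplied by the caller.
    """
    heap = [(-1.0, url) for url in seeds]  # seeds start with the top priority
    heapq.heapify(heap)
    seen = set(seeds)
    kept = []
    fetched = 0
    while heap and fetched < max_pages:
        _, url = heapq.heappop(heap)
        text, links = fetch(url)
        fetched += 1
        score = relevance(text)
        if score < threshold:
            continue  # mispredicted URL: the page is discarded (assumption: links not followed)
        kept.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(heap, (-score, link))  # predicted priority = parent relevance
    return kept


# Hypothetical pages: url -> (text, out-links).
toy_web = {
    "seed": ("crawler crawler index", ["on_topic", "off_topic"]),
    "on_topic": ("crawler index", ["deep"]),
    "off_topic": ("cooking recipes", ["hidden"]),
    "deep": ("crawler", []),
    "hidden": ("crawler", []),
}


def toy_relevance(text):
    """Fraction of words that are topic keywords (a stand-in for the real judgment)."""
    words = text.split()
    return sum(w in {"crawler", "index"} for w in words) / len(words)


kept_pages = crawl(["seed"], toy_web.__getitem__, toy_relevance, threshold=0.5)
print(kept_pages)
```

In this toy run the off-topic page is still fetched, because its parent was relevant, but it is dropped by the threshold check. This is the case the paragraph describes: collection quality is unaffected and only crawl effort is wasted.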

C, topic relevance judgment

A vector space model based on web page content and structure is adopted. The procedure is as follows.

C1, preprocessing

Before the web spider starts collecting, keywords are extracted from the seed pages that describe the topic and weighted, yielding the feature vector of the topic and the weights of its components.
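
The text does not specify how keywords are extracted or weighted. As a rough sketch, assuming normalized term frequency as the weight and whitespace splitting in place of real word segmentation, a topic vector could be built as follows. The names `build_topic_vector` and `STOP_WORDS` and the sample texts are hypothetical.

```python
from collections import Counter

# Illustrative stop-word list only; a real system would use a full list.
STOP_WORDS = {"the", "a", "of", "and", "to", "is", "in"}


def build_topic_vector(seed_texts, top_k=5):
    """Return {keyword: weight} for the top_k most frequent non-stop words.

    Assumption: weight = share of the keyword among the selected keywords' occurrences.
    """
    counts = Counter()
    for text in seed_texts:
        counts.update(w for w in text.lower().split() if w not in STOP_WORDS)
    top = counts.most_common(top_k)
    if not top:
        return {}
    total = sum(count for _, count in top)
    return {word: count / total for word, count in top}


topic = build_topic_vector(
    ["the crawler fetches pages", "a crawler follows links to pages", "index of pages"],
    top_k=2,
)
print(topic)
```

The manual screening step described in A2 would then operate on this automatically produced keyword set.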

C2, text processing

The body text of each page collected by the spider is segmented into words, stop words are removed, and keywords are retained. A weighted frequency is then calculated for each keyword according to the positions at which it occurs in the page, using the formula TF_{i} = a \times TF_{m} + b \times TF_{t} + c \times TF_{k} + d \times TF_{d} + e \times TF_{a}.
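
The formula can be read as a position-weighted sum of per-location frequencies. The text does not define the subscripts m, t, k, d, a or give values for the coefficients a to e, so the sketch below treats the subscripts as five opaque page locations and uses made-up coefficients. `LOCATION_WEIGHTS` and `weighted_term_frequency` are hypothetical names.

```python
from collections import Counter

# Hypothetical coefficients a..e, keyed by the subscripts used in the formula.
LOCATION_WEIGHTS = {"m": 1.0, "t": 3.0, "k": 2.0, "d": 2.0, "a": 1.5}


def weighted_term_frequency(keywords_by_location, weights=LOCATION_WEIGHTS):
    """Sum each keyword's count per location, scaled by that location's coefficient."""
    weighted = Counter()
    for location, keywords in keywords_by_location.items():
        coeff = weights[location]
        for word, count in Counter(keywords).items():
            weighted[word] += coeff * count
    return dict(weighted)


# Keywords already segmented and stop-word filtered, grouped by page location.
page = {"t": ["crawler"], "m": ["crawler", "crawler", "index"], "a": ["index"]}
tf = weighted_term_frequency(page)
print(tf)
```

With these coefficients, "crawler" scores 3.0 × 1 + 1.0 × 2 = 5.0 and "index" scores 1.0 × 1 + 1.5 × 1 = 2.5.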

C3, keyword expansion

The page keywords obtained are adjusted and expanded according to the feature vector set for the topic.

C4, calculate the similarity between the page and the topic

The similarity between the page and the topic is calculated according to the following formula.

\mathrm{Sim}(D) = \cos\theta = \frac{\sum_{i=1}^{n} D_{i} \times T_{i}}{\sqrt{\left(\sum_{i=1}^{n} D_{i}^{2}\right) \times \left(\sum_{i=1}^{n} T_{i}^{2}\right)}}
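
This is the cosine of the angle between the page vector D and the topic vector T. A minimal sketch, assuming both vectors are stored as keyword-to-weight dictionaries and that a keyword missing from one vector has weight zero, is given below. The names `cosine_similarity` and `is_relevant` and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(page_vec, topic_vec):
    """Cosine of the angle between two sparse keyword-weight vectors."""
    dot = sum(w * topic_vec[k] for k, w in page_vec.items() if k in topic_vec)
    norm_page = math.sqrt(sum(w * w for w in page_vec.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_vec.values()))
    if norm_page == 0 or norm_topic == 0:
        return 0.0  # an empty vector is treated as unrelated
    return dot / (norm_page * norm_topic)


def is_relevant(page_vec, topic_vec, d):
    """Step C5: keep the page when its similarity reaches the threshold d."""
    return cosine_similarity(page_vec, topic_vec) >= d


topic_vec = {"crawler": 3.0, "index": 4.0}
same_direction = cosine_similarity({"crawler": 6.0, "index": 8.0}, topic_vec)
unrelated = cosine_similarity({"cooking": 1.0}, topic_vec)
print(same_direction, unrelated)
```

Because the measure depends only on direction, a long page and a short page with the same keyword proportions receive the same score.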

C5, judge whether the page is relevant to the topic

The threshold is obtained mainly by combining manual setting with training statistics. In the initial stage the threshold is set manually to a relatively high value, to prevent a large number of irrelevant pages from entering at the start, which would cause many irrelevant pages to be collected during the subsequent crawl and incur unnecessary overhead. A number of relevant pages can be sampled, their relevance values calculated, and the mean of those values used as the initial threshold. Then, at regular intervals, a number of the raw HTML documents already crawled are randomly sampled, their relevance is judged manually, and the collection accuracy is calculated. The accuracy is measured repeatedly: if the hit rate is very stable and remains high, the threshold is lowered by a certain step so that the crawl covers the topic as widely as possible; if the hit rate is very low and unstable, the threshold is raised by a certain step to improve the hit rate. This process is repeated until the statistics yield the threshold that achieves the maximum hit rate.
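
The adjustment rule can be sketched as follows. The text gives no numbers, so the step size and the cut-offs for "high", "low", and "stable" are hypothetical parameters, as are the function names. Leaving the threshold unchanged in all other cases is likewise an assumption.

```python
from statistics import mean


def initial_threshold(sample_scores):
    """Initial threshold: mean relevance of a sample of known relevant pages."""
    return mean(sample_scores)


def adjust_threshold(threshold, hit_rates, step=0.05, high=0.9, low=0.6, spread=0.05):
    """Adjust the threshold from several manually judged sampling rounds.

    hit_rates: fraction of sampled crawled pages judged relevant in each round.
    """
    avg = mean(hit_rates)
    stable = max(hit_rates) - min(hit_rates) <= spread
    if stable and avg >= high:
        return max(0.0, threshold - step)  # accurate and steady: widen coverage
    if not stable and avg < low:
        return min(1.0, threshold + step)  # inaccurate and erratic: tighten
    return threshold                       # assumption: otherwise leave unchanged


start = initial_threshold([0.6, 0.7, 0.8])
lowered = adjust_threshold(0.7, [0.95, 0.93, 0.96])
raised = adjust_threshold(0.7, [0.3, 0.55, 0.4])
unchanged = adjust_threshold(0.7, [0.8, 0.7, 0.9])
print(start, lowered, raised, unchanged)
```

Calling `adjust_threshold` after each batch of manual judgments reproduces the repeat-until-stable loop the paragraph describes.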

Claims

1. A vertical web spider, characterized in that its operation comprises the following steps:

A, topic target description

A1, specify the initial seed URLs

A2, establish the topic feature keywords

B, web page search:

B1, search strategy

B2, URL evaluation strategy

C, topic relevance judgment

C1, preprocessing

C2, text processing

C3, keyword expansion

C4, calculate the similarity between the page and the topic

The similarity between the page and the topic is calculated according to the following formula;

\mathrm{Sim}(D) = \cos\theta = \frac{\sum_{i=1}^{n} D_{i} \times T_{i}}{\sqrt{\left(\sum_{i=1}^{n} D_{i}^{2}\right) \times \left(\sum_{i=1}^{n} T_{i}^{2}\right)}}

C5, judge whether the page is relevant to the topic