CN101520798A

CN101520798A - Webpage classification technology based on vertical search and focused crawler

Info

Publication number: CN101520798A
Application number: CN200910025724A
Authority: CN
Inventors: 王攀; 张顺颐; 宫婷
Original assignee: RUNTREND TECHNOLOGY Inc
Current assignee: RUNTREND TECHNOLOGY Inc
Priority date: 2009-03-06
Filing date: 2009-03-06
Publication date: 2009-09-02

Abstract

The invention provides a method for identifying webpage categories based on vertical search and a focused crawler. The method comprises two parts: a webpage source code acquisition method and a webpage content analysis method. The content analysis method is the key part and itself has two main components: extraction of the structured information of a webpage, and the crawling strategy of the focused crawler. First, a URL is selected from a list of navigation site URLs and its source file is acquired; the webpage content analysis method then identifies and acquires all categorized URLs of the navigation site. The content analysis method first extracts the structured information of the webpage, then fetches URLs using a directed breadth-first search strategy based on webpage content features, and finally stores the fetched URLs and their corresponding website categories in a list named Category.

Description

Webpage classification technology based on vertical search and focused crawler

Technical field

The present invention concerns webpage classification and recognition methods for a vertical search engine that works from a fixed list of navigation-type websites. It mainly studies how to obtain the classification information of webpages effectively using vertical search and focused crawler technology, and designs a recognition model and algorithm for webpage classification. It touches on several fields, including vertical search, focused crawlers, web data extraction, machine learning, data mining and natural language processing.

Background technology

As the volume of information keeps growing, people rely more and more on search engines. General-purpose search engines such as Baidu and Google offer great convenience, but as users' needs diversify and their expectations of result quality rise, general-purpose search engines cannot meet those needs in some specialized fields. Vertical search emerged in response. It is a precise search technology that serves a particular professional domain: it is more specialized, returns more targeted results, and, by drawing on the domain knowledge of a specific industry topic, can support queries based on semantic information, thereby satisfying users' specialized search needs.

As vertical search engines grow in popularity, the focused crawler, a key technology of the vertical search engine, becomes increasingly important. A focused crawler is a program that downloads webpages automatically. Guided by a predefined crawl target, it selectively visits webpages and related links on the World Wide Web to obtain the required information.

Webpage classification and recognition based on vertical search and focused crawler technology is difficult for the following reasons:

First, it is hard for a focused crawler to decide which URLs in the queue of URLs to be crawled most likely lead to pages containing topic-relevant information.

Second, many current open-source crawler systems cannot extract webpage structure information from crawled pages in a directed way.

Third, the content and structure of a given webpage change frequently, and the revisit strategy of a focused crawler has difficulty adapting to such change.

Conventional open-source focused crawler technology therefore struggles to identify webpages of different categories accurately, and a different approach is needed.

Through research on vertical search and focused crawler technology, the following problems can be solved:

(1) Using vertical search and a focused crawler to obtain, from navigation websites, the web addresses belonging to each category.

(2) Returning precise, targeted search results for a user's specialized search on a specific industry topic.

(3) Using vertical search and a focused crawler to determine the webpage category of an unknown URL on a classification website.

Summary of the invention

Technical problem: The object of the invention is to establish, for navigation-type websites, a webpage classification and recognition method based on vertical search and focused crawler technology, and to design its recognition model and algorithm. By recognizing navigation-type websites, the method obtains the URLs of the different categories they contain, which makes it easier for users to search for websites precisely, and it can also report the webpage category to which an unknown URL belongs.

Technical solution: The invention proposes a technical framework for effectively identifying the URLs of each category in navigation-type websites, and designs the recognition algorithm in detail. The system is divided into three layers, from bottom to top: the data collection layer, the content parsing layer, and the application presentation layer.

The key method described here is the webpage classification and recognition method based on vertical search and focused crawler technology. It comprises two parts: webpage source code acquisition and the webpage content analysis method. The content analysis method is the core and has two main components: extracting the structured information of a webpage, and the crawling strategy of the focused crawler.

Our study of the source code of navigation websites shows that a navigation-type website basically consists of two kinds of pages: a main directory page and a sub-directory page for each category. The main directory page contains a large number of links to the category sub-pages, and each category's sub-directory page contains links to the websites belonging to that category. The sub-directory pages on the same navigation website are also highly similar to one another, that is, they share a similar structure, so pattern learning can summarize the structured information of these pages in one (or a few) regular expressions. Once the regular expressions representing this page structure are found, they can guide the focused crawler to fetch, as far as possible, only pages relevant to the classification. Taking www.hao123.com as an example, to find all URLs in the "amusement and recreation" category one can write the regular expression href\s*=\s*(?:"(?<1>[^"]*)"|(?<1>\S+)), which matches links of the form href="..." in a string, and thereby obtain all URLs of that category.

To cope with the irregular updates of navigation websites and extract the structure information of directory pages more reliably, we add a timed learner for the URL regular expressions, so that the method adapts to ongoing changes in the navigation website. The invention also proposes, with reference to three existing URL search strategies, a directed breadth-first search strategy based on webpage content features. Its basic idea is that, during crawling, the structured information of a webpage is first extracted in a directed way according to the page's content features, and pages are then fetched from that structured information in breadth-first order. This effectively reduces the number of pages collected, saves network bandwidth, and improves the efficiency of information search.
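The href pattern quoted above can be tried out with a short Python sketch. This is an illustration only, not the patent's implementation: the pattern in the text uses duplicate numbered groups in the .NET style, which Python's re module does not accept, so two ordinary capture groups are assumed instead. The sample markup, the example.com addresses and the function name extract_links are hypothetical.

```python
import re

# Python adaptation of the href pattern from the text. Two plain groups
# replace the duplicate numbered groups of the original (an assumption).
HREF_RE = re.compile(r'href\s*=\s*(?:"([^"]*)"|(\S+))', re.IGNORECASE)

# Hypothetical category block, standing in for a sub-directory page.
SAMPLE_BLOCK = '''
<div id="category-block">
  <a href="http://example.com/music">Music</a>
  <a href = "http://example.com/games">Games</a>
  <a href=http://example.com/video target="_blank">Video</a>
</div>
'''


def extract_links(block):
    """Return every href value found in block, in document order."""
    links = []
    for match in HREF_RE.finditer(block):
        quoted, bare = match.groups()
        # Exactly one of the two groups takes part in each match.
        links.append(quoted if quoted is not None else bare)
    return links


if __name__ == "__main__":
    for link in extract_links(SAMPLE_BLOCK):
        print(link)
```

The second alternative (\S+) handles unquoted attribute values and stops at the first whitespace character, so it is only reliable when an unquoted href is followed by a space.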

The following describes each layer of this design in detail, along with the webpage classification and recognition method and the webpage content analysis method based on vertical search and a focused crawler.

1. Data collection layer

Function: This layer collects web data over the various Web protocols and hands the collected pages to the layer above for further processing.

Interface: This layer is the interface between the focused crawler and the Internet. Its interface to the layer above is the webpage source code as string data, which serves as the upper layer's input.
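A minimal sketch of such a collection function, assuming Python's standard urllib.request module as the transport. The function name fetch_source, the timeout and the fallback encoding are hypothetical choices for illustration; the patent does not specify them.

```python
import urllib.request


def fetch_source(url, timeout=10.0, fallback_encoding="utf-8"):
    """Download url and return the page source as a string.

    The charset declared in the response headers is used when present;
    otherwise fallback_encoding is assumed.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        charset = response.headers.get_content_charset() or fallback_encoding
    # Undecodable bytes are replaced rather than raising, so that one
    # badly encoded page does not stop the crawl.
    return raw.decode(charset, errors="replace")
```

The returned string is what the content parsing layer would receive as its input.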

2. Content parsing layer

Function: This layer is the core of the whole framework. It parses the HTML of the pages collected by the data collection layer below, extracts the hyperlinks in them, and adds them to the URL queue. Links in a page generally appear in several forms: a link may be complete, containing protocol, site and path, or it may omit some of these parts or be a relative path. The webpage content analysis method is therefore used to extract the structured information of the page, webpage URLs are fetched from that structured information in breadth-first order, and the result is the mapping table Category between website categories and URLs, which supports webpage classification searches in the application presentation layer above.

Interface: The interface between this layer and the webpage classification application is a mapping table, namely the table of website categories and their corresponding URLs.

The main method of this layer is the webpage content analysis method, which has two main components: extracting the structured information of a webpage, and the crawling strategy of the focused crawler. Webpage structure information is extracted first, and URLs are then fetched using the directed breadth-first search strategy based on webpage content features.
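Because links may be complete, partial or relative, they have to be resolved against the address of the page they came from before they enter the URL queue. The sketch below shows one way to do this with Python's standard urllib.parse module. The function name normalize_links, the rule of keeping only http and https links, and the removal of fragments are assumptions made for illustration.

```python
from urllib.parse import urldefrag, urljoin


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url and drop duplicates, keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin handles absolute URLs, scheme-relative URLs ("//host/x"),
        # root-relative paths ("/x") and plain relative paths ("x.html").
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        # Assumption: only web pages are of interest, so links such as
        # "javascript:..." or "mailto:..." are skipped.
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```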

◆ Webpage content analysis method. The procedure is shown in Fig. 2.

(1) Use the focused crawler to fetch the source file of the webpage;

(2) Determine whether the webpage matches the page structure features obtained through pattern learning by the timed regular-expression learner; if not, go to step (6);

(3) Use the regular expression to extract the structured information of the webpage, that is, the content block holding the web address classification information;

(4) Extract the new URLs that meet the requirements from the structured information block according to the regular expression;

(5) Add the new URLs to the URL queue;

(6) Determine whether the URL queue is empty; if it is, go to step (8);

(7) Take a URL from the queue and determine whether it satisfies the search strategy; if it does, add it to the web address classification table Category and go to step (1); otherwise, go to step (6);

(8) End.
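Steps (1) to (8) can be sketched in Python as a loop around a first-in, first-out queue, which gives the breadth-first order. This is a simplified illustration under stated assumptions, not the patented implementation: fetch, block_re, accept and category are hypothetical parameters standing in for the crawler, the learned structure pattern, the search strategy and the category label, and the seen set is added here only to guarantee that the loop terminates.

```python
import re
from collections import deque

# Simplified link pattern for quoted href attributes (an assumption).
HREF_RE = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)


def parse_site(start_url, fetch, block_re, accept, category):
    """Return the Category table as a list of (category, url) pairs."""
    queue = deque()          # URL queue, FIFO for breadth-first order
    table = []               # Category table
    seen = {start_url}       # not in the source; prevents endless revisits
    url = start_url
    while True:
        source = fetch(url)                              # step (1)
        block = block_re.search(source)                  # step (2)
        if block:
            for link in HREF_RE.findall(block.group(0)):  # steps (3)-(4)
                if link not in seen:
                    seen.add(link)
                    queue.append(link)                   # step (5)
        url = None
        while queue:                                     # step (6)
            candidate = queue.popleft()                  # step (7)
            if accept(candidate):
                table.append((category, candidate))
                url = candidate                          # back to step (1)
                break
        if url is None:
            return table                                 # step (8)
```

Passing fetch as a parameter keeps the sketch independent of any particular download mechanism and lets it be exercised with an in-memory dictionary of pages.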

3. Application presentation layer for webpage classification and recognition

Function: Accepts user input and returns search results. By entering a keyword, the user can search precisely for web addresses in a specific domain; for an unknown URL, the user can also look up the website category to which it belongs.
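The two lookups described for this layer can be sketched against a small in-memory Category table. The table contents, the function names, and in particular the rule of matching an unknown URL by its host name are all assumptions made for illustration; the patent does not state how the match is performed.

```python
from urllib.parse import urlsplit

# Hypothetical Category table: (category, url) pairs.
CATEGORY = [
    ("Entertainment", "http://music.example.com/"),
    ("Entertainment", "http://games.example.com/"),
    ("News", "http://news.example.org/"),
]


def urls_for_keyword(table, keyword):
    """Return the URLs whose category name contains keyword (case-insensitive)."""
    key = keyword.casefold()
    return [url for category, url in table if key in category.casefold()]


def category_of(table, url):
    """Return the category of a URL by comparing host names, or None."""
    host = urlsplit(url).netloc.lower()
    for category, known_url in table:
        if urlsplit(known_url).netloc.lower() == host:
            return category
    return None
```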

■ Webpage classification and recognition method based on vertical search and focused crawler technology. The procedure is shown in Fig. 1.

(1) Read the preset URL list of Web navigation sites and determine whether the list is empty; if it is, go to step (8);

(2) Take a site URL from the list and put it in the list of unvisited URLs (the UV_URL list);

(3) If the UV_URL list is empty, go to step (1);

(4) Take a URL from the UV_URL list and check against the V_URL table whether it has already been visited; if it has, go to step (3);

(5) Acquire the webpage source code for the URL and parse the page content using vertical search and focused crawler technology, obtaining the webpage categories of the site and the website information under each category;

(6) Add the webpage categories and the website information under each category to the Category list;

(7) Delete the URL from the UV_URL table, add it to V_URL, and go to step (1);

(8) End.
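The overall procedure can be sketched as two nested loops over the preset site list and the UV_URL list. This is a simplified illustration: analyze is a hypothetical callable that stands in for step (5) and returns (category, url) pairs, and the sketch assumes that taking a URL out of a list also removes it from that list.

```python
def classify_sites(nav_sites, analyze):
    """Return (category_list, visited_urls) for a list of navigation sites."""
    pending = list(nav_sites)   # preset navigation site URL list
    uv_url = []                 # UV_URL: URLs not yet visited
    v_url = set()               # V_URL: URLs already visited
    category = []               # Category list
    while pending:                          # step (1): stop when list is empty
        uv_url.append(pending.pop(0))       # step (2)
        while uv_url:                       # step (3)
            url = uv_url.pop(0)             # step (4); also the removal in step (7)
            if url in v_url:
                continue                    # already visited, back to step (3)
            category.extend(analyze(url))   # steps (5)-(6)
            v_url.add(url)                  # step (7)
    return category, v_url                  # step (8)
```

A site URL that appears twice in the preset list is analyzed only once, because the second occurrence is found in V_URL and skipped.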

■ Beneficial effects

The recognition of webpage categories has broad significance and practical value. Its main applications include:

◆ Vertical search for specific information by particular groups in a professional domain;

◆ Search and mining of the Deep Web;

◆ Analysis of website structure;

◆ Analysis of the focal points of Internet users' interests;

◆ Improving the search efficiency of topic search engines;

◆ Construction of digital libraries.

Description of drawings

Fig. 1 is the overall flowchart of the webpage classification and recognition method based on vertical search and focused crawler technology. It shows each processing step in identifying webpage categories.

Fig. 2 is the flowchart of the webpage content analysis method. It shows each processing step of that method.

Embodiment

The navigation website ingestion engine and the broadband network user behavior analysis system developed according to this method adopt a B/S (browser/server) architecture; the development platform is VS2005 with Oracle 9i. Users can integrate them as needed into existing systems that require website classification. Deployment only requires modifying the configuration file, and the system can run on a single PC or simultaneously on multiple PCs. The system has been verified in our development work. URLs fetched by the system cover 98% of the Alexa Top 100 Chinese websites, 87% of the Alexa Top 500 global websites, and 56% of local specialty websites. Actual operation and testing during development demonstrated the effectiveness of the webpage classification and recognition method based on vertical search and a focused crawler and verified its accuracy.

Claims

1. A webpage classification and recognition method based on vertical search and focused crawler technology, characterized in that the steps of the method are:

(3) If the UV_URL list is empty, go to step (1);

(8) End.

2. A webpage content analysis method, which is the core method of the webpage classification and recognition method based on vertical search and focused crawler technology, characterized in that it accurately identifies the web address categories of a navigation website and the website information under each category by means of vertical search and focused crawler technology, the steps of the method being:

(1) Use the focused crawler to fetch the source file of the webpage;

(5) Add the new URLs to the URL queue;

(8) End.