CN102591992A

CN102591992A - Webpage classification identifying system and method based on vertical search and focused crawler technology

Info

Publication number: CN102591992A
Application number: CN2012100341952A
Authority: CN
Inventors: 曹武龙; 王国圃
Original assignee: SUZHOU YAXINFENG INFORMATION TECHNOLOGY Co Ltd
Current assignee: SUZHOU YAXINFENG INFORMATION TECHNOLOGY Co Ltd
Priority date: 2012-02-15
Filing date: 2012-02-15
Publication date: 2012-07-18

Abstract

The invention discloses a webpage classification and identification system based on vertical search and focused crawler technology. The system comprises an application presentation module, a data acquisition module and a content analysis module. The data acquisition module acquires webpage data through Web protocols and passes the acquired page data to the content analysis module. The content analysis module performs HTML (HyperText Markup Language) parsing on that page data, extracts the hyperlinks in each page and adds them to a URL (uniform resource locator) queue, producing a correspondence table between website categories and URLs. The application presentation module receives the keywords a user enters for a search and returns the matching websites in a specific field and/or their website categories to the user. Operation and testing during development demonstrated the effectiveness of the webpage classification and identification method based on vertical search and focused crawler technology and verified its accuracy.

Description

Webpage classification and identification system and method based on vertical search and focused crawler technology

Technical field

The invention belongs to the technical field of web search engines, and specifically relates to a webpage classification and identification system and method based on vertical search and focused crawler technology.

Background technology

With the continuous expansion of information, people increasingly cannot do without search engines. General-purpose search engines such as Baidu and Google offer people many conveniences, but as user needs diversify and the demands on search result quality keep rising, general-purpose search engines cannot meet users' requirements in some specialized fields. Vertical search arose in response. It is a precise search technology that serves a particular professional domain: it is more specialized and its results are more targeted. By drawing on the domain knowledge of a specific industry topic, it can support queries based on semantic information and thereby satisfy users' specialized search needs.

As vertical search engines grow in popularity, the focused crawler, one of their key technologies, is becoming more and more important. A focused crawler is a program that downloads webpages automatically; according to a defined crawl target, it selectively visits webpages and related links on the World Wide Web to obtain the required information.

Webpage classification and identification with vertical search and focused crawler technology is difficult for the following reasons. First, it is hard for a focused crawler to decide how to choose, from the queue of URLs waiting to be crawled, the webpages most likely to contain topic-relevant information. Second, many current open-source crawler systems cannot extract webpage structure information from the crawled pages in a directed way. Third, the content and structure of the same webpage change frequently, and the revisit strategy of a focused crawler has difficulty adapting to such changes. Conventional open-source focused crawler technology therefore struggles to identify webpages of different categories accurately, and a different approach is needed. The present invention arises from this.

Summary of the invention

The object of the present invention is to provide a webpage classification and identification system based on vertical search and focused crawler technology. It establishes a webpage classification and identification method for navigation websites based on vertical search and focused crawler technology, and designs the corresponding identification model and algorithm. By identifying navigation websites, it obtains the URLs of the different categories listed on them, which facilitates precise website search by users and also makes it possible to report the webpage category to which an unknown URL belongs.

To solve these problems in the prior art, the technical solution provided by the invention is:

A webpage classification and identification system based on vertical search and focused crawler technology, characterized in that the system comprises an application presentation module, a data acquisition module and a content analysis module. The data acquisition module collects webpage data through Web protocols and then passes the collected page data to the content analysis module. The content analysis module performs HTML parsing on the page data collected by the data acquisition module, extracts the hyperlinks in the page, and adds them to a URL queue, obtaining a mapping table between website categories and URLs. The application presentation module accepts keyword searches entered by the user and returns the websites found in the specific field and/or their website categories to the user.

Preferably, the system is arranged between the focused crawler process and the Internet, and the focused crawler process automatically crawls navigation site information from the Internet according to rules.

Another object of the present invention is to provide a webpage classification and identification method that uses the system, characterized in that the method comprises the following steps:

(1) Create a focused crawler process; the focused crawler process reads a preset URL list of Web navigation websites;

(2) The data acquisition module takes from the URL list a website URL whose data is to be collected and fetches the webpage source code for that URL. The content analysis module uses vertical search technology and focused crawler technology to parse the webpage content, obtaining the webpage category information of the website and the website information corresponding to each webpage category, and adds both to the Category list. This is repeated until the whole URL list has been traversed. The Category list stores the identified URLs and the website categories to which they belong.

Preferably, in step (2) of the method, when the URL list is empty, the traversal ends directly.

Preferably, in step (2) of the method, after the data acquisition module takes from the URL list a website URL whose data is to be collected, it first puts that website URL into an unvisited URL list. When the unvisited URL list is not empty, it takes a URL from the unvisited URL list, fetches the webpage source code for that URL, and has the content analysis module parse the source code; the URL is then added to the visited URL list and deleted from the unvisited URL list.

Preferably, in step (2) of the method, when the unvisited URL list is empty, the focused crawler process is notified to read the preset URL list of Web navigation websites.

Preferably, in the method, when the URL taken from the unvisited URL list has already been visited, processing continues with the next URL in the unvisited URL list.

Preferably, the content analysis performed by the content analysis module in step (2) of the method comprises:

A1) The focused crawler process fetches the source file of the webpage at the obtained URL, and then extracts the structured information of the webpage with a regular expression, according to the webpage structure features obtained through pattern learning by the periodic regular expression learner;

A2) Use a regular expression to extract from the structured information of the webpage the new URLs that match the website category information, and add the new URLs to the URL queue;

A3) Take URLs from the URL queue and check in a loop whether each URL satisfies the search strategy of the application presentation module; if it does, add the URL together with its website category to the website category table Category.

Preferably, in step A2) of the method, the regular expression extracts new URLs from the source file according to a breadth-first strategy.

Preferably, in step A1) of the method, when the periodic regular expression learner cannot recognize the webpage structure features, the method proceeds directly to judging whether the URLs in the URL queue satisfy the search strategy of the application presentation module.

Preferably, in step A3) of the method, if the URL does not satisfy the search strategy, the method goes on to judge whether the next URL in the URL queue satisfies the search strategy.

Through research on vertical search and focused crawler technology, the present invention can solve the following problems: 1) using vertical search and a focused crawler to obtain from navigation websites the websites corresponding to the different categories; 2) returning targeted, precise search results for a user's specialized search on a specific industry topic; 3) using vertical search and a focused crawler to obtain the webpage category to which an unknown URL belongs on a category website.

The webpage classification and identification method of the present invention, based on vertical search and focused crawler technology, provides an effective technical framework for identifying the URLs of each category on navigation websites, together with a detailed design of the identification algorithm. The system is divided into three modules, from bottom to top: the data acquisition module, the content analysis module and the application presentation module.

The webpage classification and identification method of the present invention has two key steps: obtaining the webpage source code, and the webpage content analysis method. The webpage content analysis method is the core and has two main parts: extracting the structured information of webpages, and the crawling strategy of the focused crawler. Study of the source code of navigation websites shows that a navigation website basically consists of two kinds of pages: one main directory page and a sub-directory page for each category. The main directory page contains a large number of links pointing to the category sub-pages, and the sub-directory page of each category contains the links of the websites that belong to that category. The sub-directory pages of the categories on the same navigation website are also very similar to one another; that is, these pages share a similar structure, so the structured information of the pages can be summarized by one (or a few) regular expressions obtained through pattern learning. As long as the regular expressions that represent this page structure information are found, they can guide the focused crawler to crawl, as far as possible, only the webpages relevant to the categories. Taking www.hao123.com as an example, to find all URLs of the "amusement and recreation" category, one can write the regular expression href(?:"(?<1>[^"]*)"|(?<1>)), which matches links of the form href="..." in a string, and thereby obtain all URLs of the "amusement and recreation" category.
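As a rough illustration of the link extraction described above, the following Python sketch applies an href-matching regular expression to a hand-written HTML fragment. The fragment, the URLs in it and the function name `extract_hrefs` are hypothetical, and the pattern is a simplified Python analogue (double-quoted attribute values only), not the exact expression used by the invention.

```python
import re

# Simplified analogue of the href pattern: capture the value of href="..."
# (assumption: only double-quoted attribute values are handled).
HREF_RE = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)


def extract_hrefs(html: str) -> list[str]:
    """Return every double-quoted href value found in the HTML text, in order."""
    return HREF_RE.findall(html)


# Hypothetical fragment of a category sub-directory page.
SAMPLE = '''
<ul class="category">
  <li><a href="http://music.example.com/">Music</a></li>
  <li><a HREF = "http://video.example.com/">Video</a></li>
  <li><a name="no-link">Placeholder</a></li>
</ul>
'''

if __name__ == "__main__":
    for url in extract_hrefs(SAMPLE):
        print(url)
```

A regular expression is enough here only because the category pages of one navigation site share a regular structure; for arbitrary HTML a real parser would be the safer choice.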

To adapt to the irregular updates of navigation websites and better extract the webpage structure information of directory pages, the invention provides a periodic learner for URL regular expressions that can keep up with the continual changes of navigation websites. In addition, with reference to the three URL search strategies, the invention proposes a directed breadth-first search strategy based on webpage content features. The basic idea of this search strategy is that, during crawling, the structured information of a webpage is first extracted in a directed way according to the content features of the page, and webpages are then crawled from that structured information with a breadth-first strategy. This reduces the number of pages collected, saves network bandwidth, and improves the efficiency of information search.
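The directed breadth-first idea can be sketched in Python as follows. The in-memory `PAGES` dictionary standing in for the Web, the `<div id="cat">` block marker and all function names are assumptions made for illustration; the invention does not specify them.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML source (stands in for real fetching).
PAGES = {
    "http://nav.example/": (
        '<div id="ads"><a href="http://ads.example/">Ad</a></div>'
        '<div id="cat"><a href="http://nav.example/games">Games</a>'
        '<a href="http://nav.example/news">News</a></div>'
    ),
    "http://nav.example/games": '<div id="cat"><a href="http://g1.example/">G1</a></div>',
    "http://nav.example/news": '<div id="cat"><a href="http://n1.example/">N1</a></div>',
}

BLOCK_RE = re.compile(r'<div id="cat">(.*?)</div>', re.DOTALL)  # assumed block marker
HREF_RE = re.compile(r'href="([^"]*)"')


def directed_bfs(start: str) -> list[str]:
    """Visit pages breadth-first, following only links inside the category block."""
    queue, seen, order = deque([start]), {start}, []
    while queue:
        url = queue.popleft()
        order.append(url)
        html = PAGES.get(url, "")  # pages not in the toy web are treated as empty
        for block in BLOCK_RE.findall(html):
            for link in HREF_RE.findall(block):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return order
```

Because links outside the category block (the advertisement link in this toy data) are never queued, fewer pages are collected than with an undirected breadth-first crawl.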

Compared with prior-art solutions, the advantages of the present invention are:

The URLs crawled by this system achieve a coverage of 98% of the Chinese websites in the ALEXA TOP 100, 87% of the global websites in the ALEXA TOP 500, and 56% of websites with local characteristics. Operation and testing during development demonstrated the effectiveness of the webpage classification and identification method based on vertical search and focused crawler technology and verified its accuracy. The invention has broad significance and application value for webpage classification and identification. It can mainly be applied to vertical search for specific information by specific groups in professional domains, Deep Web search and mining, website structure analysis, analysis of Internet users' hot topics of interest, improving the search efficiency of topical search engines, and digital library construction.

Description of drawings

The invention is further described below with reference to the accompanying drawings and embodiments:

Fig. 1 is the overall flowchart of the webpage classification and identification method based on vertical search and focused crawler technology, showing each processing step in identifying webpage categories.

Fig. 2 is the flowchart of the webpage content analysis method, showing each of its processing steps.

Detailed description of the embodiments

The above solution is further explained below with reference to a specific embodiment. It should be understood that the embodiment is intended to illustrate the invention and not to limit its scope. The implementation conditions used in the embodiment may be further adjusted according to the conditions of the specific manufacturer; implementation conditions not specified are generally those of routine experiments.

Embodiment

The navigation website database ingestion engine and the broadband network user behavior analysis system developed in this embodiment adopt a B/S architecture, and the development platform is VS2005 + Oracle 9i. Users can, as needed, easily integrate it into an existing system that requires website classification. Deployment only requires modifying a configuration file; the system can run on a single PC or simultaneously on multiple PCs.

Each module of this design and the webpage classification and identification method based on vertical search and focused crawler technology are described in detail below. The specific processing procedure of the method is shown in Fig. 1 and is carried out in the following steps:

(1) Read the preset URL list of Web navigation websites and judge whether the URL list is empty; if it is empty, go to step (8);

(2) Take out a website URL and put it into the list of unvisited URLs (the UV_URL list);

(3) If the UV_URL list is empty, go to step (1);

(4) Take a URL from the UV_URL list and judge from the V_URL table whether this URL has been visited; if so, go to step (3);

(5) Fetch the webpage source code for the obtained URL, and use vertical search technology and focused crawler technology to parse the webpage content, obtaining the webpage category information of the website and the website information corresponding to each category;

(6) Add the webpage category information and the website information corresponding to each category to the Category list;

(7) Delete the URL from the UV_URL table, add it to V_URL, and go to step (1);

(8) End.
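The overall flow of steps (1) to (8) can be sketched in Python as below. This is a minimal sketch under stated assumptions: `fetch` and `parse` are hypothetical callables supplied by the caller, the preset list is consumed like a queue, and an already-visited URL is dropped from UV_URL so that the loop terminates.

```python
from collections import deque
from typing import Callable, Iterable


def classify_sites(
    preset_urls: Iterable[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], dict[str, list[str]]],
) -> dict[str, str]:
    """Return the Category table: identified URL -> website category."""
    url_list = deque(preset_urls)      # preset navigation-site URLs, step (1)
    uv_url: deque[str] = deque()       # UV_URL: URLs not yet visited
    v_url: set[str] = set()            # V_URL: URLs already visited
    category: dict[str, str] = {}      # Category: URL -> category

    while url_list or uv_url:
        if url_list:                   # step (2): move one URL into UV_URL
            uv_url.append(url_list.popleft())
        if not uv_url:                 # step (3)
            continue
        url = uv_url.popleft()         # step (4); popping also covers the delete in (7)
        if url in v_url:
            continue                   # assumption: visited URLs are skipped and dropped
        source = fetch(url)            # step (5): get the page source, then parse it
        for cat, sites in parse(source).items():
            for site in sites:         # step (6): record each site's category
                category.setdefault(site, cat)
        v_url.add(url)                 # step (7)
    return category                    # step (8)
```

Passing `fetch` and `parse` in as parameters keeps the loop independent of any particular HTTP client or page format, which also makes it easy to exercise with canned pages.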

The webpage classification and identification method based on vertical search and focused crawler technology requires the following modules: a data acquisition module, a content analysis module and an application presentation module.

Function of the data acquisition module: this module mainly collects webpage data through various Web protocols and then passes the collected pages to the module above it for further processing.

Interface of the data acquisition module: this module is the interface between the focused crawler and the Internet. Its interface to the module above is the webpage source code as string data, which serves as input to the upper layer.

Function of the content analysis module: this is the core module of the whole framework. It mainly performs HTML parsing on the pages collected by the data acquisition module below it, extracts the hyperlinks they contain, and adds them to the URL queue. The URLs given in page links generally take several forms: a URL may be complete, containing the protocol, site and path; it may omit part of that content; or it may be a relative path. The webpage content analysis method is therefore needed to extract the structured information of the webpage, and webpage URLs are extracted from the structured information with a breadth-first strategy, yielding the mapping table Category between website categories and URLs, so as to support the webpage category searches of the application presentation module above it.
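The different link forms mentioned above (complete, partially omitted, relative) can be brought to one absolute form before they enter the URL queue. The following Python sketch uses the standard library for this; the base address and the helper name `normalize_link` are hypothetical.

```python
from urllib.parse import urldefrag, urljoin


def normalize_link(page_url: str, href: str) -> str:
    """Resolve an href found on page_url into an absolute URL without a fragment."""
    absolute = urljoin(page_url, href.strip())
    return urldefrag(absolute)[0]


BASE = "http://nav.example/dir/index.html"  # hypothetical address of the crawled page

if __name__ == "__main__":
    for href in ("http://site.example/a", "//site.example/a", "/top.html",
                 "sub/page.html#x", "../up.html"):
        print(href, "->", normalize_link(BASE, href))
```

Normalizing before queueing also keeps the visited-URL check meaningful, since two spellings of the same address would otherwise be treated as different URLs.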

Interface of the content analysis module: the interface between this module's webpage classification and identification and the application presentation module should be a mapping table, namely the correspondence table between website categories and URLs.

The main method of the content analysis module is the webpage content analysis method, which has two main parts: extracting the structured information of webpages, and the crawling strategy of the focused crawler. Webpage structure information is extracted first, and URLs are then crawled with the directed breadth-first search strategy based on webpage content features.

The specific processing procedure of the webpage content analysis method is shown in Fig. 2 and is carried out in the following steps:

(1) Use the focused crawler to fetch the source file of a webpage;

(2) Judge whether the webpage matches the webpage structure features obtained through pattern learning by the periodic regular expression learner; if not, go to step (6);

(3) Use a regular expression to extract the structured information of the webpage, this structured information being the content block that holds the website category information;

(4) Extract the new URLs that meet the requirements from the structured information block according to a regular expression;

(5) Add the new URLs to the URL queue;

(6) Judge whether the URL queue is empty; if it is empty, go to step (8);

(7) Take out a URL and judge whether it satisfies the search strategy; if it does, add the URL to the website category table Category and go to step (1); otherwise, go to step (6);

(8) End.

Here, UV_URL stores the URLs not yet visited, V_URL stores the URLs already visited, and Category stores the identified URLs and the website categories to which they belong.
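The content analysis steps (1) to (8) shown in Fig. 2 can be sketched in Python as follows. This is only a sketch under assumptions: `fetch`, `block_re` (standing in for the pattern produced by the regular expression learner) and `satisfies_strategy` are hypothetical inputs, a single category name is assigned to every accepted URL, and `max_pages` is an added safety cap that the described procedure does not mention.

```python
import re
from collections import deque
from typing import Callable

HREF_RE = re.compile(r'href="([^"]*)"')


def analyse(
    start_url: str,
    category_name: str,
    fetch: Callable[[str], str],
    block_re: re.Pattern,
    satisfies_strategy: Callable[[str], bool],
    max_pages: int = 100,
) -> dict[str, str]:
    """Follow the Fig. 2 loop and return the Category table (URL -> category)."""
    category: dict[str, str] = {}
    queue: deque[str] = deque()
    seen = {start_url}
    current = start_url
    for _ in range(max_pages):                  # assumption: cap on fetched pages
        source = fetch(current)                 # step (1)
        match = block_re.search(source)         # step (2): learned structure present?
        if match:
            block = match.group(1)              # step (3): category content block
            for link in HREF_RE.findall(block):  # step (4)
                if link not in seen:
                    seen.add(link)
                    queue.append(link)          # step (5)
        current = None
        while queue:                            # step (6)
            candidate = queue.popleft()         # step (7)
            if satisfies_strategy(candidate):
                category[candidate] = category_name
                current = candidate             # back to step (1) with this URL
                break
        if current is None:
            break                               # step (8): queue exhausted
    return category
```

The search strategy is passed in as a predicate so that the same loop can serve different filtering rules without modification.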

Function of the application presentation module: it handles the user's input and the feedback of search results. By entering keywords, the user can precisely search for websites in a specific field; for an unknown URL, the user can also look up the website category to which it belongs.
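The two kinds of query handled by this module can be illustrated with a small Python sketch over an in-memory Category table. The table contents, the function names and the fallback of matching an unknown URL by host name are all assumptions for illustration; the description does not state how an unknown URL is matched.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical Category table as produced by the content analysis module.
CATEGORY = {
    "http://g1.example/": "games",
    "http://g2.example/play": "games",
    "http://n1.example/": "news",
}


def search_by_keyword(keyword: str) -> list[str]:
    """Return the URLs whose category or address contains the keyword."""
    key = keyword.lower()
    return sorted(u for u, c in CATEGORY.items() if key in c.lower() or key in u.lower())


def category_of(url: str) -> Optional[str]:
    """Look up a URL by exact match first, then by host name (assumed fallback)."""
    if url in CATEGORY:
        return CATEGORY[url]
    host = urlsplit(url).netloc.lower()
    for known, cat in CATEGORY.items():
        if urlsplit(known).netloc.lower() == host:
            return cat
    return None
```

In a deployed system the table would live in the database rather than in memory, but the two lookups would keep the same shape.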

The above example only illustrates the technical concept and features of the invention. Its purpose is to enable those familiar with the art to understand the content of the invention and implement it accordingly; it does not limit the scope of protection of the invention. All equivalent changes or modifications made according to the spirit of the invention shall fall within the scope of protection of the invention.

Claims

1. A webpage classification and identification system based on vertical search and focused crawler technology, characterized in that the system comprises an application presentation module, a data acquisition module and a content analysis module; the data acquisition module collects webpage data through Web protocols and then passes the collected page data to the content analysis module; the content analysis module performs HTML parsing on the page data collected by the data acquisition module, extracts the hyperlinks in the page, and adds them to a URL queue, obtaining a mapping table between website categories and URLs; the application presentation module accepts keyword searches entered by the user and returns the websites found in the specific field and/or their website categories to the user.

2. A webpage classification and identification method using the system of claim 1, characterized in that the method comprises the following steps:

3. The method according to claim 2, characterized in that in step (2) of the method, when the URL list is empty, the traversal ends directly.

4. The method according to claim 2, characterized in that in step (2) of the method, after the data acquisition module takes from the URL list a website URL whose data is to be collected, it first puts that website URL into an unvisited URL list; when the unvisited URL list is not empty, it takes a URL from the unvisited URL list, fetches the webpage source code for that URL, and has the content analysis module parse the source code; the URL is then added to the visited URL list and deleted from the unvisited URL list.

5. The method according to claim 4, characterized in that in step (2) of the method, when the unvisited URL list is empty, the focused crawler process is notified to read the preset URL list of Web navigation websites.

6. The method according to claim 4, characterized in that in the method, when the URL taken from the unvisited URL list has already been visited, processing continues with the next URL in the unvisited URL list.

7. The method according to claim 2, characterized in that the content analysis performed by the content analysis module in step (2) of the method comprises:

8. The method according to claim 7, characterized in that in step A2) of the method, the regular expression extracts new URLs from the source file according to a breadth-first strategy.

9. The method according to claim 7, characterized in that in step A1) of the method, when the periodic regular expression learner cannot recognize the webpage structure features, the method proceeds directly to judging whether the URLs in the URL queue satisfy the search strategy of the application presentation module.

10. The method according to claim 7, characterized in that in step A3) of the method, if the URL does not satisfy the search strategy, the method goes on to judge whether the next URL in the URL queue satisfies the search strategy.