CN106649823A

CN106649823A - Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler

Info

Publication number: CN106649823A
Application number: CN201611247621.5A
Authority: CN
Inventors: 掌明; 卢艳宏; 杨瑞; 樊纪山; 王经卓; 宋永献; 孙巧榆; 张金学; 洪露
Original assignee: Huaihai Institute of Technology
Current assignee: Huaihai Institute of Technology
Priority date: 2016-12-29
Filing date: 2016-12-29
Publication date: 2017-05-10

Abstract

The invention discloses a web page classification and recognition method based on comprehensive subject-term vertical search and a focused crawler, and belongs to the technical field of web search engines. The method addresses web page classification and recognition in a subject-term vertical search engine whose pages change dynamically, and chiefly studies how to judge whether a dynamically changing web page is related to a subject term. By computing the subject-term correlation degree of a page, URLs highly related to the comprehensive subject terms are screened out and placed in the queue to be crawled; the classification information of each page is obtained through vertical search and focused crawler technologies; and a web page classification and recognition model and algorithm are designed. By recognizing dynamically changing pages, the method obtains URLs of different classifications, provides users with accurate web page search, and can also report the web page classification to which an unknown URL belongs. The method is broadly applicable and of high practical value for classifying and recognizing dynamic web pages.

Description

Web page classification and recognition method based on comprehensive descriptor (subject term) vertical search and a focused crawler

Technical field

The present invention relates to the technical field of web search engines, and in particular to a web page classification and recognition method based on comprehensive descriptor vertical search and a focused crawler.

Background technology

With the growing popularity of vertical search engines, the focused crawler, a key technology of vertical search, is becoming increasingly important. A focused crawler is a program that downloads web pages automatically: guided by a predefined crawl target, it selectively visits pages and related links on the World Wide Web to obtain the required information. The main object a crawler processes is the URL: it retrieves the required file content from the URL address and then processes it further.

With the rapid growth of the Internet, the amount of information on the network is growing explosively, and people are particularly concerned with how to obtain useful information from this mass of data. General-purpose search engines offer many conveniences but cannot meet demands for personalization, diversity, and precision, so vertical search has attracted wide attention. Vertical search looks for information on a specific industry or topic and is therefore more targeted and purposeful; it offers semantic information queries through descriptors and can meet the particular needs of particular users; and it is more specialized, returns more targeted results, and can cover the data of a specific industry or topic with few server resources. As the core component of vertical search, the focused crawler visits related pages and links on the Internet according to the specified descriptors and captures the information needed.

The basic web page classification and recognition method using vertical search and a focused crawler comprises the following steps:

(1) input the comprehensive descriptors to be queried;

(2) create a crawler;

(3) read the URL list of the preset web navigation sites;

(4) judge whether the URL list is empty; if it is empty, go to step (8);

(5) take out one website URL and put it into the list of unvisited URLs (the UVURL list);

(6) judge whether the UVURL list is empty; if it is empty, go to step (3);

(7) take a URL out of the UVURL list and judge, according to the table of visited URLs (VURL), whether this URL has been visited; if so, go to step (6);

(8) fetch the page source for the URL obtained, parse the page content using vertical search and focused crawler technologies, and obtain the web page category information under this site and the corresponding website information in each category;

(9) add the web page category information and the corresponding website information in each category to the Category list;

(10) delete the URL from the UVURL table, add it to VURL, and go to step (6);

(11) end.

This method has certain difficulties, for the following reasons: it is hard for a focused crawler to select, from the queue of URLs to be crawled, the URLs closely related to the topic information; when extracting URLs with depth-first or breadth-first search strategies, a web crawler is prone to the "curse of dimensionality"; many existing open-source crawler systems are weak at obtaining structured information from crawled pages; and existing focused crawler strategies adapt poorly to dynamic changes in page content and structure. In summary, traditional focused crawler technology has a low recognition rate for pages of different categories, and a different approach is needed.

Summary of the invention

1. Technical problem to be solved

The technical problem to be solved by the present invention is to provide a web page classification and recognition method based on comprehensive descriptor vertical search and a focused crawler. Through vertical search and focused crawler techniques based on comprehensive descriptors, the following problems can be better solved:

(1) The queue of URLs to be crawled is built using hyperlink value and comprehensive descriptor correlation value.

(2) Targeted, precise search results can be obtained from a dedicated search on the user's specific comprehensive descriptors.

(3) The web page category to which an unknown URL belongs is obtained through comprehensive descriptor vertical search and the focused crawler.

2. Technical solution

To solve the above problems, the present invention adopts the following technical solution:

Observation and analysis of websites reveal the following regularity: a website essentially consists of catalog pages and content pages. A catalog page contains many links to various content pages, while a content page contains the website links belonging to that page's content. Pages of the same category are strongly similar, that is, they share a similar structure, so the structured information of a page can be obtained with regular expressions. To adapt to irregular changes in page content and to better extract the page structure information that characterizes a page, a URL regular-expression learner is introduced, which adapts to dynamic page changes and solves the descriptor island problem. Both the URL regular expression of descriptor-related content pages and the regular expression of descriptor-related catalog pages must be obtained, and only URLs matching these two classes of regular expressions are crawled. The present invention also proposes a directed depth-first search strategy based on comprehensive descriptors.
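The filtering rule described above, crawling only URLs that match the content-page or catalog-page pattern, might be sketched as follows. This is only an illustration: the two patterns, the `example.com` URLs, and the function names `classify_url` and `filter_urls` are assumptions made for the example, whereas in the described method the patterns would come from the URL regular-expression learner.

```python
import re
from typing import List, Optional

# Hypothetical patterns for one site; the method described in the text
# would obtain these from the URL regular-expression learner.
CONTENT_PAGE_PATTERN = re.compile(r"^https?://example\.com/article/\d+\.html$")
CATALOG_PAGE_PATTERN = re.compile(r"^https?://example\.com/category/[a-z]+/?$")


def classify_url(url: str) -> Optional[str]:
    """Return 'content', 'catalog', or None when neither pattern matches."""
    if CONTENT_PAGE_PATTERN.match(url):
        return "content"
    if CATALOG_PAGE_PATTERN.match(url):
        return "catalog"
    return None


def filter_urls(urls: List[str]) -> List[str]:
    """Keep only the URLs that match one of the two pattern classes."""
    return [u for u in urls if classify_url(u) is not None]
```

Calling `filter_urls` on a list of extracted links would drop pages such as an "about" page that fit neither structure, which is how the crawl stays restricted to descriptor-related catalog and content pages.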

A web page classification and recognition method based on comprehensive descriptor vertical search and a focused crawler comprises the following steps:

(1) input the comprehensive descriptors to be queried;

(2) create a crawler;

(3) call the page content parsing algorithm;

(4) read the address search table Search;

(5) judge whether the address search table Search is empty; if it is empty, go to step (15);

(6) take out the first URL in the Search table and put it into the UVURL list;

(7) delete the first URL from the Search table;

(8) judge whether the UVURL list is empty; if it is empty, go to step (4);

(9) if the UVURL list is not empty, take a URL out of the UVURL list;

(10) judge, according to the VURL table, whether this URL has been visited; if so, go to step (8);

(11) if the URL has not been visited, fetch the page source corresponding to the URL;

(12) parse the page content using distributed vertical search and focused crawler technologies, and obtain the web page category information of the URL and the corresponding website information;

(13) add the web page category information and the corresponding website information to the Category list;

(14) delete the URL from the UVURL table, add it to VURL, and go to step (8);

(15) end.
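The control flow of steps (4) to (15) can be sketched as a small loop over the four tables. This is a minimal sketch under stated assumptions, not the patented implementation: `fetch` and `parse` are hypothetical callables standing in for source retrieval and for the vertical-search parsing of step (12), and the tables are modelled with ordinary Python containers.

```python
from collections import deque


def crawl(search_table, fetch, parse):
    """Sketch of steps (4)-(15) using the Search, UVURL, VURL and Category tables.

    search_table: iterable of URLs (the address search table Search).
    fetch(url) -> str: hypothetical callable returning the page source.
    parse(source) -> (category, site_info): hypothetical callable for step (12).
    """
    search = deque(search_table)
    uvurl = deque()   # URLs not yet visited
    vurl = set()      # URLs already visited
    category = []     # (url, category, site_info) records

    while search or uvurl:
        if not uvurl:
            # Steps (6)-(7): move the first Search entry into UVURL.
            uvurl.append(search.popleft())
        # Step (9); popping also removes the URL from UVURL, as in step (14).
        url = uvurl.popleft()
        if url in vurl:                       # step (10)
            continue
        source = fetch(url)                   # step (11)
        cat, info = parse(source)             # step (12)
        category.append((url, cat, info))     # step (13)
        vurl.add(url)                         # step (14)
    return category                           # step (15)
```

Keeping VURL as a set makes the visited check of step (10) a constant-time lookup, which matters once the Search table contains many repeated links.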

Further, the page content parsing algorithm of step (3) is as follows: by calculating the descriptor degree of association, the N pages with the highest degree of association with the comprehensive descriptors are obtained, and the page categories and the corresponding website information are accurately identified by vertical search and the focused crawler. The specific steps are:

1) obtain the source file of the web page using focused crawler technology;

2) judge whether the web page simultaneously matches the regular expression of content pages and the regular expression of catalog pages related to the comprehensive descriptors, as obtained by the timed URL regular-expression learner; if it does not match, go to step 9);

3) extract the structured information of the web page using regular expressions;

4) call the comprehensive descriptor degree-of-association calculation method to obtain the comprehensive descriptor degree-of-association value of the page;

5) read the comprehensive descriptor degree of association R of the page and judge whether it is greater than the set threshold α; if not, discard the page and go to step 1);

6) if R is greater than the set threshold α, insert the page's R value into the association table Relevance;

7) extract new links from the structured information of the page using regular expressions;

8) fill the new links into the corresponding Relevance table, sorted in descending order of Relevance value;

9) judge whether the Relevance table is empty; if it is empty, go to step 13);

10) take out the first URL of the Relevance table and judge whether this URL satisfies the search strategy; if not, go to step 9);

11) add the URL that satisfies the search strategy to the address search table Search, and delete the first URL from the Relevance table;

12) go to step 1);

13) end.
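One way to read steps 1) to 13) is as a relevance-ordered crawl that feeds the Search table. The sketch below makes several assumptions that the text does not state: links inherit the R value of the page they were found on, the URL chosen in steps 10) and 11) becomes the next page fetched, a URL that fails the search strategy is dropped from Relevance, and `max_pages` is an added safety bound. All callables passed in are hypothetical placeholders.

```python
def build_search_table(start_url, fetch, matches_patterns, extract_links,
                       relevance, satisfies_strategy, alpha, max_pages=100):
    """Return the Search table (a list of URLs) built from one start URL."""
    relevance_table = []   # (R, url) pairs kept in descending order of R
    search = []
    seen = {start_url}
    url = start_url
    for _ in range(max_pages):
        source = fetch(url)                                # step 1
        if matches_patterns(url):                          # step 2
            r = relevance(source)                          # steps 3-4
            if r > alpha:                                  # steps 5-6
                for link in extract_links(source):         # step 7
                    if link not in seen:
                        seen.add(link)
                        relevance_table.append((r, link))  # step 8
                # Stable sort: ties keep their insertion order.
                relevance_table.sort(key=lambda t: t[0], reverse=True)
        # Steps 9-11: take the best remaining URL that meets the strategy.
        url = None
        while relevance_table:
            _, candidate = relevance_table.pop(0)
            if satisfies_strategy(candidate):
                search.append(candidate)
                url = candidate
                break
        if url is None:                                    # step 13
            break
    return search
```

Because the Relevance table is re-sorted after every page, links found on highly relevant pages are always followed before links found on weaker ones, which is the behaviour the descending sort in step 8) appears intended to produce.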

Further, the comprehensive descriptor degree-of-association calculation method of step 4) is as follows: the different weights of the comprehensive descriptors reflect how closely each descriptor relates to the page being searched; a page feature-item library is built from word frequencies; and each feature item is given a different weight according to its position in the page, from which the degree of association between the page and the comprehensive descriptors is obtained. The specific steps are:

1. build the comprehensive weight vector q = (q_1, q_2, ..., q_M) of the M descriptors, where q_i is the weight of the i-th descriptor in the query expression;

2. obtain the page from which feature items are to be extracted;

3. extract word stems from the page: filter the word segments extracted from the text, removing words that are abstract or irrelevant to retrieval, and strip irrelevant prefixes and suffixes;

4. calculate the word frequency of each extracted word;

5. filter out feature items whose word frequency is below the set threshold T, and choose n feature items to form the page feature-item library (if the page has more than n feature items with word frequency greater than T, choose n of them in descending order of word frequency; if it has fewer than n, the word frequencies of the missing feature items are all set to 0), denoted p = (p_1, p_2, ..., p_n);

6. if a feature item in the library is located in the <title> tag, set r = 5.0; if it is in <meta>, set r = 3.0; if it is in <a>, set r = 2.0; otherwise set r = 1.0. These form the feature-item weight vector r = (r_1, r_2, ..., r_n);

7. for each of the M descriptors in turn, look up its corresponding p_i in the page feature-item library; if it is not found in the library, record 0. The resulting vector is p' = (p'_1, p'_2, ..., p'_n);

8. calculate the comprehensive descriptor degree of association R of the page, using the following formula:

R = \sum_{i=1}^{M} P_i \cdot p'_i \cdot r_i

9. end.
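The calculation in steps 1 to 8 can be illustrated with a short sketch. Several points here are assumptions rather than statements from the text: the factor written P_i in the formula is read as the descriptor weight q_i, r_i is read as the position weight of the feature item that matches descriptor i, a word appearing in several tags takes its highest position weight, and features with frequency equal to T are kept. The input format (a list of `(word, tag)` pairs produced after stemming and filtering) and the function names are likewise hypothetical.

```python
from collections import Counter

# Position weights from step 6; any other position gets 1.0.
TAG_WEIGHTS = {"title": 5.0, "meta": 3.0, "a": 2.0}


def build_feature_library(terms, threshold_t, n):
    """Build the page feature-item library (steps 4-6).

    terms: list of (word, tag) pairs, assumed already stemmed and filtered.
    Returns {word: (frequency, position_weight)} with at most n entries.
    """
    freq = Counter(word for word, _ in terms)
    weight = {}
    for word, tag in terms:
        # Assumption: a word seen in several tags keeps its highest weight.
        weight[word] = max(weight.get(word, 1.0), TAG_WEIGHTS.get(tag, 1.0))
    # most_common() yields words in descending order of frequency.
    kept = [(w, f) for w, f in freq.most_common() if f >= threshold_t]
    return {w: (f, weight[w]) for w, f in kept[:n]}


def association_degree(descriptor_weights, library):
    """Compute R as the sum over descriptors of q_i * p'_i * r_i (steps 7-8)."""
    total = 0.0
    for word, q in descriptor_weights.items():
        # p'_i is 0 when the descriptor is absent from the library (step 7).
        p, r = library.get(word, (0, 1.0))
        total += q * p * r
    return total
```

For instance, a descriptor with weight 0.6 that occurs three times on the page, including once in the title, would contribute 0.6 × 3 × 5.0 to R under these assumptions, while a descriptor absent from the library contributes nothing.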

3. Beneficial effects

While capturing web page features, the present invention first builds a search table according to the degree of association between each page and the comprehensive descriptors and extracts the structured information of pages in a directed manner; it then uses a depth-first strategy to crawl, from the structured information, the pages closely related to the descriptors; finally, the URLs and category information of pages with a high descriptor degree of association are placed in the Category table. The method effectively reduces the number of pages collected, saving network bandwidth and improving the efficiency of information search.

The main purpose of the present invention is to establish, for dynamically changing web pages, a web page classification and recognition method based on comprehensive descriptor vertical search and focused crawler technology, and to provide the recognition model and related algorithms. By recognizing dynamically changing pages, URLs of different classifications are obtained, users are given accurate web page search, and the web page classification to which an unknown URL belongs can also be provided.

The present invention is broadly applicable and of high practical value for the classification and recognition of dynamic web pages. Its main applications include: vertical search by professionals for specific information in a professional field; deep search and mining; effective retrieval and use of hidden web resources; web page analysis; improving the efficiency of multi-descriptor search; and building digital libraries.

Description of the drawings

Fig. 1 is a flow chart of the web page classification and recognition method based on comprehensive descriptor vertical search and a focused crawler, in which the UVURL table stores unvisited URLs, the VURL table stores visited URLs, and Category stores identified URLs;

Fig. 2 is a flow chart of the page content parsing method;

Fig. 3 is a flow chart of the method for calculating the degree of association between a page and the descriptors.

Specific embodiment

The present invention is described in further detail below with reference to the accompanying drawings and an embodiment.

Embodiment

The present invention proposes a technical architecture that can effectively identify the various kinds of URLs in dynamic web pages, and gives detailed algorithms. The system is divided into three layers, listed from top to bottom as: the acquisition layer, the parsing layer, and the presentation layer.

1. Web page data acquisition layer

Function: the main function of this layer is to collect dynamic web page data and hand it to the layer above for content parsing.

Interface: this layer is the interface between the focused crawler and the network, and supplies the layer above with the page source as string input data.

2. Web page content parsing layer

Function: this layer is the core of the whole design. It parses the content of the pages collected by the acquisition layer, obtains valid hyperlinks according to descriptor association weights, and builds the ordered table of URLs to be crawled. Because the URL formats of links in descriptor-related pages are diverse, the page content parsing algorithm is used to obtain the structured information of pages and build the related descriptor lexicon; the distributed vertical search method obtains the URLs of the pages to be crawled; and the resulting table Category, which maps comprehensive descriptor degrees of association to URLs, serves the web page classification searches of the layer above.

Interface: the interface between this layer's descriptor-association page recognition and the layer above is a mapping table, namely the table that maps comprehensive descriptor degrees of association to URLs.

Main method of this layer: the page content parsing algorithm, which has three main parts: obtaining the structured information of dynamic web pages; calculating the degree of association between a page and the descriptors; and building the relation table of URLs to be crawled together with the focused crawler's specific crawl strategy.

Method for calculating the degree of association between a page and the comprehensive descriptors. The flow is shown in Fig. 3:

1. build the comprehensive weight vector q = (q_1, q_2, ..., q_M) of the M descriptors, where q_i is the weight of the i-th descriptor in the query expression;

2. obtain the page from which feature items are to be extracted;

4. calculate the word frequency of each extracted word;

9. end.

Page content parsing algorithm. The algorithm flow is shown in Fig. 2:

1) obtain the source file of the web page using focused crawler technology;

3) extract the structured information of the web page using regular expressions;

12) go to step 1);

13) end.

3. Application presentation layer for web page classification and recognition

Function: accepts the user's descriptor input and returns the search results. By entering multiple descriptors, the user can accurately find the websites within a particular scope; the layer can also tell the user the website classification to which an unknown URL belongs.

Web page classification and recognition method based on comprehensive descriptor vertical search and focused crawler technology. The method flow chart is shown in Fig. 1:

(1) input the comprehensive descriptors to be queried;

(2) create a crawler;

(3) call the page content parsing algorithm;

(4) read the address search table Search;

(6) take out the first URL in the Search table and put it into the UVURL list;

(7) delete the first URL from the Search table;

(8) judge whether the UVURL list is empty; if it is empty, go to step (4);

(9) if the UVURL list is not empty, take a URL out of the UVURL list;

(14) delete the URL from the UVURL table, add it to VURL, and go to step (8);

(15) end.

The present invention studies web page recognition in a descriptor-based distributed vertical search engine whose pages change dynamically. It mainly studies how to judge whether a dynamically changing page is related to the descriptors: by calculating the descriptor degree of association of a page, URLs with a high degree of association with the comprehensive descriptors are screened out and placed in the queue to be crawled, the classification information of pages is obtained using vertical search and focused crawler technologies, and a web page classification and recognition model and algorithm are designed.

Specifically, while capturing web page features, the present invention first builds a search table according to the degree of association between each page and the comprehensive descriptors and extracts the structured information of pages in a directed manner; it then uses a depth-first strategy to crawl, from the structured information, the pages closely related to the descriptors; finally, the URLs and category information of pages with a high descriptor degree of association are placed in the Category table. The method effectively reduces the number of pages collected, saving network bandwidth and improving the efficiency of information search.

The search and network user behavior analysis system based on comprehensive descriptors adopts a B/S (browser/server) architecture and uses vs2005 + oracle 9i as its development environment; users can easily plug it into an existing system that needs website classification. It can run on one or more PCs simply by modifying a configuration file. The system was verified at the sharp wound communication Co., Ltd in Suzhou. On the Chinese-site ALEXA TOP 100, the system obtains URLs with a high degree of association with the comprehensive descriptors with a success rate of 97%; on the global-site ALEXA TOP 500 it reaches a coverage rate of 87%; and on some featured sites the proportion of URLs obtained with a high descriptor degree of association reaches 53%. Operation and testing at the sharp wound communication Co., Ltd in Suzhou demonstrated the accuracy of the method.

Those of ordinary skill in the art should understand that the above embodiment is intended only to illustrate the present invention and not to limit it; any change or modification to the embodiment described above that stays within the spirit of the present invention falls within the scope of the claims of the present invention.

Claims

1. A web page classification and recognition method based on comprehensive descriptor vertical search and a focused crawler, characterized in that, after a crawler is created, the address search table Search is obtained by a page content parsing algorithm, with the following specific steps:

(1) obtain the source file of the web page using focused crawler technology;

(2) judge whether the web page simultaneously matches the structural features of related content pages and catalog pages; if it does not match, go to step (9);

(3) extract the structured information of the web page using regular expressions;

(4) call the comprehensive descriptor degree-of-association calculation method to obtain the comprehensive descriptor degree-of-association value of the page, the specific steps of the calculation method being:

1. build the comprehensive weight vector q = (q_1, q_2, ..., q_M) of the M descriptors, where q_i is the weight of the i-th descriptor in the query expression;

2. obtain the page from which feature items are to be extracted;

3. extract word stems from the page: filter the word segments extracted from the text, removing words that are abstract or irrelevant to retrieval, and strip irrelevant prefixes and suffixes;

4. calculate the word frequency of each extracted word;

5. filter out feature items whose word frequency is below the set threshold T, and choose n feature items to form the page feature-item library, denoted p = (p_1, p_2, ..., p_n);

6. if a feature item in the library is located in the <title> tag, set r = 5.0; if it is in <meta>, set r = 3.0; if it is in <a>, set r = 2.0; otherwise set r = 1.0. These form the feature-item weight vector r = (r_1, r_2, ..., r_n);

7. for each of the M descriptors in turn, look up its corresponding p_i in the page feature-item library; if it is not found in the library, record 0. The resulting vector is p' = (p'_1, p'_2, ..., p'_n);

R = \sum_{i=1}^{M} P_i \cdot p'_i \cdot r_i

(5) read the comprehensive descriptor degree of association R of the page and judge whether it is greater than the set threshold α; if not, discard the page and go to step (1);

(6) if R is greater than the set threshold α, insert the page's R value into the association table Relevance;

(7) extract new links from the structured information of the page using regular expressions;

(8) fill the new links into the corresponding Relevance table, sorted in descending order of Relevance value;

(9) judge whether the Relevance table is empty; if it is empty, go to step (13);

(10) take out the first URL in the Relevance table and judge whether this URL satisfies the search strategy; if not, go to step (9);

(11) add the URL that satisfies the search strategy to the address search table Search, and delete the first URL from the Relevance table;

(12) go to step (1);

(13) end;

After the address search table Search is obtained, it is read, and the URLs and category information of pages with a high descriptor degree of association are then obtained.

2. The web page classification and recognition method based on comprehensive descriptor vertical search and a focused crawler according to claim 1, characterized in that a URL regular-expression learner is introduced in step (2) to obtain the URL regular expression of descriptor-related content pages and the regular expression of descriptor-related catalog pages, and the regular expressions are used to verify whether the web page matches the structural features of related content pages and catalog pages.

3. The web page classification and recognition method based on comprehensive descriptor vertical search and a focused crawler according to claim 1, characterized in that, in sub-step 5 of step (4), when n feature items are chosen to form the page feature-item library: if the page has more than n feature items with word frequency greater than T, n feature items are chosen in descending order of word frequency; if it has fewer than n, the word frequencies of the missing feature items are all 0.