CN103309862B - Webpage type recognition method and system

Info

Publication number: CN103309862B
Application number: CN201210058024.3A
Authority: CN
Inventors: 蔡兵; 彭默; 徐羽
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2012-03-07
Filing date: 2012-03-07
Publication date: 2017-05-17
Anticipated expiration: 2032-03-07
Also published as: CN103309862A

Abstract

Embodiments of the invention provide a webpage type identification method and system. The method comprises: calculating a content type propensity value of a webpage according to the text content of the webpage; extracting webpage structure features of the webpage; and identifying the type of the webpage using the content type propensity value and the webpage structure features. Because the webpage is classified by jointly considering the text content dimension and the webpage structure dimension, classification accuracy is higher. Moreover, data filtering effectively removes noise in the webpage that is unrelated to the type being identified, such as tags, links, and advertisements, which further improves the classification result.

Description

Webpage type identification method and system

Technical field

Embodiments of the present invention relate to the technical field of Internet applications, and more particularly to a webpage type identification method and system.

Background art

With the rapid development of computer and network technology, the Internet plays an increasingly important role in daily life, study, and work. According to the latest Internet development survey report published by CNNIC, the number of Internet users in China has reached 513 million, there were 60 billion Chinese webpages in 2010, and there are at least 1 trillion webpages worldwide.

The webpages on the Internet contain a vast and complex body of information, and classifying them accurately to support downstream work is a serious challenge. For example, in online advertising, showing advertisements related to the type of the webpage greatly increases the user click-through rate. In addition, with the development of the mobile Internet over the past two years, demand for mobile reading has surged, and news is undoubtedly one of the types users care about most. If news webpages can be identified, cleaner data can be provided to mobile reading applications, which also helps with page extraction.

At present, the prior art generally identifies text content using a naive Bayes text classification method, which mainly involves labeling training samples, using the words of the text as features, and estimating the category of the text by statistical methods.

First, the prior art classifies mainly according to webpage content, and classification based on webpage content alone is not very accurate. Second, compared with webpages on the Internet, the data sources used for text classification are too simple to be practical.

Summary of the invention

Embodiments of the present invention propose a webpage type identification method to improve webpage classification accuracy.

Embodiments of the present invention also propose a webpage type identification system to improve webpage classification accuracy.

The specific scheme of the embodiments of the present invention is as follows:

A webpage type identification method, the method comprising:

calculating a content type propensity value of a webpage according to the text content of the webpage;

extracting webpage structure features of the webpage;

identifying the type of the webpage using the content type propensity value and the webpage structure features.

A webpage type identification system, the system comprising a content type propensity value calculation unit, a structure feature extraction unit, and a type identification unit, wherein:

the content type propensity value calculation unit is configured to calculate the content type propensity value of a webpage according to the text content of the webpage;

the structure feature extraction unit is configured to extract the webpage structure features of the webpage;

the type identification unit is configured to identify the type of the webpage using the content type propensity value and the webpage structure features.

As can be seen from the above technical scheme, in embodiments of the present invention the content type propensity value of a webpage is calculated according to the text content of the webpage, the webpage structure features of the webpage are extracted, and the type of the webpage is then identified using the content type propensity value and the webpage structure features. The webpage is thus first classified along two dimensions, one based on text content and the other based on webpage structure, and the category of the webpage is finally determined by combining the classification results of the two dimensions. Because embodiments of the present invention consider both the text content dimension and the webpage structure dimension when classifying a webpage, classification accuracy is higher.

Description of the drawings

Fig. 1 is a flowchart of the webpage type identification method according to an embodiment of the present invention;

Fig. 2 is an exemplary flowchart of the webpage type identification method according to an embodiment of the present invention;

Fig. 3 is a structure diagram of the webpage type identification system according to an embodiment of the present invention.

Detailed description

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.

In embodiments of the present invention, a webpage is classified along two dimensions. One is based on text content, and the other is based on webpage structure. The category of the webpage is then determined by combining the classification results of the two dimensions.

Fig. 1 is a flowchart of the webpage type identification method according to an embodiment of the present invention.

As shown in Fig. 1, the method comprises:

Step 101: Calculate the content type propensity value of the webpage according to the text content of the webpage.

This step performs a preliminary classification of the webpage type along the text content dimension. Classification by text content mainly uses a statistical machine learning classification algorithm: from training samples and features, it computes the probability that a page belongs to a particular type (such as the news type).

Specifically, the text content of the webpage can first be segmented into words using a dictionary, and the weights of the resulting word features can be calculated to form a feature vector. A preset webpage content classifier then calculates the content type propensity value of this feature vector, and the calculated value can be taken as the probability that the webpage belongs to the type represented by that classifier.
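The dictionary-based word segmentation mentioned above can be sketched as follows. The text does not name a segmentation algorithm, so forward maximum matching is used here purely as an assumption; the function name `segment`, the parameter `max_word_len`, and the toy dictionary are all hypothetical.

```python
def segment(text, dictionary, max_word_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary word that starts there; otherwise emit one character."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: a single character
        # Try the longest candidate first, down to length 2.
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical three-word dictionary for illustration only.
dictionary = {"新闻", "网页", "分类"}
print(segment("新闻网页分类", dictionary))
```

A real system would likely use a much larger dictionary and a more robust segmenter; the sketch only shows how a dictionary turns raw text into word features.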

In addition to text, a webpage usually contains a great deal of irrelevant content. Experiments show that using only the complete sentences in a webpage as the classification data source effectively removes noise such as tags, links, and advertisements and improves classification. Therefore, in one embodiment, before the text content of the webpage is segmented using the dictionary, sentences whose whole-sentence length is less than a predetermined value can be filtered out of the text content to improve classification.
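The sentence-length filter described above might look like the following minimal sketch. The splitting rule and the default threshold of 20 characters are assumptions chosen for illustration; the text only says "a predetermined value".

```python
import re


def filter_short_sentences(text, min_length=20):
    """Split text into sentences and keep those at least min_length long.

    Sentences are split on Chinese and Western terminators and on line
    breaks; the terminators themselves are dropped.
    """
    parts = re.split(r"[。！？!?\n]+|\.(?:\s+|$)", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_length]


page_text = (
    "Home | Login\n"
    "The city council approved the new transit budget on Monday. Share\n"
    "Officials said construction will begin early next year."
)
for sentence in filter_short_sentences(page_text):
    print(sentence)
```

Short fragments such as navigation labels and share buttons fall below the threshold and are discarded, while the article sentences survive.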

Furthermore, to reduce the cost of manual data labeling, various websites (for example, some news websites) can be used as entry points for crawling data, and after a simple manual review a large amount (for example, thousands of items) of news data is obtained. Words are then used as classification features, and feature selection or similar algorithms are applied to reduce dimensionality.
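The text does not say which feature selection algorithm is used. As one possible illustration, the sketch below ranks words by the chi-square statistic against a binary label (news or not news) and keeps the top `k`. The choice of chi-square, the function names, and the toy data are all assumptions.

```python
def chi_square(term, docs, labels):
    """Chi-square statistic of a term against a binary label.

    docs is a list of token sets; labels is a parallel list of booleans.
    """
    a = b = c = d = 0
    for tokens, is_positive in zip(docs, labels):
        present = term in tokens
        if present and is_positive:
            a += 1
        elif present:
            b += 1
        elif is_positive:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score (ties by name)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-chi_square(t, docs, labels), t))
    return ranked[:k]


# Toy labeled pages: True means news.
docs = [{"news", "report"}, {"news", "minister"}, {"forum", "reply"}, {"forum", "news"}]
labels = [True, True, False, False]
print(select_features(docs, labels, 2))
```

Reducing the vocabulary this way shrinks the feature vector that the classifier has to process.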

In another embodiment, the classifier can use the logistic regression (Logistic Regression) classification algorithm to calculate the content type propensity value of the feature vector. Logistic regression is a linear classifier; it is fast to compute and well suited to real-time classification scenarios.
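To make the logistic regression step concrete, the following sketch scores a sparse feature vector with a linear model and a sigmoid. The weights, bias, and feature names are hypothetical stand-ins for parameters that would be learned from labeled training pages; scaling the probability by 100 is one assumed way to express it as a score.

```python
import math


def propensity(features, weights, bias=0.0):
    """Return sigmoid(bias + w . x) for a sparse feature vector (dicts)."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical weights, as if learned beforehand from labeled pages.
weights = {"reporter": 1.8, "announced": 1.2, "checkout": -2.0}
page = {"reporter": 0.4, "announced": 0.3}
score = 100 * propensity(page, weights, bias=-0.5)
print(round(score, 1))
```

Scoring is a single dot product followed by a sigmoid, which is why this kind of classifier suits real-time use.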

In one embodiment, the weights of the word features can be calculated using the term frequency-inverse document frequency (TF-IDF) weighting algorithm.

TF-IDF is a common weighting technique in information retrieval and information mining for assessing how important a word is to one document in a document set or corpus. Under TF-IDF, the importance of a word increases in proportion to the number of times it appears in the document and decreases as its frequency across the corpus increases.
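A minimal TF-IDF computation is sketched below. Several TF-IDF variants exist and the text does not specify one; this sketch assumes a smoothed inverse document frequency, log((1 + N) / (1 + df)) + 1, so that terms absent from the corpus do not cause a division by zero. The function name and toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Return {term: weight} for one tokenized document.

    corpus is a list of tokenized documents used for document frequencies.
    """
    if not doc_tokens:
        return {}
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        tf = count / len(doc_tokens)
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed variant
        weights[term] = tf * idf
    return weights


corpus = [
    ["news", "report", "city"],
    ["news", "forum", "post"],
    ["news", "shop", "price"],
]
print(tf_idf(corpus[0], corpus))
```

In the toy corpus, "news" appears in every document and therefore receives a lower weight than "report", which appears in only one.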

Various forms of TF-IDF weighting are often used by search engines as a measure or rating of the relevance between a document and a user query. Besides TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.

Step 102: Extract the webpage structure features of the webpage.

This step performs a preliminary classification of the webpage type along the webpage structure dimension. Specifically, a Document Object Model (DOM) tree can first be built for the webpage, and webpage structure features are then extracted by traversing the DOM tree to serve as the basis for structural classification.

According to the W3C DOM specification, the DOM is an interface independent of browser, platform, and language that gives users access to the other standard components of a page. The DOM resolved the conflict between Netscape's JavaScript and Microsoft's JScript and gave web designers and developers a standard way to access the data, scripts, and presentation-layer objects of their sites. A DOM is a collection of nodes, or pieces of information, organized in a hierarchy. This hierarchy lets developers navigate the tree to find specific information. Analyzing the structure usually requires loading the whole document and building the hierarchy before any work can be done. Because it is based on a hierarchy of information, the DOM is regarded as tree-based or object-based.
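As a rough illustration of building and traversing such a tree, the sketch below uses Python's standard `html.parser` module to construct a simplified DOM-like structure. The `Node`, `DomBuilder`, `build_dom`, and `walk` names are hypothetical, and the tree is a simplification rather than a W3C-conformant DOM.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=None, parent=None, text=""):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.text = text
        self.children = []


class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag (tolerates sloppy HTML).
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(
                Node("#text", parent=self.current, text=data.strip()))


def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node):
    """Depth-first traversal of the tree."""
    yield node
    for child in node.children:
        yield from walk(child)


root = build_dom('<html><body><div id="main"><p>Hello</p><a href="/x">link</a></div></body></html>')
print([n.tag for n in walk(root)])
```

Once the whole document is loaded into the tree, structure features can be collected in a single traversal.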

For example, the webpage structure features extracted by traversing the DOM tree can include:

1) URL features. For example, if the URL ends with index.html or similar, the page can basically be judged to be an index page. If the URL contains "content" or a date, the page is more likely to be a content page.

2) Text-to-link ratio. The ratio of the length of plain text (Pure Text) to the length of link text (Anchor) in the webpage.

3) Maximum text length. The length of the longest single passage of text in the webpage, used as a length measure for content pages.

4) Longest continuous text ratio. The ratio of the length of the concentrated text to the total text length of the webpage. In general, the text of a content page is concentrated in one block, whereas a page such as a special-topic page may have a long total text length that is relatively scattered.

5) Secondary navigation information;

6) Webpage title, etc.
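Several of the features listed above can be approximated with the standard library alone, as in the sketch below. The regular expressions, the feature names, and the use of the longest single text node as a stand-in for "concentrated text" are assumptions made for illustration; secondary navigation detection is omitted.

```python
import re
from html.parser import HTMLParser


class StructureFeatures(HTMLParser):
    """Collects text lengths and the title while parsing."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.plain_len = 0      # text outside links
        self.anchor_len = 0     # text inside <a> elements
        self.max_block_len = 0  # longest single plain-text node
        self.title = ""

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or "script" in self.open_tags or "style" in self.open_tags:
            return
        if "title" in self.open_tags:
            self.title += text
        elif "a" in self.open_tags:
            self.anchor_len += len(text)
        else:
            self.plain_len += len(text)
            self.max_block_len = max(self.max_block_len, len(text))


def extract_features(url, html):
    parser = StructureFeatures()
    parser.feed(html)
    parser.close()
    total = parser.plain_len + parser.anchor_len
    return {
        "url_is_index": bool(re.search(r"/index\.\w+$", url)),
        "url_has_content": "content" in url.lower(),
        "url_has_date": bool(re.search(r"(19|20)\d{2}[-/_]?\d{2}[-/_]?\d{2}", url)),
        "text_link_ratio": parser.plain_len / max(parser.anchor_len, 1),
        "max_text_length": parser.max_block_len,
        "longest_text_ratio": parser.max_block_len / total if total else 0.0,
        "title": parser.title,
    }


html = ('<html><head><title>Local news</title></head><body>'
        '<a href="/">Home</a><p>A long paragraph of article text here.</p></body></html>')
print(extract_features("http://example.com/news/2012-03-07/content_1.html", html))
```

The resulting dictionary is the kind of structural evidence that can later be combined with the text-based propensity value.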

Although some specific webpage structure features are enumerated in detail above, those skilled in the art will appreciate that the webpage structure features actually adopted are not limited to these, and the scope of protection of the embodiments of the present invention is likewise not limited to them.

Step 103: Identify the type of the webpage using the content type propensity value and the webpage structure features.

Here, based on the content type propensity value calculated in step 101 and the webpage structure features extracted in step 102, the threshold of each feature and the combination strategy can be determined by various preset judgment criteria, and the type of the page is finally obtained.

For example, when step 101 calculates a news type propensity value of the webpage according to its text content, the judgment criteria can specifically include:

1) When the news type propensity value is greater than a preset first news type threshold, the type of the webpage is directly judged to be news.

For example, assume that the news type propensity value ranges from 0 to 100, the calculated news type propensity value is 90, and the first news type threshold is 85. Because the calculated news type propensity value is greater than the first news type threshold, the webpage can be considered highly related to news, and its type can be directly judged to be news without considering the webpage structure features.

2) When the news type propensity value is greater than a preset second news type threshold and the webpage structure features contain news-related information, the type of the webpage is judged to be news, where the first news type threshold is greater than the second news type threshold.

For example, assume that the news type propensity value ranges from 0 to 100, the calculated news type propensity value is 70, the first news type threshold is 85, and the second news type threshold is 60. Because the calculated news type propensity value is less than the first news type threshold, the webpage cannot be directly identified as the news type. However, because the calculated value is greater than the second news type threshold, the webpage can be considered related to the news type, so the calculated news type propensity value and the webpage structure features must be combined to determine whether the webpage is of the news type. In this case, if the webpage structure features also contain news-related information (for example, the webpage title contains "news"), the type of the webpage can be judged to be news.

When the calculated news type propensity value is less than the second news type threshold, the webpage can be directly considered unrelated to the news type.
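The two-threshold judgment described above can be written as a small function. The default thresholds of 85 and 60 are taken from the worked examples in the text; the function and parameter names are hypothetical, and how "news-related information" is detected in the structure features is left to the caller.

```python
def is_news(propensity, structure_has_news_info,
            first_threshold=85, second_threshold=60):
    """Apply the two-threshold rule to a 0-100 news type propensity value."""
    if first_threshold <= second_threshold:
        raise ValueError("first_threshold must exceed second_threshold")
    if propensity > first_threshold:
        return True  # criterion 1: text evidence alone is enough
    if propensity > second_threshold and structure_has_news_info:
        return True  # criterion 2: text evidence plus structural evidence
    return False     # otherwise not identified as news


print(is_news(90, False))  # criterion 1 applies
print(is_news(70, True))   # criterion 2 applies
print(is_news(70, False))  # above the second threshold, but no structural evidence
```

Keeping the thresholds as parameters makes it straightforward to tune them separately for other webpage types.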

In embodiments of the present invention, for webpages of the news type, the final identification accuracy can reach 95% or more, with a recall rate above 80%.

Although the embodiments of the present invention are described in detail above using the news type as an example, those skilled in the art will appreciate from the above teaching that the webpage types to which the embodiments can be applied are not limited to the news type and can include many types, such as the knowledge question-and-answer type, the forum discussion type, or the online transaction type.

In the above method flow, there is no strict requirement on the execution order of step 101 and step 102. In practice, step 101 and step 102 can be performed simultaneously, step 101 can be performed before step 102, or step 101 can be performed after step 102.

Moreover, once the webpage type has been identified through the above flow, many applications can be carried out using the identified webpage type.

For example: the advertising relevance of the webpage can be calculated based on the identified webpage type; personalized news recommendation can be performed for the webpage based on the identified webpage type; structured webpage data can be extracted from the webpage based on the identified webpage type; or data screening for reading applications can be performed for the webpage based on the identified webpage type; and so on.

Based on the above detailed analysis, an exemplary flow of the present invention is described below, taking the determination of whether a webpage is of the news type as an example.

Fig. 2 is an exemplary flowchart of the webpage type identification method according to an embodiment of the present invention.

As shown in Fig. 2, the processing of a webpage has two branches. The left branch comprises step 201, step 202, and step 203, and the right branch comprises step 204 and step 205. The two branches converge at step 206. The left branch comprises:

Step 201: Perform data filtering. To guard against webpage noise, only the longer sentences in the webpage are extracted as text; here, sentences whose whole-sentence length is less than a predetermined value can be filtered out of the text content to improve classification.

Step 202: Segment the text into words using the feature set dictionary, then calculate the weight of each word feature (using the feature set and a feature weight calculation method such as TF-IDF) to form a feature vector.
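Step 202 can be illustrated by mapping a token list onto a fixed feature set to obtain a dense vector. The sketch assumes that the feature set and an inverse document frequency table were prepared in advance from training data; the names `to_feature_vector`, `feature_terms`, and `idf`, and the values used, are hypothetical.

```python
from collections import Counter


def to_feature_vector(tokens, feature_terms, idf):
    """Return one TF-IDF weight per entry of feature_terms, in order.

    Tokens outside the feature set are ignored apart from counting toward
    the document length.
    """
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero for empty input
    return [counts[term] / total * idf.get(term, 0.0) for term in feature_terms]


feature_terms = ["news", "report", "forum"]
idf = {"news": 1.0, "report": 2.0, "forum": 2.0}  # assumed precomputed values
print(to_feature_vector(["news", "news", "report", "weather"], feature_terms, idf))
```

Because every page is projected onto the same ordered feature set, the resulting vectors are directly comparable and can be passed to the classifier.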

Step 203: Feed the feature vector to the classifier to obtain an output value (ranging from 0 to 100 points), namely the news content type propensity value, which represents how strongly the content tends toward news. The classifier can be obtained in advance from training samples and features using the logistic regression algorithm.

The right branch comprises:

Step 204: Build the DOM tree. This includes building the DOM tree from the HTML tags of the webpage, including information such as the attributes of each tag.

Step 205: Extract structure features, such as secondary navigation and the text-to-link ratio, based on the DOM tree.

The left and right branches converge at step 206: combined judgment. Using the output of step 203 and the output of step 205, a preset strategy is applied to make the final determination of whether the page is a news content page.
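The two-branch flow can be sketched as a small orchestration function. Since the text and structure branches do not depend on each other, the sketch runs them concurrently and then applies a combining strategy. The function name `recognize`, the callable parameters, and the stand-in lambdas in the usage lines are hypothetical placeholders for real implementations of steps 201-205.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize(html, text_branch, structure_branch, combine):
    """Run both branches on the page, then combine their outputs (step 206)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        score_future = pool.submit(text_branch, html)          # steps 201-203
        features_future = pool.submit(structure_branch, html)  # steps 204-205
        return combine(score_future.result(), features_future.result())


# Stand-in branches that return fixed values, for illustration only.
result = recognize(
    "<html></html>",
    text_branch=lambda html: 70,
    structure_branch=lambda html: {"has_news_info": True},
    combine=lambda score, feats: score > 85 or (score > 60 and feats["has_news_info"]),
)
print(result)
```

Running the branches sequentially in either order would give the same result; concurrency here only reflects that the two branches are independent.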

Based on the above detailed description, embodiments of the present invention also propose a webpage type identification system.

As shown in Fig. 3, the system comprises a content type propensity value calculation unit 301, a structure feature extraction unit 302, and a type identification unit 303.

Wherein: the content type propensity value calculation unit 301 is configured to calculate the content type propensity value of a webpage according to the text content of the webpage;

the structure feature extraction unit 302 is configured to extract the webpage structure features of the webpage;

the type identification unit 303 is configured to identify the type of the webpage using the content type propensity value and the webpage structure features.

In one embodiment, the system further comprises a type processing unit (not shown in the figure). The type processing unit is configured to perform at least one of the following steps: calculating the advertising relevance of the webpage based on the identified webpage type; performing personalized news recommendation for the webpage based on the identified webpage type; extracting structured webpage data from the webpage based on the identified webpage type; or performing data screening for reading applications for the webpage based on the identified webpage type.

Specifically, the content type propensity value calculation unit 301 is configured to segment the text content of the webpage into words using a dictionary, calculate the weights of the word features to form a feature vector, and calculate the content type propensity value of the feature vector using a preset webpage content classifier.

Preferably, the content type propensity value calculation unit 301 is further configured to filter out of the text content, before the text content of the webpage is segmented using the dictionary, sentences whose whole-sentence length is less than a predetermined value.

Specifically, the structure feature extraction unit 302 is configured to build the Document Object Model (DOM) tree of the webpage and extract the webpage structure features from the DOM tree.

In one embodiment, the content type propensity value calculation unit 301 is configured to calculate the news type propensity value of the webpage according to the text content of the webpage. In this case, the type identification unit 303 is configured to perform at least one of the following steps: when the news type propensity value is greater than a preset first news type threshold, directly judging the type of the webpage to be news; or when the news type propensity value is greater than a preset second news type threshold and the webpage structure features contain news-related information, judging the type of the webpage to be news; where the first news type threshold is greater than the second news type threshold.

Similarly, the webpage types to which the webpage type identification system in the embodiments of the present invention can be applied are not limited to the news type and can include the knowledge question-and-answer type, the forum discussion type, the online transaction type, and so on.

In summary, in embodiments of the present invention, the content type propensity value of a webpage is calculated according to the text content of the webpage, the webpage structure features of the webpage are extracted, and the type of the webpage is then identified using the content type propensity value and the webpage structure features. The webpage is thus classified along two dimensions, one based on text content and the other based on webpage structure, and the category of the webpage is finally determined by combining the classification results of the two dimensions. Because both dimensions are considered together when classifying a webpage, classification accuracy is higher.

Moreover, in embodiments of the present invention, data filtering effectively removes noise in the webpage that is unrelated to the type being identified, such as tags, links, and advertisements, which improves the classification result.

The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A webpage type identification method, characterized in that the method comprises:

calculating a content type propensity value of a webpage according to the text content of the webpage;

extracting webpage structure features of the webpage;

identifying the type of the webpage using the content type propensity value and the webpage structure features;

wherein said calculating the content type propensity value of the webpage according to the text content of the webpage is specifically: calculating a news type propensity value of the webpage according to the text content of the webpage; and wherein:

identifying the type of the webpage using the news type propensity value and the webpage structure features specifically comprises at least one of the following steps:

when the news type propensity value is greater than a preset first news type threshold, directly judging the type of the webpage to be news; or

when the news type propensity value is greater than a preset second news type threshold and the webpage structure features contain news-related information, judging the type of the webpage to be news;

wherein the first news type threshold is greater than the second news type threshold.

2. The webpage type identification method according to claim 1, characterized in that the method further comprises at least one of the following steps:

calculating the advertising relevance of the webpage based on the identified webpage type;

performing personalized news recommendation for the webpage based on the identified webpage type;

extracting structured webpage data from the webpage based on the identified webpage type; or

performing data screening for reading applications for the webpage based on the identified webpage type.

3. The webpage type identification method according to claim 1, characterized in that said calculating the content type propensity value of the webpage according to the text content of the webpage specifically comprises:

segmenting the text content of the webpage into words using a dictionary, and calculating the weights of the word features to form a feature vector;

calculating the content type propensity value of the feature vector using a preset webpage content classifier.

4. The webpage type identification method according to claim 3, characterized in that, before the text content of the webpage is segmented using the dictionary, the method further comprises: filtering out of the text content sentences whose whole-sentence length is less than a predetermined value.

5. The webpage type identification method according to claim 3, characterized in that said calculating the weights of the word features is: calculating the weights of the word features using the term frequency-inverse document frequency (TF-IDF) weighting algorithm.

6. The webpage type identification method according to claim 3, characterized in that, in the method:

the webpage content classifier calculates the content type propensity value of the feature vector using the logistic regression classification algorithm.

7. The webpage type identification method according to claim 1, characterized in that said extracting the webpage structure features of the webpage specifically comprises:

building the Document Object Model (DOM) tree of the webpage;

extracting the webpage structure features from the DOM tree.

8. The webpage type identification method according to claim 7, characterized in that the webpage structure features comprise at least one of the following items of information:

secondary navigation information;

text-to-link ratio;

uniform resource locator (URL);

webpage title;

maximum text length; or

longest continuous text ratio.

9. The webpage type identification method according to claim 1, characterized in that the type of the webpage comprises the news type, the knowledge question-and-answer type, the forum discussion type, or the online transaction type.

10. A webpage type identification system, characterized in that the system comprises a content type propensity value calculation unit, a structure feature extraction unit, and a type identification unit, wherein:

the content type propensity value calculation unit is configured to calculate the content type propensity value of a webpage according to the text content of the webpage;

the type identification unit is configured to identify the type of the webpage using the content type propensity value and the webpage structure features;

11. The webpage type identification system according to claim 10, characterized in that the system further comprises a type processing unit, the type processing unit being configured to perform at least one of the following steps:

12. The webpage type identification system according to claim 10, characterized in that

the content type propensity value calculation unit is configured to segment the text content of the webpage into words using a dictionary, calculate the weights of the word features to form a feature vector, and calculate the content type propensity value of the feature vector using a preset webpage content classifier.

13. The webpage type identification system according to claim 10, characterized in that

the content type propensity value calculation unit is further configured to filter out of the text content, before the text content of the webpage is segmented using the dictionary, sentences whose whole-sentence length is less than a predetermined value.

14. The webpage type identification system according to claim 10, characterized in that

the structure feature extraction unit is configured to build the Document Object Model (DOM) tree of the webpage and extract the webpage structure features from the DOM tree.

15. The webpage type identification system according to claim 10, characterized in that the type of the webpage comprises the news type, the knowledge question-and-answer type, the forum discussion type, or the online transaction type.