CN103309862A

CN103309862A - Webpage type recognition method and system

Info

Publication number: CN103309862A
Application number: CN2012100580243A
Authority: CN
Inventors: 蔡兵; 彭默; 徐羽
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2012-03-07
Filing date: 2012-03-07
Publication date: 2013-09-18
Anticipated expiration: 2032-03-07
Also published as: CN103309862B

Abstract

The embodiment of the invention provides a webpage type recognition method and system. The method comprises the following steps: calculating a content type propensity value of a webpage according to the text content of the webpage; extracting webpage structural features of the webpage; and using the content type propensity value and the webpage structural features to recognize the type of the webpage. Because the method and system classify a webpage by jointly considering the text content dimension and the webpage structure dimension, classification accuracy is higher. Moreover, data filtering efficiently removes noise unrelated to type recognition, such as tags, links and advertisements, which further improves the classification effect.

Description

A webpage type recognition method and system

Technical field

Embodiments of the present invention relate to the technical field of Internet applications and, more specifically, to a webpage type recognition method and system.

Background technology

With the rapid development of computer and network technology, the Internet plays an increasingly important role in daily life, study and work. According to the latest Internet development survey report published by CNNIC, the number of Internet users in China has reached 513 million; there were 60 billion Chinese webpages in 2010, and at least 1 trillion webpages worldwide.

The webpages on the Internet contain a vast and complex body of information, and classifying them accurately to support subsequent work is a serious challenge. For example, in web advertising, displaying advertisements relevant to the webpage type can greatly increase users' click-through rate. In addition, with the development of the mobile Internet over the last two years, demand for mobile reading has surged, and news is undoubtedly one of the types users care about most. If news webpages can be identified, cleaner data can be provided for mobile reading, and page extraction can also be assisted.

At present, the prior art usually uses a naive Bayes text classification method to identify text content. This mainly comprises labeling training samples, using the words of the text as features, and estimating the category of the text by statistical methods.

First, the prior art classifies mainly according to webpage content, and classifying by webpage content alone does not yield high accuracy. Second, compared with webpages on the Internet, the data sources used for text classification are too simple to be practical.

Summary of the invention

Embodiments of the present invention propose a webpage type recognition method to improve webpage classification accuracy.

Embodiments of the present invention also propose a webpage type recognition system to improve webpage classification accuracy.

The specific scheme of the embodiments of the present invention is as follows:

A webpage type recognition method, the method comprising:

calculating a content type propensity value of a webpage according to the text content of the webpage;

extracting webpage structural features of the webpage;

identifying the type of the webpage using the content type propensity value and the webpage structural features.

A webpage type recognition system, the system comprising a content type propensity value computing unit, a structural feature extraction unit and a type identification unit, wherein:

the content type propensity value computing unit is configured to calculate the content type propensity value of a webpage according to the text content of the webpage;

the structural feature extraction unit is configured to extract the webpage structural features of the webpage;

the type identification unit is configured to identify the type of the webpage using the content type propensity value and the webpage structural features.

As can be seen from the above technical scheme, in embodiments of the present invention the content type propensity value of a webpage is calculated according to its text content, the webpage structural features of the webpage are extracted, and the type of the webpage is then identified using the content type propensity value and the webpage structural features. A webpage is thus first classified along two dimensions, one based on text content and the other based on webpage structure, and the classification results of the two dimensions are finally combined to determine the category of the webpage. Because the embodiments consider both the text content dimension and the webpage structure dimension when classifying a webpage, classification accuracy is higher.

Description of drawings

Fig. 1 is a flowchart of the webpage type recognition method according to an embodiment of the present invention;

Fig. 2 is an exemplary flowchart of the webpage type recognition method according to an embodiment of the present invention;

Fig. 3 is a structural diagram of the webpage type recognition system according to an embodiment of the present invention.

Embodiment

To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.

In embodiments of the present invention, a webpage is classified along two dimensions: one based on text content and the other based on webpage structure. The classification results of the two dimensions are then combined to determine the category of the webpage.

Fig. 1 is a flowchart of the webpage type recognition method according to an embodiment of the present invention.

As shown in Fig. 1, the method comprises:

Step 101: calculate the content type propensity value of the webpage according to the text content of the webpage.

This step performs a preliminary classification of the webpage type along the text content dimension. Classification by text content mainly uses a statistical machine learning classification algorithm, which calculates, from training samples and features, the probability that a page is of a particular type (such as the news type).

Specifically, the text content of the webpage can first be segmented into words using a dictionary, and the weights of the segmented-word features calculated to form a feature vector. The content type propensity value of this feature vector is then calculated by a preset webpage content classifier. The calculated content type propensity value can be regarded as the probability that the webpage belongs to the webpage type represented by that classifier.
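As an illustration only, dictionary-based segmentation can be sketched with forward maximum matching. The description does not name a segmentation algorithm, so the matching strategy, the function name `segment`, the `max_word_len` parameter and the sample dictionary are all assumptions.

```python
def segment(text, dictionary, max_word_len=4):
    """Forward maximum matching (an assumed strategy, not named in the
    description): at each position take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first and shrink until a word is found.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            # Characters not covered by the dictionary are skipped, so only
            # dictionary words become features.
            i += 1
        else:
            words.append(match)
            i += len(match)
    return words


# Hypothetical feature dictionary for a news classifier.
FEATURE_DICTIONARY = {"新闻", "记者", "报道", "发布"}
tokens = segment("记者报道新闻发布", FEATURE_DICTIONARY)
```

Skipping characters that match no dictionary entry keeps the feature space limited to the dictionary, which is one plausible reading of segmenting "using a dictionary".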

Besides text information, a webpage usually contains much other irrelevant content. Experiments show that using only the sentences in the webpage as the classification data source effectively removes noise such as tags, links and advertisements, and gives a better classification result. Therefore, in one embodiment, before the text content of the webpage is segmented using the dictionary, sentences whose whole-sentence length is less than a predetermined value can be filtered out of the text content to strengthen the classification effect.
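A minimal sketch of this filtering step follows, assuming sentences are delimited by common Chinese and Western terminators or by line breaks. The threshold of 10 characters and the function name are hypothetical; the description only specifies "a predetermined value".

```python
import re

MIN_SENTENCE_LENGTH = 10  # hypothetical predetermined value


def filter_short_sentences(text, min_length=MIN_SENTENCE_LENGTH):
    """Keep only sentences whose length is at least min_length characters."""
    # Assumed delimiters: Chinese and Western terminators and line breaks.
    # Splitting on "." is crude (it also splits decimals) but suffices here.
    sentences = re.split(r"[。！？!?.\n]+", text)
    return [s.strip() for s in sentences if len(s.strip()) >= min_length]


page_text = (
    "Home\nLogin\n"
    "The city council approved the new transit budget on Monday.\n"
    "Share"
)
kept = filter_short_sentences(page_text)
```

Short fragments such as menu labels and button captions fall below the threshold and are dropped, while the full sentence survives.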

Moreover, to reduce the cost of manually labeling data, various websites (for example, some news websites) can be used as entry points for crawling data. After simple manual review, a large amount of news data (for example, several thousand items) is obtained; words are then used as classification features, and dimensionality is reduced with algorithms such as feature selection.

In another embodiment, the classifier can use the logistic regression classification algorithm to calculate the content type propensity value of the feature vector. Logistic regression is a linear classifier that computes quickly and is well suited to real-time classification scenarios.
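The scoring step of a logistic regression classifier can be sketched as follows. The weights and bias would come from training, which is not shown; the function name and sample values are hypothetical, and scaling the probability to 0-100 is an assumption chosen to match the 0-100 range used in the threshold examples of this description.

```python
import math


def propensity(feature_vector, weights, bias):
    """Return a content type propensity value in the range 0-100.

    feature_vector and weights map feature names to floats; features with
    no trained weight contribute nothing.
    """
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in feature_vector.items())
    # Numerically stable logistic (sigmoid) function.
    if z >= 0:
        probability = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        probability = e / (1.0 + e)
    return 100.0 * probability  # assumed scaling to 0-100


# Hypothetical trained parameters and one page's feature vector.
weights = {"reporter": 1.8, "announce": 0.9, "price": -1.2}
score = propensity({"reporter": 0.5, "announce": 0.3}, weights, bias=-0.2)
```

Scoring is a single weighted sum followed by a sigmoid, which is why the method suits real-time use.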

In one embodiment, the term frequency-inverse document frequency (TF-IDF) weighting algorithm can be used to calculate the weights of the segmented-word features.

TF-IDF is a weighting technique commonly used in information retrieval and information mining to assess how important a word is to one document in a document set or corpus. In TF-IDF, the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus.
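One common variant of this weighting can be sketched as below. The smoothed IDF formula, the function name and the toy corpus are assumptions; the description does not fix a particular TF-IDF variant.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Return {term: weight} for one tokenized document.

    corpus is a list of tokenized documents. The smoothed IDF
    log((1 + N) / (1 + df)) is an assumed variant that avoids division
    by zero for terms absent from the corpus.
    """
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    weights = {}
    for term, count in counts.items():
        tf = count / len(doc_tokens)                     # term frequency
        df = sum(1 for doc in corpus if term in doc)     # document frequency
        idf = math.log((1 + n_docs) / (1 + df))
        weights[term] = tf * idf
    return weights


corpus = [["news", "report"], ["news", "forum"], ["shop", "price"]]
vector = tf_idf(["news", "report", "report"], corpus)
```

Here "report" appears twice in the document and in only one corpus document, so it receives a larger weight than "news", which is both rarer in the document and more widespread in the corpus.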

Various forms of TF-IDF weighting are often used by search engines as a measure or ranking of the relevance between a document and a user query. Besides TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.

Step 102: extract the webpage structural features of the webpage.

This step performs a preliminary classification of the webpage type along the webpage structure dimension. Specifically, a Document Object Model (DOM) tree can first be built for the webpage, and webpage structural features are then extracted by traversing the DOM tree to serve as the basis for structural classification.

According to the W3C DOM standard, DOM is an interface independent of browser, platform and language that lets users access the other standard components of a page. DOM resolved the conflict between Netscape's JavaScript and Microsoft's JScript, giving web designers and developers a standard way to access the data, scripts and presentation-layer objects of a site. DOM is a set of nodes or pieces of information organized in a hierarchy. This hierarchy allows developers to navigate the tree in search of specific information. Parsing this structure usually requires loading the entire document and building the hierarchy before any work can be done. Because it is based on a hierarchy of information, DOM is regarded as tree-based or object-based.
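To make the tree-building step concrete, the following sketch builds a simplified DOM-like tree with Python's standard `html.parser` module. It is not a W3C-conformant DOM implementation; the `Node` and `DomBuilder` classes, `build_dom` and `walk` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])   # tag attributes, e.g. href
        self.parent = parent
        self.children = []
        self.text = ""                   # text directly inside this node


class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node          # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag name, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node):
    """Depth-first, pre-order traversal of the tree."""
    yield node
    for child in node.children:
        yield from walk(child)


root = build_dom("<html><head><title>News</title></head>"
                 "<body><p>Hello <a href='/x'>link</a></p><br></body></html>")
tags = [n.tag for n in walk(root)]
```

The traversal in `walk` is the hook where structural features such as the title text or anchor text lengths would be collected.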

For example, the webpage structural features extracted by traversing the DOM tree can include:

1) URL features. For example, if the URL ends with index.html or similar, the page can basically be judged to be an index page. If the URL contains "content" or a date, the page is more likely to be a content page.

2) Text-to-link ratio. Calculate the ratio of the length of the plain text (Pure Text) in the webpage to the length of the link text (Anchor).

3) Maximum text length. Calculate the length of the longest single passage of text in the webpage. This serves as a length threshold for content pages.

4) Longest continuous text ratio, that is, the proportion of the total text length of the webpage accounted for by its concentrated text. In general, the text of a content page is concentrated in one place, whereas on pages such as special-topic pages the text, although long overall, is relatively dispersed.

5) Secondary navigation information;

6) Webpage title, and so on.
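Features 1) to 4) above can be sketched as follows. The regular expressions, the feature names and the choice to treat each text node as one "passage" are assumptions made for illustration; secondary navigation and title features are omitted.

```python
import re
from html.parser import HTMLParser


class TextStats(HTMLParser):
    """Collects plain-text and anchor-text lengths while the page is parsed."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.skip_depth = 0      # inside <script> or <style>
        self.plain_len = 0
        self.anchor_len = 0
        self.blocks = []         # length of each plain-text node

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        if self.anchor_depth:
            self.anchor_len += len(text)
        else:
            self.plain_len += len(text)
            self.blocks.append(len(text))


def structural_features(url, html):
    stats = TextStats()
    stats.feed(html)
    stats.close()
    longest = max(stats.blocks, default=0)
    return {
        # Assumed patterns for "ends with index.html" and "content or date".
        "url_is_index": bool(re.search(r"/index\.\w+$", url)),
        "url_has_content_or_date": bool(
            re.search(r"content|\d{4}[-/]?\d{2}[-/]?\d{2}", url)),
        "text_link_ratio": stats.plain_len / max(stats.anchor_len, 1),
        "max_text_length": longest,
        "longest_text_ratio": longest / stats.plain_len if stats.plain_len else 0.0,
    }


sample_html = ("<p>" + "x" * 80 + "</p><p>short text</p>"
               "<a href='/a'>more links</a><script>var a=1;</script>")
features = structural_features(
    "http://example.com/content/2012-03-07/a.html", sample_html)
```

A content page would typically show a high text-to-link ratio and a high longest-text ratio, while an index page would show the opposite.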

Although some specific webpage structural features are listed in detail above, those skilled in the art will appreciate that the webpage structural features actually used are not limited to these, nor is the protection scope of the embodiments of the present invention.

Step 103: identify the type of the webpage using the content type propensity value and the webpage structural features.

Here, based on the content type propensity value calculated in step 101 and the webpage structural features extracted in step 102, the type of the page is finally determined through various judgment criteria that combine preset thresholds for each feature with a combination strategy.

For example, when the news type propensity value of the webpage is calculated from its text content in step 101, the judgment criteria can specifically include:

1) When the news type propensity value is greater than a preset first news type threshold, directly judge the type of the webpage to be news.

For example, suppose the news type propensity value ranges from 0 to 100, the calculated news type propensity value is 90, and the first news type threshold is 85. Because the calculated value is greater than the first news type threshold, the webpage can be considered highly correlated with news, and its type can be judged to be news directly, without considering the webpage structural features.

2) When the news type propensity value is greater than a preset second news type threshold and the webpage structural features contain news-category information, judge the type of the webpage to be news, where the first news type threshold is greater than the second news type threshold.

For example, suppose the news type propensity value ranges from 0 to 100, the calculated news type propensity value is 70, the first news type threshold is 85, and the second news type threshold is 60. Because the calculated value is less than the first news type threshold, the webpage cannot be directly deemed to be of the news type. However, because the calculated value is greater than the second news type threshold, the webpage can be considered related to the news type, so the calculated news type propensity value and the webpage structural features must be combined to judge whether the webpage is of the news type. In this case, if the webpage structural features also contain news-category information (for example, the webpage title contains "news"), the type of the webpage can be judged to be news.

When the calculated news type propensity value is less than the second news type threshold, the webpage can be directly deemed unrelated to the news type.
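The two criteria above can be sketched as a small decision function. The threshold values 85 and 60 are the example values from the text; the function name and the boolean `has_news_structure` flag (standing for "the structural features contain news-category information") are hypothetical.

```python
FIRST_THRESHOLD = 85   # example first news type threshold from the text
SECOND_THRESHOLD = 60  # example second news type threshold from the text


def is_news(propensity, has_news_structure,
            first=FIRST_THRESHOLD, second=SECOND_THRESHOLD):
    """Combine the content propensity value with structural evidence."""
    if first <= second:
        raise ValueError("first threshold must exceed second threshold")
    if propensity > first:
        return True                 # criterion 1: content alone suffices
    if propensity > second:
        return has_news_structure   # criterion 2: structure must agree
    return False                    # treated as unrelated to news


decisions = [
    is_news(90, False),  # above the first threshold
    is_news(70, True),   # between thresholds, structure agrees
    is_news(70, False),  # between thresholds, no structural evidence
    is_news(50, True),   # not above the second threshold
]
```

A value exactly equal to a threshold falls through to the next rule, since the criteria use "greater than".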

In embodiments of the present invention, for webpages of the news type, the final recognition accuracy can exceed 95% and the recall rate can exceed 80%.

Although the embodiments of the present invention are described in detail above taking the news type as an example, those skilled in the art will recognize from the above detailed teaching that the webpage types to which the embodiments apply are not limited to the news type, but can include multiple types such as the knowledge Q&A type, the forum discussion type or the online transaction type.

In the above method flow, there is no strict requirement on the execution order of step 101 and step 102. In fact, step 101 and step 102 can be executed simultaneously; alternatively, step 101 can be executed first and then step 102, or step 102 first and then step 101.
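Because the two steps are independent, they can be run side by side. The sketch below uses a thread pool for this; `compute_propensity` and `extract_structure` are hypothetical placeholders standing in for step 101 and step 102, not the actual computations.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_propensity(page_text):
    # Placeholder for step 101 (content type propensity value).
    return 70.0 if "news" in page_text.lower() else 10.0


def extract_structure(html):
    # Placeholder for step 102 (webpage structural features).
    return {"title_has_news": "<title>news" in html.lower()}


def analyse(page_text, html):
    """Run step 101 and step 102 concurrently and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        content_future = pool.submit(compute_propensity, page_text)
        structure_future = pool.submit(extract_structure, html)
        return content_future.result(), structure_future.result()


result = analyse("Daily news report", "<title>News today</title>")
```

Neither step reads the other's output, so the order of completion does not affect the result passed to step 103.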

Moreover, after the webpage type has been identified by the above flow, the identified type can be used in many kinds of applications.

For example, based on the identified webpage type, the advertisement relevance of the webpage can be calculated; personalized news recommendation can be performed for the webpage; structured webpage data can be extracted from the webpage; or data screening for reading applications can be performed for the webpage, and so on.

Based on the above detailed analysis, an exemplary flow of the present invention is described below, taking the determination of whether a webpage is of the news type as an example.

Fig. 2 is an exemplary flowchart of the webpage type recognition method according to an embodiment of the present invention.

As shown in Fig. 2, the processing of a webpage has two branches. The left branch comprises step 201, step 202 and step 203, and the right branch comprises step 204 and step 205. The two branches converge at step 206. The left branch comprises:

Step 201: perform data filtering. To guard against webpage noise, only the longer sentences in the webpage are extracted as text; here, sentences whose whole-sentence length is less than a predetermined value can be filtered out of the text content to strengthen the classification effect.

Step 202: segment the text into words using the feature-set dictionary, then calculate the weight of each segmented-word feature (using the feature set and a feature weighting method such as TF-IDF) to form a feature vector.

Step 203: feed the feature vector to the classifier to obtain an output value (ranging from 0 to 100 points), namely the news content type propensity value, which represents the degree to which the content tends to be news. The classifier can be obtained in advance from training samples and features by the logistic regression algorithm.

The right branch comprises:

Step 204: build the DOM tree. This comprises building the DOM tree from the HTML tags of the webpage, including information such as the attributes of each tag.

Step 205: extract structural features based on the DOM tree, such as secondary navigation, text-to-link ratio, and so on.

The left and right branches converge at step 206: combined judgment. The outputs of step 203 and step 205 are used, together with a preset strategy, to make the final determination of whether the page is a news content page.

Based on the above detailed discussion, embodiments of the present invention also propose a webpage type recognition system.

As shown in Fig. 3, the system comprises a content type propensity value computing unit 301, a structural feature extraction unit 302 and a type identification unit 303.

The content type propensity value computing unit 301 is configured to calculate the content type propensity value of a webpage according to the text content of the webpage;

the structural feature extraction unit 302 is configured to extract the webpage structural features of the webpage;

the type identification unit 303 is configured to identify the type of the webpage using the content type propensity value and the webpage structural features.

In one embodiment, the system further comprises a type processing unit (not shown in the figure). The type processing unit is configured to perform at least one of the following steps: based on the identified webpage type, calculating the advertisement relevance of the webpage; based on the identified webpage type, performing personalized news recommendation for the webpage; based on the identified webpage type, extracting structured webpage data from the webpage; or based on the identified webpage type, performing data screening for reading applications for the webpage.

Specifically, the content type propensity value computing unit 301 is configured to segment the text content of the webpage into words using a dictionary, calculate the weights of the segmented-word features to form a feature vector, and calculate the content type propensity value of this feature vector by a preset webpage content classifier.

Preferably, the content type propensity value computing unit 301 is further configured to filter out of the text content, before the text content of the webpage is segmented using the dictionary, sentences whose whole-sentence length is less than a predetermined value.

Specifically, the structural feature extraction unit 302 is configured to build the Document Object Model (DOM) tree of the webpage and to extract the webpage structural features from the DOM tree.

In one embodiment, the content type propensity value computing unit 301 is configured to calculate the news type propensity value of the webpage according to the text content of the webpage. In this case, the type identification unit 303 is configured to perform at least one of the following steps: when the news type propensity value is greater than a preset first news type threshold, directly judging the type of the webpage to be news; or when the news type propensity value is greater than a preset second news type threshold and the webpage structural features contain news-category information, judging the type of the webpage to be news; where the first news type threshold is greater than the second news type threshold.

Similarly, the webpage types to which the webpage type recognition system in the embodiments of the present invention applies are not limited to the news type, but can include the knowledge Q&A type, the forum discussion type, the online transaction type, and so on.

In summary, in embodiments of the present invention the content type propensity value of a webpage is calculated according to its text content, the webpage structural features of the webpage are extracted, and the type of the webpage is then identified using the content type propensity value and the webpage structural features. A webpage is thus classified along two dimensions, one based on text content and the other based on webpage structure, and the classification results of the two dimensions are finally combined to determine the category of the webpage. Because the embodiments consider both the text content dimension and the webpage structure dimension when classifying a webpage, classification accuracy is higher.

Moreover, in embodiments of the present invention, data filtering effectively removes noise in the webpage that is unrelated to type recognition, such as tags, links and advertisements, so that the classification effect is better.

The above are only preferred embodiments of the present invention and are not intended to limit its protection scope. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

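1. A webpage type recognition method, characterized in that the method comprises:

calculating a content type propensity value of a webpage according to the text content of the webpage;

extracting webpage structural features of the webpage;

identifying the type of the webpage using the content type propensity value and the webpage structural features.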

2. The webpage type recognition method according to claim 1, characterized in that the method further comprises at least one of the following steps:

based on the identified webpage type, calculating the advertisement relevance of the webpage;

based on the identified webpage type, performing personalized news recommendation for the webpage;

based on the identified webpage type, extracting structured webpage data from the webpage; or

based on the identified webpage type, performing data screening for reading applications for the webpage.

3. The webpage type recognition method according to claim 1, characterized in that calculating the content type propensity value of the webpage according to the text content of the webpage specifically comprises:

segmenting the text content of the webpage into words using a dictionary, and calculating the weights of the segmented-word features to form a feature vector;

calculating the content type propensity value of this feature vector by a preset webpage content classifier.

4. The webpage type recognition method according to claim 3, characterized in that, before the text content of the webpage is segmented using the dictionary, the method further comprises: filtering out of the text content sentences whose whole-sentence length is less than a predetermined value.

5. The webpage type recognition method according to claim 3, characterized in that calculating the weights of the segmented-word features is: using the term frequency-inverse document frequency (TF-IDF) weighting algorithm to calculate the weights of the segmented-word features.

6. The webpage type recognition method according to claim 3, characterized in that, in the method:

the webpage content classifier uses the logistic regression classification algorithm to calculate the content type propensity value of this feature vector.

7. The webpage type recognition method according to claim 1, characterized in that extracting the webpage structural features of the webpage specifically comprises:

building the Document Object Model (DOM) tree of the webpage;

extracting the webpage structural features from the DOM tree.

8. The webpage type recognition method according to claim 7, characterized in that the webpage structural features comprise at least one of the following:

secondary navigation information;

text-to-link ratio;

uniform resource locator (URL);

webpage title;

maximum text length; or

longest continuous text ratio.

9. The webpage type recognition method according to claim 1, characterized in that

calculating the content type propensity value of the webpage according to the text content of the webpage is specifically: calculating the news type propensity value of the webpage according to the text content of the webpage; wherein:

identifying the type of the webpage using the news type propensity value and the webpage structural features specifically comprises at least one of the following steps:

when the news type propensity value is greater than a preset first news type threshold, directly judging the type of the webpage to be news; or

when the news type propensity value is greater than a preset second news type threshold and the webpage structural features contain news-category information, judging the type of the webpage to be news;

wherein the first news type threshold is greater than the second news type threshold.

10. The webpage type recognition method according to claim 1, characterized in that the type of the webpage comprises the news type, the knowledge Q&A type, the forum discussion type or the online transaction type.

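11. A webpage type recognition system, characterized in that the system comprises a content type propensity value computing unit, a structural feature extraction unit and a type identification unit, wherein:

the content type propensity value computing unit is configured to calculate the content type propensity value of a webpage according to the text content of the webpage;

the structural feature extraction unit is configured to extract the webpage structural features of the webpage;

the type identification unit is configured to identify the type of the webpage using the content type propensity value and the webpage structural features.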

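12. The webpage type recognition system according to claim 11, characterized in that the system further comprises a type processing unit, the type processing unit being configured to perform at least one of the following steps:

based on the identified webpage type, calculating the advertisement relevance of the webpage;

based on the identified webpage type, performing personalized news recommendation for the webpage;

based on the identified webpage type, extracting structured webpage data from the webpage; or

based on the identified webpage type, performing data screening for reading applications for the webpage.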

13. The webpage type recognition system according to claim 11, characterized in that

the content type propensity value computing unit is configured to segment the text content of the webpage into words using a dictionary, calculate the weights of the segmented-word features to form a feature vector, and calculate the content type propensity value of this feature vector by a preset webpage content classifier.

14. The webpage type recognition system according to claim 11, characterized in that

the content type propensity value computing unit is further configured to filter out of the text content, before the text content of the webpage is segmented using the dictionary, sentences whose whole-sentence length is less than a predetermined value.

15. The webpage type recognition system according to claim 11, characterized in that

the structural feature extraction unit is configured to build the Document Object Model (DOM) tree of the webpage and to extract the webpage structural features from the DOM tree.

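16. The webpage type recognition system according to claim 11, characterized in that

the content type propensity value computing unit is configured to calculate the news type propensity value of the webpage according to the text content of the webpage;

the type identification unit is configured to perform at least one of the following steps:

when the news type propensity value is greater than a preset first news type threshold, directly judging the type of the webpage to be news; or

when the news type propensity value is greater than a preset second news type threshold and the webpage structural features contain news-category information, judging the type of the webpage to be news;

wherein the first news type threshold is greater than the second news type threshold.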

17. The webpage type recognition system according to claim 11, characterized in that the type of the webpage comprises the news type, the knowledge Q&A type, the forum discussion type or the online transaction type.