CN103377243B

CN103377243B - Method and apparatus for performing format classification on web pages

Info

Publication number: CN103377243B
Application number: CN201210127531.8A
Authority: CN
Inventors: 蔡兵; 黄钰; 徐羽; 张凯
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Yayue Technology Co ltd
Priority date: 2012-04-27
Filing date: 2012-04-27
Publication date: 2017-09-08
Anticipated expiration: 2032-04-27
Also published as: CN103377243A

Abstract

The invention discloses a method and apparatus for performing format classification on web pages. When any Web page needs to be classified, the following processing is performed: information that reflects the page layout features of the Web page is obtained; based on the obtained information, the probabilities that the Web page belongs to each of N preset format categories are determined, where N is a positive integer greater than 1; and the format category with the highest probability is taken as the format category of the Web page. The scheme of the invention can improve the accuracy of classification results.

Description

Method and apparatus for performing format classification on web pages

Technical field

The present invention relates to Internet technology, and in particular to a method and apparatus for performing format classification on web pages.

Background

At present, there are two main ways of classifying Web pages: content classification and format classification.

Content classification distinguishes pages by differences in their body content, yielding categories such as news pages and question-and-answer pages. Format classification distinguishes pages by differences in their main structural framework, yielding categories such as blog pages and forum pages.

Research on content classification is relatively mature, whereas research on format classification is still somewhat lacking. In practice, format classification results can be used to build web page models, to provide reference information for page information extraction, and to distinguish categories of search engine results, so they are of considerable significance.

In the prior art, format classification is mainly implemented by combining a list with typical Uniform Resource Locator (URL) features, as follows:

For any Web page X, its URL is first matched against a list. The list may contain a series of domain names and the format category corresponding to each. For example, if one domain name in the list is hi.baidu.com and its corresponding format category is blog page, then a Web page X whose URL contains "hi.baidu.com" is determined to be a blog page. If the format category of Web page X cannot be determined from the list, typical URL features can be used instead: for example, if the URL of Web page X contains "bbs", Web page X is determined to be a forum page.
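The prior-art lookup described above can be sketched as follows. This is only an illustration: the names `DOMAIN_LIST`, `URL_FEATURES` and `classify_by_list_and_url` are hypothetical, and the list contents are limited to the two examples given in the text.

```python
# Hypothetical list mapping known domain names to format categories.
DOMAIN_LIST = {"hi.baidu.com": "blog page"}

# Hypothetical typical URL features mapped to format categories.
URL_FEATURES = {"bbs": "forum page"}


def classify_by_list_and_url(url):
    """Return a format category, or None if neither the list nor a URL feature matches."""
    lowered = url.lower()
    # Step 1: match the URL against the domain list.
    for domain, category in DOMAIN_LIST.items():
        if domain in lowered:
            return category
    # Step 2: fall back to typical URL features.
    for feature, category in URL_FEATURES.items():
        if feature in lowered:
            return category
    return None
```

A page whose URL matches neither the list nor any typical feature yields `None`, which is exactly the coverage gap the next paragraph criticizes.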

However, this approach has a problem in practice: the domain names that a list can cover are very limited, and the URLs of many Web pages contain no typical URL feature such as "bbs", so many Web pages cannot be classified correctly.

Summary of the invention

In view of this, the invention provides a method and apparatus for performing format classification on web pages, which can improve the accuracy of classification results.

To achieve the above object, the technical solution of the invention is implemented as follows:

A method for performing format classification on web pages, in which, when any Web page needs to be classified, the following processing is performed:

obtaining information in the Web page that reflects page layout features;

determining, based on the obtained information, the probabilities that the Web page belongs to each of N preset format categories, where N is a positive integer greater than 1;

taking the format category with the highest probability as the format category of the Web page.

An apparatus for performing format classification on web pages, comprising:

a first processing module, configured to, when any Web page needs to be classified, obtain information in the Web page that reflects page layout features and send it to a second processing module;

the second processing module, configured to determine, based on the obtained information, the probabilities that the Web page belongs to each of N preset format categories, where N is a positive integer greater than 1, and to take the format category with the highest probability as the format category of the Web page.

It can be seen that, with the scheme of the invention, for any Web page, the probabilities that the Web page belongs to the different format categories are determined from the obtained information reflecting its page layout features, and the format category with the highest probability is taken as the format category of the Web page. Compared with the prior art, the scheme does not depend on a list or on typical URL features and is applicable to arbitrary Web pages, and thus improves the accuracy of classification results. Moreover, the scheme is simple to implement and easy to popularize.

Brief description of the drawings

Fig. 1 is a flowchart of an embodiment of the method of the invention for performing format classification on web pages.

Fig. 2 is a schematic diagram of the process of the invention for performing format classification on web pages.

Fig. 3 is a schematic diagram of the two-level format classification of the invention.

Fig. 4 is a schematic structural diagram of an embodiment of the apparatus of the invention for performing format classification on web pages.

Detailed description

To address the problems in the prior art, the present invention proposes an improved scheme for performing format classification on web pages.

To make the technical solution of the invention clearer, the scheme is described in further detail below with reference to the accompanying drawings and embodiments.

Fig. 1 is a flowchart of an embodiment of the method of the invention for performing format classification on web pages. Whenever a Web page needs to be classified, it is processed according to the flow shown in Fig. 1.

Step 11: Obtain information in Web page X that reflects page layout features.

For ease of description, Web page X denotes any Web page.

In this step, a Document Object Model (DOM) tree of Web page X is first built; the content source information and structure feature information in Web page X are then extracted according to the DOM tree.

The content source information may include tags and short text; the structure feature information may include the URL, secondary navigation and the title.

Page layout features are generally not reflected in long text such as body text and sentences. Therefore, only the short text and tags in Web page X need be extracted as content source information, and the URL of Web page X, its secondary navigation and its title are extracted as structure feature information. The title is the web page title of Web page X. Short text is a character string in the page's Hypertext Markup Language (HTML) source file that contains no punctuation and is of limited length; it is typically used for prompt information on the page.

How to build the DOM tree and how to extract the content source information and structure feature information can follow the prior art and are not described further here.
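Since the text leaves the extraction to the prior art, the following is only a rough sketch of Step 11 using Python's standard `html.parser`. It tracks a stack of open elements rather than building a full DOM tree, and it omits secondary navigation because the text does not say how that is located. The length limit `MAX_SHORT_TEXT_LEN`, the punctuation set and all function and class names are assumptions.

```python
from html.parser import HTMLParser
import string

MAX_SHORT_TEXT_LEN = 12  # assumed length limit; the text gives no value
PUNCTUATION = set(string.punctuation) | set("，。！？；：、")


class LayoutInfoExtractor(HTMLParser):
    """Collects tag names, short texts and the title while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.short_texts = []
        self.title = ""
        self._stack = []  # open elements, a stand-in for a DOM path

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        current = self._stack[-1] if self._stack else ""
        if current == "title":
            self.title = text
        elif current in ("script", "style"):
            return  # not visible page text
        elif len(text) <= MAX_SHORT_TEXT_LEN and not (set(text) & PUNCTUATION):
            self.short_texts.append(text)


def extract_layout_info(url, html):
    """Return (content source information, structure feature information)."""
    parser = LayoutInfoExtractor()
    parser.feed(html)
    parser.close()
    content_source = {"tags": parser.tags, "short_texts": parser.short_texts}
    structure = {"url": url, "title": parser.title}
    return content_source, structure
```

Long or punctuated strings are dropped on purpose, following the observation that layout features do not show up in body text.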

Step 12: Based on the obtained information, determine the probabilities that Web page X belongs to each of N preset format categories, where N is a positive integer greater than 1; take the format category with the highest probability as the format category of Web page X.

In this step, a text vector is first generated from the extracted content source information. A specific way of doing so is as follows: perform word segmentation on the extracted content source information; generate an M-dimensional text vector, where M equals the number of feature words in a pre-generated text dictionary, each component of the text vector corresponds to one feature word in the text dictionary, and the text dictionary records feature words that reflect the page layout features of each format category; for each word segmentation result, determine whether it is identical to a feature word in the text dictionary, and if so, set the corresponding component of the text vector to 1; all other components are 0. The text dictionary is usually compiled by human editors.

For example, suppose Web page X is a forum page, and word segmentation of the content source information extracted from Web page X yields the following results: post, reply, moderator, thread starter. Suppose further that all of these appear in the text dictionary. The components corresponding to these word segmentation results are then set to 1.
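The text vector construction can be illustrated with a minimal sketch. The dictionary contents below are hypothetical English stand-ins for the hand-edited feature words, and word segmentation is assumed to have been done already.

```python
# Hypothetical hand-edited text dictionary of M feature words (M = 6 here).
TEXT_DICTIONARY = ["post", "reply", "moderator", "thread starter", "chapter", "archive"]


def build_text_vector(segmented_words, dictionary=TEXT_DICTIONARY):
    """Return an M-dimensional 0/1 vector; component i is 1 if dictionary[i] occurs."""
    present = set(segmented_words)
    return [1 if word in present else 0 for word in dictionary]
```

For the forum example above, `build_text_vector(["post", "reply", "moderator", "thread starter"])` sets the first four components to 1 and leaves the remaining two at 0.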

Next, based on the generated text vector, a pre-generated logistic regression model is used to calculate the tendency degree of Web page X toward each format category, giving N results.

Then, based on the extracted structure feature information and the N calculated tendency degrees, a pre-generated naive Bayes model is used to calculate the probability that Web page X belongs to each format category, again giving N results.

Logistic regression is a linear classification model that is fast and effective; the invention uses it to determine the tendency degree of Web page X toward the different format categories. Naive Bayes is a prediction model based on the assumption of independence between features; the invention uses it to determine the final format category probabilities. In the embodiments of the invention, both the logistic regression model and the naive Bayes model are trained offline. How to train them is prior art, as is how to calculate the tendency degrees and probabilities.
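The text does not specify how the tendency degrees enter the naive Bayes model, so the sketch below makes several assumptions: one-vs-rest logistic regression per category, tendency degrees discretized into "high"/"low" at 0.5, and structure features already reduced to discrete values. The model layout (`weights`, `biases`, `priors`, `likelihoods`) is hypothetical, and all likelihoods are assumed to be non-zero (for example, smoothed during offline training).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tendencies(text_vector, weights, biases):
    """One-vs-rest logistic regression: one tendency degree per format category."""
    return {
        c: sigmoid(sum(w * x for w, x in zip(weights[c], text_vector)) + biases[c])
        for c in weights
    }


def naive_bayes_posterior(features, priors, likelihoods):
    """P(c | features) under the feature-independence assumption, normalized over c.

    likelihoods[c][name][value] holds P(feature name = value | category c).
    """
    log_scores = {}
    for c, prior in priors.items():
        log_score = math.log(prior)
        for name, value in features.items():
            log_score += math.log(likelihoods[c][name][value])
        log_scores[c] = log_score
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


def category_probabilities(text_vector, structure_features, model):
    """Combine structure features with discretized tendency degrees (an assumption)."""
    degrees = tendencies(text_vector, model["weights"], model["biases"])
    features = dict(structure_features)
    for c, degree in degrees.items():
        features["tendency_" + c] = "high" if degree >= 0.5 else "low"
    return naive_bayes_posterior(features, model["priors"], model["likelihoods"])
```

The returned dictionary holds N probabilities that sum to 1, one per format category, which is the input the final selection step needs.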

As stated in Step 12, once the probabilities that Web page X belongs to each format category have been calculated, the format category with the highest probability can be taken as the format category of Web page X. Alternatively, to further improve the accuracy of the classification result, it can first be determined whether the highest probability is greater than a predetermined threshold. If it is, the format category with the highest probability is taken as the format category of Web page X; otherwise, the format category of Web page X is determined by the existing approach of a list plus typical URL features.
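The selection with threshold fallback amounts to a few lines. In this sketch the function name and the `fallback` callable (standing in for the list plus typical URL feature approach) are hypothetical, and the threshold value is left to the caller.

```python
def decide_category(probabilities, threshold, fallback):
    """Pick the most probable category, or defer to fallback() at or below the threshold."""
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] > threshold:
        return best
    return fallback()
```

Passing the fallback as a callable keeps the probabilistic path independent of how the list-based lookup is implemented.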

In summary, Fig. 2 is a schematic diagram of the process of the invention for performing format classification on web pages.

The process shown in Fig. 2 implements first-level format classification of web pages. On this basis, second-level format classification can also be performed.

Accordingly, after the format category of Web page X is determined, the subcategory of Web page X can be further determined. The format category of Web page X is further divided into Z subcategories, where Z is a positive integer greater than 1. The specific values of N and Z can be chosen according to actual needs.

Fig. 3 is a schematic diagram of the two-level format classification of the invention. As shown in Fig. 3, the first-level format classification results (i.e. the format categories) include: blog page, novel page and forum page. The second-level results for blog page (i.e. the subcategories of blog page) include: blog content page and blogroll page. The second-level results for novel page (i.e. the subcategories of novel page) include: novel list page, novel content page and novel lobby page. The second-level results for forum page (i.e. the subcategories of forum page) include: forum post page and forum list page.

Note that Fig. 3 is only an example and is not intended to limit the technical solution of the invention. For example, depending on actual needs, the first-level format classification results may include other categories, such as news page, which can in turn be given a second-level format classification.

The two levels of format classification are independent of each other and can each serve needs of a different granularity. The scheme is also highly extensible: the second-level classifications of the different format categories are independent of one another, and a third level or further levels can be added if needed.

After the format category of Web page X is determined, the subcategory of Web page X can be determined based on at least one discriminating feature applicable to that format category.

The discriminating features typically include: text link ratio, URL features, or specific blocks.

1) Text link ratio: the ratio of the length of link text to the length of page text. It can be used to distinguish list pages from content pages: if the ratio is greater than a predetermined threshold, the page is considered a list page.
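One possible way to compute the text link ratio is sketched below with Python's standard `html.parser`. Measuring length in characters, counting all text nodes as page text, and the default threshold of 0.5 are assumptions; the text fixes none of them.

```python
from html.parser import HTMLParser


class _LinkTextCounter(HTMLParser):
    """Counts characters of text inside <a> elements and in the page overall."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.link_depth > 0:
            self.link_chars += n


def text_link_ratio(html):
    """Ratio of link text length to total page text length (0.0 for an empty page)."""
    counter = _LinkTextCounter()
    counter.feed(html)
    counter.close()
    return counter.link_chars / counter.total_chars if counter.total_chars else 0.0


def is_list_page(html, threshold=0.5):  # threshold value is an assumption
    return text_link_ratio(html) > threshold
```

A page made up mostly of anchors scores close to 1, while an article with a handful of links scores close to 0.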

2) URL features: for example, the URL of a novel content page usually contains multiple digit strings, the URL of a novel lobby page usually contains the string view_book, and the URL of a novel list page usually contains a string such as list or catalog. In practice, therefore, if the number of digit strings in the URL is greater than a predetermined threshold, the page is considered a novel content page; if the URL contains the string view_book, it is considered a novel lobby page; and if the URL contains the string list or catalog, it is considered a novel list page.
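The URL rules for novel pages can be sketched as follows. The function name, the default threshold of 2, the order in which the rules are tried, and treating list and catalog as alternatives are all assumptions made for illustration.

```python
import re


def novel_subcategory_from_url(url, digit_string_threshold=2):
    """Guess the subcategory of a novel page from its URL, or return None."""
    digit_strings = re.findall(r"\d+", url)
    if len(digit_strings) > digit_string_threshold:
        return "novel content page"
    if "view_book" in url:
        return "novel lobby page"
    if "list" in url or "catalog" in url:
        return "novel list page"
    return None
```

When no rule fires the function returns `None`, leaving the decision to other discriminating features.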

3) Specific blocks: for example, a blog content page usually contains a publication time and author information, and a forum post page usually contains multiple reply blocks. In practice, if the number of reply blocks is greater than a predetermined threshold, the page is considered a forum post page.

As noted above, the subcategory of Web page X can be determined from a single discriminating feature or from a combination of two or more. For example, suppose the format category of Web page X is blog page. If its text link ratio is greater than the predetermined threshold, its subcategory is determined to be blogroll page; if not, it is further determined whether the page contains a publication time and author information, and if it does, its subcategory is determined to be blog content page. Combining two or more discriminating features makes the classification result more accurate.
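The blog example, which combines two discriminating features, reduces to a short decision function. The function name and the default threshold are hypothetical, and detection of the publication time and author block is assumed to happen elsewhere.

```python
def blog_subcategory(text_link_ratio, has_time_and_author, ratio_threshold=0.5):
    """Decide the subcategory of a blog page from two discriminating features."""
    # Feature 1: a high text link ratio indicates a list-like page.
    if text_link_ratio > ratio_threshold:
        return "blogroll page"
    # Feature 2: a publication time plus author block indicates a content page.
    if has_time_and_author:
        return "blog content page"
    return None  # undecided; the text does not cover this case
```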

The specific value of each threshold mentioned in the above embodiments can be chosen according to actual needs.

This completes the description of the method embodiments of the invention.

In short, with the scheme of the above method embodiments, for any Web page, the probabilities that the Web page belongs to the different format categories are determined from the obtained information reflecting its page layout features, and the format category with the highest probability is taken as the format category of the Web page. Compared with the prior art, the scheme of the above method embodiments does not depend on a list or on typical URL features and is applicable to arbitrary Web pages, and thus improves the accuracy of classification results. It is also simple to implement and easy to popularize. In addition, the two levels of format classification are independent of each other and can each serve needs of a different granularity; the scheme is highly extensible, the second-level classifications of the different format categories are independent of one another, and a third level or further levels can be added if needed.

Based on the above description, Fig. 4 is a schematic structural diagram of an embodiment of the apparatus of the invention for performing format classification on web pages. As shown in Fig. 4, the apparatus includes:

a first processing module, configured to, when any Web page needs to be classified, obtain information in the Web page that reflects page layout features and send it to a second processing module;

the second processing module, configured to determine, based on the obtained information, the probabilities that the Web page belongs to each of N preset format categories, where N is a positive integer greater than 1, and to take the format category with the highest probability as the format category of the Web page.

As shown in Fig. 4, the first processing module may specifically include:

a first processing unit, configured to build the DOM tree of the Web page;

a second processing unit, configured to extract the content source information and structure feature information in the Web page according to the DOM tree, where the content source information includes tags and short text, and the structure feature information includes the URL, secondary navigation and the title.

As shown in Fig. 4, the second processing module may specifically include:

a third processing unit, configured to generate a text vector from the content source information; to calculate, based on the text vector and using a pre-generated logistic regression model, the tendency degree of the Web page toward each format category; and to calculate, based on the structure feature information and the tendency degrees and using a pre-generated naive Bayes model, the probability that the Web page belongs to each format category;

a fourth processing unit, configured to take the format category with the highest probability as the format category of the Web page.

The third processing unit performs word segmentation on the content source information and generates an M-dimensional text vector, where M equals the number of feature words in a pre-generated text dictionary, each component of the text vector corresponds to one feature word in the text dictionary, and the text dictionary records feature words that reflect the page layout features of each format category. For each word segmentation result, it determines whether the result is identical to a feature word in the text dictionary, and if so, sets the corresponding component of the text vector to 1; all other components are 0.

In addition, the fourth processing unit may be further configured to determine whether the highest probability is greater than a predetermined threshold; if it is, to take the format category with the highest probability as the format category of the Web page; otherwise, to determine the format category of the Web page by means of a list plus typical URL features.

The apparatus shown in Fig. 4 may further include:

a third processing module, configured to, after the second processing module determines the format category of the Web page, determine the subcategory of the Web page based on at least one discriminating feature applicable to that format category, where the discriminating features include text link ratio, URL features, or specific blocks, and the format category of the Web page includes Z subcategories, Z being a positive integer greater than 1.

In addition, the format categories may include: blog page, novel page or forum page. The subcategories of blog page may include: blog content page or blogroll page. The subcategories of novel page may include: novel list page, novel content page or novel lobby page. The subcategories of forum page may include: forum post page or forum list page.

For the specific workflow of the apparatus embodiment shown in Fig. 4, refer to the corresponding description in the foregoing method embodiments; it is not repeated here.

In short, with the scheme of the above apparatus embodiment, for any Web page, the probabilities that the Web page belongs to the different format categories are determined from the obtained information reflecting its page layout features, and the format category with the highest probability is taken as the format category of the Web page. Compared with the prior art, the scheme of the above apparatus embodiment does not depend on a list or on typical URL features and is applicable to arbitrary Web pages, and thus improves the accuracy of classification results. It is also simple to implement and easy to popularize. In addition, the two levels of format classification are independent of each other and can each serve needs of a different granularity; the scheme is highly extensible, the second-level classifications of the different format categories are independent of one another, and a third level or further levels can be added if needed.

The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A method for performing format classification on web pages, characterized in that, when any Web page needs to be classified, the following processing is performed:

obtaining information in the Web page that reflects page layout features;

determining, based on the obtained information, the probabilities that the Web page belongs to each of N preset format categories, N being a positive integer greater than 1;

taking the format category with the highest probability as the format category of the Web page;

wherein obtaining the information in the Web page that reflects page layout features comprises:

building a Document Object Model (DOM) tree of the Web page;

extracting content source information and structure feature information in the Web page according to the DOM tree;

and wherein determining, based on the obtained information, the probabilities that the Web page belongs to each of N preset format categories comprises:

generating a text vector from the content source information;

calculating, based on the text vector and using a pre-generated logistic regression model, the tendency degree of the Web page toward each format category;

calculating, based on the structure feature information and the tendency degrees and using a pre-generated naive Bayes model, the probability that the Web page belongs to each format category.

2. The method according to claim 1, characterized in that the content source information includes: tags and short text;

the structure feature information includes: the Uniform Resource Locator (URL), secondary navigation and the title.

3. The method according to claim 1, characterized in that generating a text vector from the content source information comprises:

performing word segmentation on the content source information;

generating an M-dimensional text vector, where M equals the number of feature words in a pre-generated text dictionary, each component of the text vector corresponds to one feature word in the text dictionary, and the text dictionary records feature words that reflect the page layout features of each format category;

for each word segmentation result, determining whether it is identical to a feature word in the text dictionary, and if so, setting the corresponding component of the text vector to 1, the other components being 0.

4. The method according to claim 1, characterized in that, before taking the format category with the highest probability as the format category of the Web page, the method further comprises:

determining whether the highest probability is greater than a predetermined threshold;

if so, taking the format category with the highest probability as the format category of the Web page;

otherwise, determining the format category of the Web page by means of a list plus typical URL features.

5. The method according to any one of claims 1 to 4, characterized in that, after the format category of the Web page is determined, the method further comprises:

determining the subcategory of the Web page based on at least one discriminating feature applicable to the format category of the Web page;

wherein the discriminating features include: text link ratio, URL features, or specific blocks;

the format category of the Web page includes Z subcategories, Z being a positive integer greater than 1.

6. The method according to claim 5, characterized in that

the format categories include: blog page, novel page or forum page;

wherein the subcategories of the blog page include: blog content page or blogroll page;

the subcategories of the novel page include: novel list page, novel content page or novel lobby page;

the subcategories of the forum page include: forum post page or forum list page.

7. An apparatus for performing format classification on web pages, characterized by comprising:

a first processing module, configured to, when any Web page needs to be classified, obtain information in the Web page that reflects page layout features and send it to a second processing module;

the second processing module, configured to determine, based on the obtained information, the probabilities that the Web page belongs to each of N preset format categories, N being a positive integer greater than 1, and to take the format category with the highest probability as the format category of the Web page;

the first processing module comprising:

a first processing unit, configured to build a Document Object Model (DOM) tree of the Web page;

a second processing unit, configured to extract content source information and structure feature information in the Web page according to the DOM tree;

the second processing module comprising:

a third processing unit, configured to generate a text vector from the content source information; to calculate, based on the text vector and using a pre-generated logistic regression model, the tendency degree of the Web page toward each format category; and to calculate, based on the structure feature information and the tendency degrees and using a pre-generated naive Bayes model, the probability that the Web page belongs to each format category;

a fourth processing unit, configured to take the format category with the highest probability as the format category of the Web page.

8. The apparatus according to claim 7, characterized in that the content source information includes: tags and short text; the structure feature information includes: the Uniform Resource Locator (URL), secondary navigation and the title.

9. The apparatus according to claim 7, characterized in that

the third processing unit performs word segmentation on the content source information and generates an M-dimensional text vector, where M equals the number of feature words in a pre-generated text dictionary, each component of the text vector corresponds to one feature word in the text dictionary, and the text dictionary records feature words that reflect the page layout features of each format category; and, for each word segmentation result, determines whether it is identical to a feature word in the text dictionary, and if so, sets the corresponding component of the text vector to 1, the other components being 0.

10. The apparatus according to claim 7, characterized in that

the fourth processing unit is further configured to determine whether the highest probability is greater than a predetermined threshold; if so, to take the format category with the highest probability as the format category of the Web page; otherwise, to determine the format category of the Web page by means of a list plus typical URL features.

11. The apparatus according to any one of claims 7 to 10, characterized in that the apparatus further comprises:

a third processing module, configured to, after the second processing module determines the format category of the Web page, determine the subcategory of the Web page based on at least one discriminating feature applicable to the format category of the Web page; wherein the discriminating features include: text link ratio, URL features, or specific blocks; and the format category of the Web page includes Z subcategories, Z being a positive integer greater than 1.

12. The apparatus according to claim 11, characterized in that

the format categories include: blog page, novel page or forum page;