CN103377243B - Method and apparatus for performing format classification on web pages - Google Patents

Method and apparatus for performing format classification on web pages

Info

Publication number
CN103377243B
Authority
CN
China
Prior art keywords
page
web page
format classification
format
belonging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210127531.8A
Other languages
Chinese (zh)
Other versions
CN103377243A (en)
Inventor
蔡兵
黄钰
徐羽
张凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yayue Technology Co ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201210127531.8A
Publication of CN103377243A
Application granted
Publication of CN103377243B
Legal status: Active
Anticipated expiration


Abstract

The invention discloses a method and apparatus for performing format classification on web pages. When any Web page needs to be classified, the following processing is performed: information that reflects page layout features is obtained from the Web page; according to the obtained information, the probabilities that the Web page belongs to each of N preset format categories are determined, N being a positive integer greater than 1; and the format category corresponding to the largest probability is taken as the format category of the Web page. The scheme of the invention can improve the accuracy of classification results.

Description

Method and apparatus for performing format classification on web pages
Technical field
The present invention relates to Internet technology, and in particular to a method and apparatus for performing format classification on web pages.
Background art
At present, there are two main ways of classifying Web pages: content classification and format classification.
Content classification distinguishes pages by their main body content, for example news pages and question-and-answer pages. Format classification distinguishes pages by their main structural framework, for example blog pages and forum pages.
Research on content classification is relatively mature, whereas research on format classification is still somewhat lacking. In practice, the results of format classification can be used to build web page models, can provide reference information for page information extraction, and can be used to distinguish categories of search engine results, so format classification is of considerable significance.
In the prior art, format classification is mainly implemented by combining a list with typical uniform resource locator (URL) features, specifically as follows:
For any Web page X, its URL is first matched against the list. The list may contain a series of domain names and the format category corresponding to each. For example, if one domain name in the list is hi.baidu.com and its corresponding format category is blog page, then Web page X is determined to be a blog page if its URL contains "hi.baidu.com". If the format category of Web page X cannot be determined from the list, typical URL features are used instead. For example, if the URL of Web page X contains "bbs", Web page X is determined to be a forum page.
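The prior-art matching described above can be sketched in a few lines of Python. This is only an illustration: the two tables contain just the examples named in the text, and the function name `classify_by_list_and_url` is hypothetical.

```python
# Domain list: the hi.baidu.com entry is the example given in the text.
DOMAIN_LIST = {"hi.baidu.com": "blog page"}

# Typical URL features: "bbs" is the example given in the text.
URL_FEATURES = {"bbs": "forum page"}


def classify_by_list_and_url(url):
    """Return a format category, or None if neither table matches (hypothetical helper)."""
    lowered = url.lower()
    # First try the domain list.
    for domain, category in DOMAIN_LIST.items():
        if domain in lowered:
            return category
    # Then fall back to typical URL features.
    for token, category in URL_FEATURES.items():
        if token in lowered:
            return category
    return None
```

A URL that matches neither table returns `None`, which is exactly the coverage gap that the background section goes on to criticize.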
However, this approach has problems in practice. The list can cover only a very limited number of domain names, and the URLs of many Web pages contain no typical URL feature such as "bbs", so many Web pages cannot be classified correctly.
Summary of the invention
In view of this, the present invention provides a method and apparatus for performing format classification on web pages that can improve the accuracy of classification results.
To achieve the above object, the technical solution of the present invention is implemented as follows:
A method for performing format classification on web pages, in which the following processing is performed when any Web page needs to be classified:
obtaining, from the Web page, information that reflects page layout features;
determining, according to the obtained information, the probabilities that the Web page belongs to each of N preset format categories, N being a positive integer greater than 1;
taking the format category corresponding to the largest probability as the format category of the Web page.
An apparatus for performing format classification on web pages, comprising:
a first processing module, configured to perform the following processing when any Web page needs to be classified: obtaining, from the Web page, information that reflects page layout features, and sending the information to a second processing module;
the second processing module, configured to determine, according to the obtained information, the probabilities that the Web page belongs to each of N preset format categories, N being a positive integer greater than 1, and to take the format category corresponding to the largest probability as the format category of the Web page.
It can be seen that, with the scheme of the present invention, for any Web page, the probabilities that the Web page belongs to the different format categories can be determined from the obtained information reflecting its page layout features, and the format category corresponding to the largest probability is taken as the format category of the Web page. Compared with the prior art, the scheme does not depend on a list or on typical URL features and is applicable to any Web page, so it improves the accuracy of classification results. Moreover, the scheme is simple to implement and easy to popularize.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the method of the present invention for performing format classification on web pages.
Fig. 2 is a schematic diagram of the process of performing format classification on web pages according to the present invention.
Fig. 3 is a schematic diagram of the two-level format classification of the present invention.
Fig. 4 is a schematic structural diagram of an embodiment of the apparatus of the present invention for performing format classification on web pages.
Detailed description of the embodiments
To address the problems in the prior art, the present invention proposes an improved scheme for performing format classification on web pages.
To make the technical solution of the present invention clearer, the scheme is described in further detail below with reference to the accompanying drawings and embodiments.
Fig. 1 is a flowchart of an embodiment of the method of the present invention for performing format classification on web pages. When any Web page needs to be classified, it is processed according to the flow shown in Fig. 1.
Step 11: Obtain, from Web page X, information that reflects page layout features.
For ease of description, Web page X denotes any Web page.
In this step, the document object model (DOM) tree of Web page X is built first; the content source information and the structure feature information of Web page X are then extracted from the DOM tree.
The content source information may include tags and short texts; the structure feature information may include the URL, the secondary navigation and the title.
Page layout features are generally not reflected in long text such as body text and sentences. Therefore, only the short texts and tags in Web page X are extracted as content source information, and the URL of Web page X together with its secondary navigation and title are extracted as structure feature information. The title is the web page title of Web page X. A short text is a character string in the page's Hypertext Markup Language (HTML) source file that contains no punctuation and whose length is limited; such strings usually describe prompt information on the page.
How to build the DOM tree and how to extract the content source information and the structure feature information can follow the prior art and are not described further here.
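As a rough illustration of step 11, the sketch below collects tags, short texts and the title from an HTML string. It makes several assumptions that are not in the source: it uses Python's event-based `html.parser` as a stand-in for walking a DOM tree, the length limit `MAX_SHORT_TEXT_LEN` is an arbitrary value, the punctuation set is only a sample, and secondary navigation is omitted because the text does not say how it is located.

```python
import string
from html.parser import HTMLParser

MAX_SHORT_TEXT_LEN = 10  # assumed length limit for a "short text"
PUNCTUATION = set(string.punctuation) | set("，。！？；：、")  # sample punctuation set


class LayoutInfoExtractor(HTMLParser):
    """Collects tag names, short texts and the page title (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.short_texts = []
        self.title = ""
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title += text
        elif len(text) <= MAX_SHORT_TEXT_LEN and not (set(text) & PUNCTUATION):
            # Short text: no punctuation and limited length.
            self.short_texts.append(text)


def extract_layout_info(html, url):
    """Split the extracted data into content source and structure feature information."""
    parser = LayoutInfoExtractor()
    parser.feed(html)
    parser.close()
    return {
        "content_source": {"tags": parser.tags, "short_texts": parser.short_texts},
        "structure": {"url": url, "title": parser.title},
    }
```

Long sentences fail the length and punctuation checks and are dropped, which mirrors the observation that layout features are not reflected in long text.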
Step 12: Determine, according to the obtained information, the probabilities that Web page X belongs to each of N preset format categories, N being a positive integer greater than 1; take the format category corresponding to the largest probability as the format category of Web page X.
In this step, a text vector is first generated from the extracted content source information. It may be generated as follows: perform word segmentation on the extracted content source information; generate an M-dimensional text vector, where M equals the number of feature words in a pre-generated text dictionary, each component of the text vector corresponds to one feature word in the text dictionary, and the text dictionary records feature words that reflect the page layout features of each format category; for each word segmentation result, determine whether it is identical to a feature word in the text dictionary, and if so, set the component of the text vector corresponding to that feature word to 1; components with no matching segmentation result are 0. The text dictionary is usually compiled manually.
For example, suppose Web page X is a forum page, and word segmentation of the content source information extracted from Web page X yields the following results: post, reply, moderator, thread starter. Assuming all of these appear in the text dictionary, the components corresponding to them are set to 1.
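The text vector construction can be sketched as follows. The dictionary contents in `FEATURE_WORDS` are hypothetical (only the four forum words come from the example), and word segmentation is assumed to have been done already, so the function takes a list of tokens.

```python
# Hypothetical text dictionary with M = 6 feature words.
FEATURE_WORDS = ["post", "reply", "moderator", "thread starter", "chapter", "archive"]


def build_text_vector(tokens, feature_words=FEATURE_WORDS):
    """Return an M-dimensional 0/1 vector, one component per feature word."""
    token_set = set(tokens)
    return [1 if word in token_set else 0 for word in feature_words]
```

For the forum example, the tokens post, reply, moderator and thread starter set the first four components to 1 and leave the remaining two at 0. Tokens that are not in the dictionary are simply ignored.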
Next, based on the generated text vector, a pre-generated logistic regression model is used to calculate the tendency degree of Web page X toward each format category, yielding N results.
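One way to obtain N tendency degrees is a one-vs-rest arrangement with one logistic regression scorer per format category. The source does not say how the model is structured, so this arrangement, the function names and the weight layout are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def tendency_degrees(text_vector, weights, biases):
    """Score the text vector against each category (one-vs-rest is an assumption).

    weights: dict mapping category -> list of M trained weights
    biases:  dict mapping category -> trained bias term
    """
    return {
        category: sigmoid(biases[category] + sum(w * x for w, x in zip(ws, text_vector)))
        for category, ws in weights.items()
    }
```

In practice the weights and biases would come from the offline training mentioned in the text; here they are plain inputs.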
Then, based on the extracted structure feature information and the N calculated tendency degrees, a pre-generated naive Bayes model is used to calculate the probability that Web page X belongs to each format category, yielding N results.
The logistic regression model is a linear classification model that is fast and effective; the present invention uses it to determine the tendency degree of Web page X toward the different format categories. The naive Bayes model is a prediction model based on the assumption that features are independent of one another; the present invention uses it to determine the final format category probabilities. In the embodiments of the present invention, both the logistic regression model and the naive Bayes model are trained offline. How to train them is prior art, as is how to calculate the tendency degrees and the probabilities.
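The source does not specify how the tendency degrees enter the naive Bayes model. The sketch below assumes one possible encoding: each tendency degree is discretized into a "high" or "low" feature and combined with discrete structure features such as a URL token. The feature naming scheme, the `floor` smoothing value and both function names are hypothetical.

```python
import math


def tendency_feature(category, degree, cutoff=0.5):
    """Turn a tendency degree into a discrete feature (assumed encoding)."""
    level = "high" if degree >= cutoff else "low"
    return f"lr:{category}={level}"


def naive_bayes_posteriors(features, priors, likelihoods, floor=1e-6):
    """Compute P(category | features) under the feature independence assumption.

    priors:      dict mapping category -> P(category)
    likelihoods: dict mapping category -> {feature: P(feature | category)}
    floor:       probability used for features unseen in training (assumption)
    """
    log_scores = {}
    for category, prior in priors.items():
        score = math.log(prior)
        for feature in features:
            score += math.log(likelihoods[category].get(feature, floor))
        log_scores[category] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}
```

Working in log space avoids underflow when many features are multiplied together; the returned values sum to 1 across the N categories.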
In addition, as stated in step 12, after the probabilities that Web page X belongs to each format category have been calculated, the format category corresponding to the largest probability can be taken as the format category of Web page X. Alternatively, to further improve the accuracy of the classification result, it can first be determined whether the largest probability is greater than a predetermined threshold. If so, the format category corresponding to the largest probability is taken as the format category of Web page X; otherwise, the format category of Web page X is determined in the existing way, using the list plus typical URL features.
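The selection rule with the optional threshold check can be sketched as below. The default threshold of 0.6 is an arbitrary illustrative value, and `fallback` is a hypothetical callable standing in for the list plus typical URL feature method.

```python
def pick_category(probabilities, threshold=0.6, fallback=None):
    """Return the most probable category if it clears the threshold.

    probabilities: dict mapping category -> probability
    fallback:      optional zero-argument callable used when the threshold is not met
    """
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] > threshold:
        return best
    return fallback() if fallback is not None else None
```

Setting `threshold` to 0 reduces this to the plain rule of always taking the category with the largest probability.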
In summary, Fig. 2 is a schematic diagram of the process of performing format classification on web pages according to the present invention.
The process shown in Fig. 2 implements first-level format classification of web pages. On this basis, second-level format classification can also be performed.
Accordingly, after the format category of Web page X has been determined, the subcategory of Web page X can be further determined. The format category of Web page X is further divided into Z subcategories, Z being a positive integer greater than 1. The specific values of N and Z can be chosen according to actual needs.
Fig. 3 is a schematic diagram of the two-level format classification of the present invention. As shown in Fig. 3, the first-level format classification results (that is, the format categories) include: blog page, novel page and forum page. The second-level format classification results of blog page (that is, its subcategories) include: blog content page and blogroll page. The second-level format classification results of novel page (that is, its subcategories) include: novel list page, novel content page and novel lobby page. The second-level format classification results of forum page (that is, its subcategories) include: forum post page and forum list page.
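The two-level structure of Fig. 3 maps naturally onto a nested lookup table. The sketch below simply transcribes the categories named in the text; the names `FORMAT_TAXONOMY` and `subcategories_of` are hypothetical.

```python
# First-level format categories mapped to their second-level subcategories (from Fig. 3).
FORMAT_TAXONOMY = {
    "blog page": ["blog content page", "blogroll page"],
    "novel page": ["novel list page", "novel content page", "novel lobby page"],
    "forum page": ["forum post page", "forum list page"],
}


def subcategories_of(category):
    """Return the Z subcategories of a format category, or an empty list if unknown."""
    return FORMAT_TAXONOMY.get(category, [])
```

Adding a further first-level category, or a third level, only requires extending the table, which reflects the independence between levels that the scheme relies on.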
It should be noted that Fig. 3 is only an example and is not intended to limit the technical solution of the present invention. For example, depending on actual needs, the first-level format classification results may include others, such as news page, and second-level format classification can accordingly be performed on news pages as well.
The two levels of format classification are independent of each other and can meet requirements of different granularities. The scheme is also highly extensible: the second-level classifications are independent of one another as well, and a third level or more can be added if needed.
After the format category of Web page X has been determined, the subcategory of Web page X can be determined based on at least one distinguishing feature applicable to that format category.
The distinguishing features generally include: the text link ratio, URL features, or specific blocks.
1) Text link ratio: the ratio of the length of link text to the length of page text. It can be used to distinguish list pages from content pages; if the ratio is greater than a predetermined threshold, the page is regarded as a list page.
2) URL features: for example, the URL of a novel content page usually contains multiple numeric strings, the URL of a novel lobby page usually contains a character string such as view_book, and the URL of a novel list page usually contains character strings such as list and catalog. In practice, if the number of numeric strings in the URL is greater than a predetermined threshold, the page is regarded as a novel content page; if the URL contains the character string view_book, the page is regarded as a novel lobby page; if the URL contains the character strings list and catalog, the page is regarded as a novel list page.
3) Specific blocks: for example, a blog content page usually contains a publication time and author information, and a forum post page usually contains multiple reply blocks. In practice, if the number of reply blocks is greater than a predetermined threshold, the page is regarded as a forum post page.
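The first two distinguishing features can be sketched as follows. Several points are assumptions: both thresholds are arbitrary, the text link ratio is computed over all character data including any script text, and the order of the URL checks is a choice made here. The source wording does not make clear whether list and catalog must both be present, so the sketch accepts either.

```python
import re
from html.parser import HTMLParser


class _LinkTextCounter(HTMLParser):
    """Counts characters inside <a> elements and characters in the whole page."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.link_depth:
            self.link_chars += n


def text_link_ratio(html):
    """Ratio of link text length to page text length (0.0 for an empty page)."""
    counter = _LinkTextCounter()
    counter.feed(html)
    counter.close()
    return counter.link_chars / counter.total_chars if counter.total_chars else 0.0


def is_list_page(html, ratio_threshold=0.5):
    """A page dominated by link text is treated as a list page (threshold assumed)."""
    return text_link_ratio(html) > ratio_threshold


def novel_subcategory_from_url(url, digit_run_threshold=2):
    """Apply the URL rules for novel pages; returns None if no rule matches."""
    lowered = url.lower()
    if "view_book" in lowered:
        return "novel lobby page"
    if "list" in lowered or "catalog" in lowered:  # "either" is an assumption
        return "novel list page"
    if len(re.findall(r"\d+", lowered)) > digit_run_threshold:
        return "novel content page"
    return None
```

Counting reply blocks for the third feature would need site-specific markup knowledge, which the source does not describe, so it is left out here.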
As described above, the subcategory of Web page X can be determined based on a single distinguishing feature or on a combination of two or more. For example, suppose the format category of Web page X is blog page. If its text link ratio is greater than a predetermined threshold, its subcategory is determined to be blogroll page. If not, it is further determined whether the page contains a publication time and author information; if so, its subcategory is determined to be blog content page. Combining two or more distinguishing features makes the classification result more accurate.
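The blog example combines two distinguishing features in sequence, which can be written as a short decision function. The inputs are assumed to have been computed elsewhere, the threshold value is arbitrary, and the function name is hypothetical.

```python
def blog_subcategory(text_link_ratio, has_publish_time, has_author_info, ratio_threshold=0.5):
    """Decide the subcategory of a page already classified as a blog page."""
    # Feature 1: a high text link ratio indicates a list-style page.
    if text_link_ratio > ratio_threshold:
        return "blogroll page"
    # Feature 2: specific blocks (publication time and author information).
    if has_publish_time and has_author_info:
        return "blog content page"
    return None  # neither rule applies; the text does not define this case
```

Returning `None` when neither rule fires is a choice made for this sketch, since the text leaves that outcome open.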
The specific value of each threshold mentioned in the above embodiments can be chosen according to actual needs.
This completes the description of the method embodiment of the present invention.
In short, with the scheme of the above method embodiment, for any Web page, the probabilities that the Web page belongs to the different format categories can be determined from the obtained information reflecting its page layout features, and the format category corresponding to the largest probability is taken as the format category of the Web page. Compared with the prior art, the scheme of the above method embodiment does not depend on a list or on typical URL features and is applicable to any Web page, so it improves the accuracy of classification results. It is also simple to implement and easy to popularize. In addition, the two levels of format classification are independent of each other and can meet requirements of different granularities, and the scheme is highly extensible: the second-level classifications are independent of one another as well, and a third level or more can be added if needed.
Based on the above description, Fig. 4 is a schematic structural diagram of an embodiment of the apparatus of the present invention for performing format classification on web pages. As shown in Fig. 4, the apparatus includes:
a first processing module, configured to perform the following processing when any Web page needs to be classified: obtaining, from the Web page, information that reflects page layout features, and sending the information to a second processing module;
a second processing module, configured to determine, according to the obtained information, the probabilities that the Web page belongs to each of N preset format categories, N being a positive integer greater than 1, and to take the format category corresponding to the largest probability as the format category of the Web page.
As shown in Fig. 4, the first processing module may specifically include:
a first processing unit, configured to build the DOM tree of the Web page;
a second processing unit, configured to extract the content source information and the structure feature information of the Web page according to the DOM tree, where the content source information includes tags and short texts, and the structure feature information includes the URL, the secondary navigation and the title.
As shown in Fig. 4, the second processing module may specifically include:
a third processing unit, configured to generate a text vector according to the content source information; to calculate, according to the text vector and using a pre-generated logistic regression model, the tendency degree of the Web page toward each format category; and to calculate, according to the structure feature information and the tendency degrees and using a pre-generated naive Bayes model, the probability that the Web page belongs to each format category;
a fourth processing unit, configured to take the format category corresponding to the largest probability as the format category of the Web page.
The third processing unit performs word segmentation on the content source information and generates an M-dimensional text vector, where M equals the number of feature words in a pre-generated text dictionary, each component of the text vector corresponds to one feature word in the text dictionary, and the text dictionary records feature words that reflect the page layout features of each format category. For each word segmentation result, the unit determines whether it is identical to a feature word in the text dictionary; if so, the component of the text vector corresponding to that feature word is set to 1; components with no matching segmentation result are 0.
In addition, the fourth processing unit may be further configured to determine whether the largest probability is greater than a predetermined threshold; if so, to take the format category corresponding to the largest probability as the format category of the Web page; otherwise, to determine the format category of the Web page using the list plus typical URL features.
The apparatus shown in Fig. 4 may further include:
a third processing module, configured to determine, after the second processing module has determined the format category of the Web page, the subcategory of the Web page based on at least one distinguishing feature applicable to that format category, where the distinguishing features include the text link ratio, URL features, or specific blocks, and the format category of the Web page includes Z subcategories, Z being a positive integer greater than 1.
In addition, the format categories may include: blog page, novel page or forum page. The subcategories of blog page may include: blog content page or blogroll page. The subcategories of novel page may include: novel list page, novel content page or novel lobby page. The subcategories of forum page may include: forum post page or forum list page.
For the specific workflow of the apparatus embodiment shown in Fig. 4, refer to the corresponding description in the foregoing method embodiment; it is not repeated here.
In short, with the scheme of the above apparatus embodiment, for any Web page, the probabilities that the Web page belongs to the different format categories can be determined from the obtained information reflecting its page layout features, and the format category corresponding to the largest probability is taken as the format category of the Web page. Compared with the prior art, the scheme of the above apparatus embodiment does not depend on a list or on typical URL features and is applicable to any Web page, so it improves the accuracy of classification results. It is also simple to implement and easy to popularize. In addition, the two levels of format classification are independent of each other and can meet requirements of different granularities, and the scheme is highly extensible: the second-level classifications are independent of one another as well, and a third level or more can be added if needed.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (12)

1. A method for performing format classification on web pages, characterized in that, when any Web page needs to be classified, the following processing is performed:
obtaining, from the Web page, information that reflects page layout features;
determining, according to the obtained information, the probabilities that the Web page belongs to each of N preset format categories, N being a positive integer greater than 1;
taking the format category corresponding to the largest probability as the format category of the Web page;
wherein obtaining, from the Web page, information that reflects page layout features comprises:
building a document object model (DOM) tree of the Web page;
extracting content source information and structure feature information of the Web page according to the DOM tree;
and wherein determining, according to the obtained information, the probabilities that the Web page belongs to each of N preset format categories comprises:
generating a text vector according to the content source information;
calculating, according to the text vector and using a pre-generated logistic regression model, the tendency degree of the Web page toward each format category;
calculating, according to the structure feature information and the tendency degrees and using a pre-generated naive Bayes model, the probability that the Web page belongs to each format category.
2. The method according to claim 1, characterized in that the content source information comprises: tags and short texts;
the structure feature information comprises: a uniform resource locator (URL), secondary navigation and a title.
3. The method according to claim 1, characterized in that generating a text vector according to the content source information comprises:
performing word segmentation on the content source information;
generating an M-dimensional text vector, wherein M equals the number of feature words in a pre-generated text dictionary, each component of the text vector corresponds to one feature word in the text dictionary, and the text dictionary records feature words that reflect the page layout features of each format category;
for each word segmentation result, determining whether it is identical to a feature word in the text dictionary, and if so, setting the component of the text vector corresponding to that feature word to 1, components with no matching segmentation result being 0.
4. The method according to claim 1, characterized in that, before taking the format category corresponding to the largest probability as the format category of the Web page, the method further comprises:
determining whether the largest probability is greater than a predetermined threshold;
if so, taking the format category corresponding to the largest probability as the format category of the Web page;
otherwise, determining the format category of the Web page using a list plus typical URL features.
5. The method according to any one of claims 1 to 4, characterized in that, after the format category of the Web page is determined, the method further comprises:
determining the subcategory of the Web page based on at least one distinguishing feature applicable to the format category of the Web page;
wherein the distinguishing features comprise: a text link ratio, URL features, or specific blocks;
the format category of the Web page comprises Z subcategories, Z being a positive integer greater than 1.
6. The method according to claim 5, characterized in that
the format categories comprise: blog page, novel page or forum page;
wherein the subcategories of the blog page comprise: blog content page or blogroll page;
the subcategories of the novel page comprise: novel list page, novel content page or novel lobby page;
the subcategories of the forum page comprise: forum post page or forum list page.
7. a kind of device that format classification is carried out to webpage, it is characterised in that including:
First processing module, for when needing to classify to any Web page, carrying out following handle:Obtain the Web nets The information of page layout feature can be embodied in page, and is sent to Second processing module;
The Second processing module, it is set in advance N number of for determining that the Web page is belonging respectively to according to the information got The probability of different format classifications, N is the positive integer more than 1;It regard the maximum corresponding format classification of probability of value as the Web Format classification belonging to webpage;
The first processing module includes:
First processing units, the document object model dom tree for setting up the Web page;
Second processing unit, for extracting content source information and architectural feature in the Web page according to the dom tree Information;
The Second processing module includes:
3rd processing unit, for generating text vector according to the content source information;According to the text vector, using advance The Logic Regression Models of generation, calculate the tendency degree that the Web page corresponds to each format classification respectively;According to the knot Structure characteristic information and the tendency degree, using the model-naive Bayesian previously generated, calculate the Web page category respectively In the probability of each format classification;
Fourth processing unit, for regarding the maximum corresponding format classification of probability of value as the format belonging to the Web page Classification.
8. device according to claim 7, it is characterised in that the content source information includes:Label and short text;It is described Structure feature information includes:Uniform resource position mark URL, secondary navigation and title.
9. device according to claim 7, it is characterised in that
3rd processing unit carries out participle to the content source information;And generate the text vector of M dimensions, M value with it is pre- The Feature Words number that the text dictionary first generated includes is identical, and each component in the text vector corresponds respectively to the text Record has the page layout feature that can embody each format classification in a Feature Words in this dictionary, the text dictionary Feature Words;For each word segmentation result, determine if respectively it is identical with a Feature Words in the text dictionary, if It is that component corresponding with this feature word in the text vector is then set to 1, is otherwise 0.
10. device according to claim 7, it is characterised in that
The fourth processing unit is further used for, and determines whether the maximum probability of value is more than predetermined threshold;If it is, will The maximum corresponding format classification of probability of value is used as the format classification belonging to the Web page;Otherwise, according to list plus typical case The modes of URL features determine format classification belonging to the Web page.
11. The device according to any one of claims 7 to 10, characterized in that the device further comprises:
a third processing module, configured to determine, after the second processing module has determined the format classification to which the Web page belongs, the subclass to which the Web page belongs, based on at least one distinguishing feature applicable to that format classification; wherein the distinguishing features include: text-to-link ratio, URL features, or specific blocks; and the format classification to which the Web page belongs includes Z subclasses, Z being a positive integer greater than 1.
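One of the distinguishing features named above, the text-to-link ratio, can be sketched with the standard-library HTML parser. The exact definition of the ratio is not given in the claim, so this sketch assumes it is the share of text characters that lie outside `<a>` elements. The 0.5 threshold and the function names are hypothetical, and script or style content is not filtered out.

```python
from html.parser import HTMLParser


class _TextLinkCounter(HTMLParser):
    """Counts text characters inside and outside anchor elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.link_chars = 0
        self.plain_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if self.anchor_depth > 0:
            self.link_chars += n
        else:
            self.plain_chars += n


def plain_text_share(html):
    """Fraction of text characters that are not link text (0.0 for empty pages)."""
    counter = _TextLinkCounter()
    counter.feed(html)
    counter.close()  # flush any buffered text
    total = counter.link_chars + counter.plain_chars
    return counter.plain_chars / total if total else 0.0


def blog_subclass(html, threshold=0.5):
    """Assumed rule: mostly plain text means a content page, mostly links a list page."""
    return "content" if plain_text_share(html) > threshold else "list"


list_html = ('<ul><li><a href="/p/1">First post title</a></li>'
             '<li><a href="/p/2">Second post title</a></li></ul>')
content_html = ('<h1>My day</h1><p>' + 'Today I went hiking. ' * 10 +
                '</p><a href="/">Home</a>')

print(blog_subclass(list_html), blog_subclass(content_html))
```

A list page consists largely of link text, so its plain-text share is low, while a content page is dominated by body text.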
12. The device according to claim 11, characterized in that
the format classification includes: blog page, novel page, or forum page;
wherein the subclasses of the blog page include: blog content page or blog list page;
the subclasses of the novel page include: novel list page, novel content page, or novel lobby page;
the subclasses of the forum page include: forum post page or forum list page.
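The two-level hierarchy of format classifications and subclasses can be represented as a simple lookup table. The identifiers below are hypothetical short names chosen for illustration, not names taken from the patent.

```python
# Each format classification maps to its Z subclasses (Z > 1 in every case).
SUBCLASSES = {
    "blog": ("content", "list"),
    "novel": ("list", "content", "lobby"),
    "forum": ("post", "list"),
}


def is_valid_label(format_class, subclass):
    """Check that a (format classification, subclass) pair exists in the hierarchy."""
    return subclass in SUBCLASSES.get(format_class, ())


print(is_valid_label("novel", "lobby"))  # True
print(is_valid_label("blog", "lobby"))   # False
```

Keeping the hierarchy in one table makes it easy to validate the output of a subclass classifier against the format classification chosen in the first stage.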
CN201210127531.8A 2012-04-27 2012-04-27 A kind of method and apparatus that format classification is carried out to webpage Active CN103377243B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210127531.8A CN103377243B (en) 2012-04-27 2012-04-27 A kind of method and apparatus that format classification is carried out to webpage

Publications (2)

Publication Number Publication Date
CN103377243A CN103377243A (en) 2013-10-30
CN103377243B true CN103377243B (en) 2017-09-08

Family

ID=49462369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210127531.8A Active CN103377243B (en) 2012-04-27 2012-04-27 A kind of method and apparatus that format classification is carried out to webpage

Country Status (1)

Country Link
CN (1) CN103377243B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834640A (en) * 2014-02-10 2015-08-12 腾讯科技(深圳)有限公司 Webpage identification method and apparatus
CN106203454B (en) * 2016-07-25 2019-05-21 重庆中科云从科技有限公司 The method and device of certificate format analysis

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593200A (en) * 2009-06-19 2009-12-02 淮海工学院 Chinese Web page classification method based on the keyword frequency analysis
CN101872347A (en) * 2009-04-22 2010-10-27 富士通株式会社 Method and device for judging type of webpage
CN102411587A (en) * 2010-09-21 2012-04-11 腾讯科技(深圳)有限公司 Webpage classification method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
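A web page type identification method based on comprehensive features [一种基于综合特征的网页类型识别方法]; Chen Han et al.; Journal of Information Engineering University; 2011-12-31; p. 739 paras. 5-7, p. 740 paras. 3-6 and Figs. 2-3, p. 741 para. 7 and Fig. 4, p. 742 para. 6 *
Research on a naive Bayes-based multi-classifier for Chinese maritime texts [基于朴素贝叶斯的中文海事文本多分类器研究]; Yuan Wensheng et al.; Computer and Modernization; 2011-05-31; pp. 150-153 *
Coordinated classification of Web pages using the Naive Bayes method [用Naive Bayes方法协调分类Web网页]; Fan Yan et al.; Journal of Software; 2001-09-30; pp. 1386-1392 *
Bayesian text classification combined with Chinese word segmentation [结合中文分词的贝叶斯文本分类]; Wei Xiaoning; Journal of Suzhou Vocational University; 2008-03-31; p. 104 col. 2 paras. 1-2, p. 105 col. 2 paras. 1-5 *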

Also Published As

Publication number Publication date
CN103377243A (en) 2013-10-30

Similar Documents

Publication Publication Date Title
CN101661513B (en) Detection method of network focus and public sentiment
CN105045778B (en) A kind of Chinese homonym mistake auto-collation
CN110598000A (en) Relationship extraction and knowledge graph construction method based on deep learning model
CN103678310B (en) The sorting technique and device of Web page subject
CN109145216A (en) Network public-opinion monitoring method, device and storage medium
CN103810251B (en) Method and device for extracting text
CN102253937B (en) Method and related device for acquiring information of interest in webpages
CN103034626A (en) Emotion analyzing system and method
CN101887443B (en) Method and device for classifying texts
CN103744905A (en) Junk mail judgment method and device
El-Halees Mining opinions in user-generated contents to improve course evaluation
CN102682120B (en) Method and device for acquiring essential article commented on network
CN103874994A (en) Method and apparatus for automatically summarizing the contents of electronic documents
CN103294781A (en) Method and equipment used for processing page data
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
CN104199845B (en) Line Evaluation based on agent model discusses sensibility classification method
CN108334493A (en) A kind of topic knowledge point extraction method based on neural network
CN101661468B (en) Method for extracting post metadata from forum post list pages
CN105183715A (en) Word distribution and document feature based automatic classification method for spam comments
WO2013178193A2 (en) Text content extraction method and device
US20120221545A1 (en) Isolating desired content, metadata, or both from social media
CN103377243B (en) A kind of method and apparatus that format classification is carried out to webpage
US20140281878A1 (en) Aligning Annotation of Fields of Documents
CN107145591A (en) A kind of effective content metadata extracting method of webpage based on title
CN111241270A (en) Resume processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 2022-11-17

Address after: 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6, Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518133

Patentee after: Shenzhen Yayue Technology Co.,Ltd.

Address before: Room 403, East, Building 2, SEG Science and Technology Park, Zhenxing Road, Futian District, Shenzhen, Guangdong 518044

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.
