CN103377243B - Method and apparatus for classifying web pages by format - Google Patents
Method and apparatus for classifying web pages by format
- Publication number: CN103377243B (application CN201210127531.8A; also published as CN103377243A)
- Authority: CN (China)
- Prior art keywords
- page
- web page
- format classification
- format
- belonging
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a method and apparatus for classifying web pages by format. When any Web page needs to be classified, the following processing is performed: information that reflects the page layout features of the Web page is obtained; from the obtained information, the probability that the Web page belongs to each of N preset format categories is determined, N being a positive integer greater than 1; and the format category with the highest probability is taken as the format category of the Web page. The scheme of the invention can improve the accuracy of the classification result.
Description
Technical field
The present invention relates to Internet technology, and in particular to a method and apparatus for classifying web pages by format.
Background
At present, Web pages are mainly classified in two ways: content classification and format classification.
Content classification distinguishes pages by their main body content and yields categories such as news pages and question-and-answer pages. Format classification distinguishes pages by their main structural framework and yields categories such as blog pages and forum pages.
Research on content classification is relatively mature, whereas research on format classification is still lacking. In practice, the result of format classification can be used to build web page models, can provide reference information for page information extraction, and can be used to distinguish categories of search-engine results, so it is of considerable significance.
In the prior art, format classification is mainly implemented with a name list combined with typical Uniform Resource Locator (URL) features, as follows.
For any Web page X, its URL is first matched against the list. The list may contain a series of domain names and the format category corresponding to each. For example, if one domain name in the list is hi.baidu.com and its format category is blog page, then a Web page X whose URL contains "hi.baidu.com" is determined to be a blog page. If the list cannot determine the format category of Web page X, typical URL features are used next: for example, if the URL of Web page X contains "bbs", Web page X is determined to be a forum page.
In practice, however, this approach has a problem: the domain names that the list can cover are very limited, and the URLs of many Web pages contain no typical URL feature such as "bbs", so many Web pages cannot be classified correctly.
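The prior-art approach described above can be sketched roughly as follows. This is only an illustration: the function name `classify_by_url` and the two lookup tables are hypothetical, and only the entries "hi.baidu.com" and "bbs" come from the text.

```python
from typing import Optional

# Hypothetical name list: domain name -> format category.
DOMAIN_LIST = {"hi.baidu.com": "blog page"}
# Hypothetical typical URL features: substring -> format category.
URL_FEATURES = {"bbs": "forum page"}


def classify_by_url(url: str) -> Optional[str]:
    """Return a format category from the list or URL features, else None."""
    lowered = url.lower()
    # First try the name list of known domains.
    for domain, category in DOMAIN_LIST.items():
        if domain in lowered:
            return category
    # Then fall back to typical URL features.
    for feature, category in URL_FEATURES.items():
        if feature in lowered:
            return category
    # Neither matched: the page cannot be classified by this approach.
    return None
```

A URL that matches neither table yields `None`, which is the coverage gap the background section points out.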
Summary of the invention
In view of this, the present invention provides a method and apparatus for classifying web pages by format that can improve the accuracy of the classification result.
To achieve this, the technical solution of the invention is implemented as follows.
A method for classifying web pages by format, in which the following processing is performed when any Web page needs to be classified:
obtaining information in the Web page that reflects its page layout features;
determining, from the obtained information, the probability that the Web page belongs to each of N preset format categories, N being a positive integer greater than 1;
taking the format category with the highest probability as the format category of the Web page.
An apparatus for classifying web pages by format, comprising:
a first processing module, configured to perform the following processing when any Web page needs to be classified: obtain information in the Web page that reflects its page layout features, and send it to a second processing module;
the second processing module, configured to determine, from the obtained information, the probability that the Web page belongs to each of N preset format categories, N being a positive integer greater than 1, and to take the format category with the highest probability as the format category of the Web page.
It can be seen that, with the scheme of the invention, for any Web page the probability that it belongs to each format category can be determined from the obtained information reflecting its page layout features, and the format category with the highest probability is taken as the format category of the Web page. Compared with the prior art, the scheme of the invention does not depend on a list or on typical URL features and is applicable to any Web page, which improves the accuracy of the classification result. Moreover, the scheme is simple to implement and easy to popularize.
Brief description of the drawings
Fig. 1 is a flowchart of a method embodiment of the invention for classifying web pages by format.
Fig. 2 is a schematic diagram of the process by which the invention classifies web pages by format.
Fig. 3 is a schematic diagram of the two-level format classification of the invention.
Fig. 4 is a schematic structural diagram of an apparatus embodiment of the invention for classifying web pages by format.
Detailed description
To address the problems in the prior art, the present invention proposes an improved scheme for classifying web pages by format.
To make the technical solution of the invention clearer, the scheme is described in further detail below with reference to the drawings and embodiments.
Fig. 1 is a flowchart of a method embodiment of the invention for classifying web pages by format. When any Web page needs to be classified, it is processed according to the flow shown in Fig. 1.
Step 11: Obtain information in Web page X that reflects its page layout features.
For ease of description, Web page X denotes any Web page.
In this step, a Document Object Model (DOM) tree of Web page X can first be built; the content source information and structural feature information in Web page X are then extracted from the DOM tree.
The content source information may include tags and short texts; the structural feature information may include the URL, the secondary navigation and the title.
Page layout features are generally not reflected in long text such as body text and sentences. Therefore, only the short texts, tags and the like in Web page X are extracted as content source information, and the URL, secondary navigation, title and the like of Web page X are extracted as structural feature information. The title is the web page title of Web page X. A short text is a character string in the page's Hypertext Markup Language (HTML) source file that contains no punctuation and has a limited length; it is typically used for prompt information on the page.
How to build the DOM tree and how to extract the content source information and structural feature information may follow the prior art and is not described further here.
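As a rough illustration of the extraction in step 11, the sketch below collects tag names, short texts and the page title. It is not the patent's implementation: it uses Python's standard-library event parser instead of building a full DOM tree, and the class name `LayoutInfoExtractor`, the length limit of 12 characters and the punctuation set are all assumptions.

```python
import string
from html.parser import HTMLParser

MAX_SHORT_TEXT_LEN = 12  # assumed length limit for a "short text"
PUNCTUATION = set(string.punctuation) | set("，。！？；：、")


class LayoutInfoExtractor(HTMLParser):
    """Collects tag names, punctuation-free short texts and the page title."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.short_texts = []
        self.title = ""
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title += text
        elif len(text) <= MAX_SHORT_TEXT_LEN and not (set(text) & PUNCTUATION):
            # Keep only short strings without punctuation.
            self.short_texts.append(text)
```

Calling `feed()` with the HTML source fills `tags`, `short_texts` and `title`; the URL and secondary navigation would have to be supplied separately.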
Step 12: From the obtained information, determine the probability that Web page X belongs to each of N preset format categories, N being a positive integer greater than 1; take the format category with the highest probability as the format category of Web page X.
In this step, a text vector can first be generated from the extracted content source information, as follows. The extracted content source information is segmented into words. An M-dimensional text vector is generated, where M equals the number of feature words in a pre-generated text dictionary and each component of the text vector corresponds to one feature word in the dictionary; the dictionary records feature words that reflect the page layout features of each format category. For each segmentation result, it is determined whether the result is identical to a feature word in the text dictionary; if so, the component of the text vector corresponding to that feature word is set to 1, and otherwise it remains 0. The text dictionary is usually compiled manually.
For example, suppose Web page X is a forum page, and segmenting the content source information extracted from Web page X yields the following results: post, reply, moderator, original poster. If all of these segmentation results appear in the text dictionary, the components corresponding to them are set to 1.
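The text vector generation described above can be sketched as follows. The function name `build_text_vector` and the six-word dictionary are hypothetical, and word segmentation is assumed to have been done already.

```python
def build_text_vector(tokens, dictionary):
    """Return an M-dimensional 0/1 vector, M being the dictionary size."""
    index = {word: i for i, word in enumerate(dictionary)}
    vector = [0] * len(dictionary)
    for token in tokens:
        position = index.get(token)
        if position is not None:
            # The token is a feature word: set its component to 1.
            vector[position] = 1
    return vector


# Hypothetical dictionary of feature words (M = 6).
TEXT_DICTIONARY = ["post", "reply", "moderator", "original poster",
                   "blogger", "chapter"]
forum_tokens = ["post", "reply", "moderator", "original poster"]
forum_vector = build_text_vector(forum_tokens, TEXT_DICTIONARY)
```

With these inputs the first four components are 1 and the last two are 0.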
Next, from the generated text vector, a pre-generated logistic regression model can be used to calculate the tendency degree of Web page X toward each format category, yielding N results.
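One way to picture the tendency calculation is a separate logistic function per format category. The patent does not state how the N tendency degrees are produced, so the one-model-per-category layout, the function names and the weights in the usage note are all assumptions.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def tendency_degrees(vector, models):
    """Score a text vector against one (weights, bias) pair per category.

    `models` maps a category name to (weights, bias); the values would come
    from offline training and are made up in the usage example.
    """
    degrees = {}
    for category, (weights, bias) in models.items():
        z = sum(w * x for w, x in zip(weights, vector)) + bias
        degrees[category] = sigmoid(z)
    return degrees
```

For instance, with `models = {"forum page": ([2.0, 1.0, -1.0], -1.0), "blog page": ([-1.0, -0.5, 2.0], 0.0)}` and the vector `[1, 1, 0]`, the forum page receives the higher tendency degree.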
Then, from the extracted structural feature information and the N calculated tendency degrees, a pre-generated naive Bayes model can be used to calculate the probability that Web page X belongs to each format category, yielding N results.
The logistic regression model is a linear classification model that is fast and effective; the invention uses it to determine the tendency degree of Web page X toward the different format categories. The naive Bayes model is a prediction model based on the assumption that features are independent of one another; the invention uses it to determine the final page format category probabilities. In the embodiments of the invention, both the logistic regression model and the naive Bayes model are trained offline. How to train them is prior art, as is how to calculate the tendency degrees and probabilities.
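The naive Bayes step can be illustrated with a generic posterior calculation over discrete features. How the structural features and tendency degrees are actually encoded is not specified in the text; here they are assumed to be binary indicators (for example "URL contains bbs" or "forum tendency is high"), and all names and probabilities are hypothetical.

```python
import math


def naive_bayes_posteriors(features, priors, likelihoods):
    """Return P(category | features) under the feature-independence assumption.

    features:    {feature_name: observed_value}
    priors:      {category: P(category)}
    likelihoods: {category: {feature_name: {value: P(value | category)}}}
    """
    log_scores = {}
    for category, prior in priors.items():
        score = math.log(prior)
        for name, value in features.items():
            score += math.log(likelihoods[category][name][value])
        log_scores[category] = score
    # Normalise in a numerically stable way.
    top = max(log_scores.values())
    unnormalised = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalised.values())
    return {c: v / total for c, v in unnormalised.items()}
```

Working in log space avoids underflow when many features are multiplied together.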
As stated in step 12, after the probability that Web page X belongs to each format category has been calculated, the format category with the highest probability can be taken as the format category of Web page X. Alternatively, to further improve the accuracy of the classification result, it can first be determined whether the highest probability exceeds a predetermined threshold. If it does, the format category with the highest probability is taken as the format category of Web page X; otherwise, the format category of Web page X is determined in the existing way, using the list plus typical URL features.
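The thresholded decision can be sketched as below. The function name `choose_category` is hypothetical, and `fallback` stands in for the existing list-plus-URL-feature method; the threshold value is left to the caller because the text does not fix it.

```python
def choose_category(probabilities, threshold, fallback):
    """Pick the most probable category if it clears the threshold.

    probabilities: {category: probability}
    fallback:      zero-argument callable used when confidence is too low
    """
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] > threshold:
        return best
    # Not confident enough: defer to the list / URL-feature method.
    return fallback()
```

Passing the fallback as a callable keeps the probabilistic classifier independent of how the list-based method is implemented.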
In summary, Fig. 2 is a schematic diagram of the process by which the invention classifies web pages by format.
The process shown in Fig. 2 performs first-level format classification of a web page. On this basis, second-level format classification can also be performed.
Accordingly, after the format category of Web page X is determined, the subcategory of Web page X can be further determined. The format category of Web page X is further divided into Z subcategories, Z being a positive integer greater than 1. The specific values of N and Z can be chosen according to actual needs.
Fig. 3 is a schematic diagram of the two-level format classification of the invention. As shown in Fig. 3, the first-level format classification results (the format categories) include blog page, novel page and forum page. The second-level results for blog page (its subcategories) include blog content page and blog list page; those for novel page include novel list page, novel content page and novel lobby page; those for forum page include forum post page and forum list page.
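The two-level scheme of Fig. 3 maps naturally onto a small lookup table. The sketch below only restates the categories listed above; the names `FORMAT_TAXONOMY` and `subcategories` are hypothetical.

```python
# First-level format category -> second-level subcategories (from Fig. 3).
FORMAT_TAXONOMY = {
    "blog page": ["blog content page", "blog list page"],
    "novel page": ["novel list page", "novel content page", "novel lobby page"],
    "forum page": ["forum post page", "forum list page"],
}


def subcategories(category):
    """Return the subcategories of a format category (empty if unknown)."""
    return FORMAT_TAXONOMY.get(category, [])


N = len(FORMAT_TAXONOMY)                    # number of first-level categories
Z_NOVEL = len(subcategories("novel page"))  # Z for the novel page category
```

Adding another first-level category, such as a news page, would only require a new key in the table.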
It should be noted that Fig. 3 is only an example and does not limit the technical solution of the invention. For instance, depending on actual needs, the first-level format classification results may include other categories, such as news page, and second-level format classification can accordingly be applied to news pages as well.
The two levels of format classification are independent of each other and can each serve needs of a different granularity. The scheme is also highly extensible: the second-level classifications are independent of one another as well, and a third level or more can be added if needed.
After the format category of Web page X is determined, the subcategory of Web page X can be determined from at least one discriminating feature suited to that format category.
The discriminating features typically include the text link ratio, URL features or specific blocks.
1) Text link ratio: the ratio of the length of the link text to the length of the page text. It can be used to distinguish list pages from content pages: if the ratio exceeds a predetermined threshold, the page is regarded as a list page.
2) URL features: for example, the URL of a novel content page usually contains multiple numeric strings, the URL of a novel lobby page usually contains the character string view_book, and the URL of a novel list page usually contains the character strings list and catalog. In practice, therefore, if the number of numeric strings in the URL exceeds a predetermined threshold, the page is regarded as a novel content page; if the URL contains view_book, it is regarded as a novel lobby page; if the URL contains list and catalog, it is regarded as a novel list page.
3) Specific blocks: for example, a blog content page usually contains a publication time and author information, and a forum post page usually contains multiple reply blocks. In practice, if the number of reply blocks exceeds a predetermined threshold, the page is regarded as a forum post page.
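The first two discriminating features above can be sketched as follows. The function names and the default threshold are assumptions. The text mentions "list and catalog" without saying whether both strings must be present; this sketch treats either one as sufficient, which is also an assumption.

```python
import re


def text_link_ratio(link_text_length, page_text_length):
    """Ratio of link-text length to page-text length (0.0 for empty pages)."""
    if page_text_length == 0:
        return 0.0
    return link_text_length / page_text_length


def novel_subcategory_by_url(url, digit_string_threshold=2):
    """Guess the subcategory of a novel page from its URL, or return None."""
    # Count maximal runs of digits in the URL.
    if len(re.findall(r"\d+", url)) > digit_string_threshold:
        return "novel content page"
    if "view_book" in url:
        return "novel lobby page"
    # Assumption: either string is enough to indicate a list page.
    if "list" in url or "catalog" in url:
        return "novel list page"
    return None
```

For example, a URL with three separate numeric segments is treated as a content page under the default threshold of 2.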
As described above, the subcategory of Web page X can be determined from a single discriminating feature or from a combination of two or more. For example, suppose the format category of Web page X is blog page. If its text link ratio exceeds the predetermined threshold, its subcategory is determined to be blog list page. If the ratio does not exceed the threshold, it can further be determined whether the page contains a publication time and author information; if so, its subcategory is determined to be blog content page. Combining two or more discriminating features makes the classification result more accurate.
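The blog page example above combines two discriminating features in sequence, which can be written as a short decision function. The function name and the default threshold of 0.5 are assumptions.

```python
def blog_subcategory(link_ratio, has_publish_time_and_author,
                     ratio_threshold=0.5):
    """Determine the subcategory of a page already classified as a blog page."""
    # Feature 1: a high text link ratio indicates a list page.
    if link_ratio > ratio_threshold:
        return "blog list page"
    # Feature 2: a publication time and author block indicates a content page.
    if has_publish_time_and_author:
        return "blog content page"
    # Neither feature is decisive.
    return None
```

The text does not say what happens when neither feature applies, so the sketch returns `None` in that case.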
The specific value of each threshold mentioned in the above embodiment can be chosen according to actual needs.
This completes the description of the method embodiment of the invention.
In short, with the scheme of the above method embodiment, for any Web page the probability that it belongs to each format category can be determined from the obtained information reflecting its page layout features, and the format category with the highest probability is taken as the format category of the Web page. Compared with the prior art, the scheme of the method embodiment does not depend on a list or on typical URL features and is applicable to any Web page, which improves the accuracy of the classification result. It is also simple to implement and easy to popularize. In addition, the two levels of format classification are independent of each other and can each serve needs of a different granularity; the scheme is highly extensible, the second-level classifications are independent of one another, and a third level or more can be added if needed.
Based on the above, Fig. 4 is a schematic structural diagram of an apparatus embodiment of the invention for classifying web pages by format. As shown in Fig. 4, the apparatus comprises:
a first processing module, configured to perform the following processing when any Web page needs to be classified: obtain information in the Web page that reflects its page layout features, and send it to the second processing module;
a second processing module, configured to determine, from the obtained information, the probability that the Web page belongs to each of N preset format categories, N being a positive integer greater than 1, and to take the format category with the highest probability as the format category of the Web page.
As shown in Fig. 4, the first processing module may specifically include:
a first processing unit, configured to build the DOM tree of the Web page;
a second processing unit, configured to extract the content source information and structural feature information in the Web page from the DOM tree, where the content source information includes tags and short texts, and the structural feature information includes the URL, the secondary navigation and the title.
As shown in Fig. 4, the second processing module may specifically include:
a third processing unit, configured to generate a text vector from the content source information; to calculate, from the text vector and using a pre-generated logistic regression model, the tendency degree of the Web page toward each format category; and to calculate, from the structural feature information and the tendency degrees and using a pre-generated naive Bayes model, the probability that the Web page belongs to each format category;
a fourth processing unit, configured to take the format category with the highest probability as the format category of the Web page.
The third processing unit segments the content source information into words and generates an M-dimensional text vector, where M equals the number of feature words in a pre-generated text dictionary, each component of the text vector corresponds to one feature word in the dictionary, and the dictionary records feature words that reflect the page layout features of each format category. For each segmentation result, the unit determines whether it is identical to a feature word in the text dictionary; if so, the component of the text vector corresponding to that feature word is set to 1, and otherwise it is 0.
In addition, the fourth processing unit may be further configured to determine whether the highest probability exceeds a predetermined threshold. If it does, the format category with the highest probability is taken as the format category of the Web page; otherwise, the format category of the Web page is determined using the list plus typical URL features.
The apparatus shown in Fig. 4 may further include:
a third processing module, configured to determine, after the second processing module has determined the format category of the Web page, the subcategory of the Web page from at least one discriminating feature suited to that format category, where the discriminating features include the text link ratio, URL features or specific blocks, and the format category of the Web page includes Z subcategories, Z being a positive integer greater than 1.
In addition, the format categories may include blog page, novel page or forum page. The subcategories of blog page may include blog content page or blog list page; the subcategories of novel page may include novel list page, novel content page or novel lobby page; the subcategories of forum page may include forum post page or forum list page.
For the specific workflow of the apparatus embodiment shown in Fig. 4, refer to the corresponding description in the method embodiment above; it is not repeated here.
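The module structure of Fig. 4 can be pictured as three pluggable stages. This is only an architectural sketch under stated assumptions: the class `FormatClassifier`, its field names and the callable signatures are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class FormatClassifier:
    # First processing module: HTML source -> extracted layout information.
    first_module: Callable[[str], dict]
    # Second processing module: layout information -> probability per category.
    second_module: Callable[[dict], Dict[str, float]]
    # Optional third processing module: (category, information) -> subcategory.
    third_module: Optional[Callable[[str, dict], Optional[str]]] = None

    def classify(self, html: str):
        info = self.first_module(html)
        probabilities = self.second_module(info)
        category = max(probabilities, key=probabilities.get)
        subcategory = None
        if self.third_module is not None:
            subcategory = self.third_module(category, info)
        return category, subcategory
```

Because each stage is an ordinary callable, the third module can be omitted when only first-level classification is needed.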
In short, with the scheme of the above apparatus embodiment, for any Web page the probability that it belongs to each format category can be determined from the obtained information reflecting its page layout features, and the format category with the highest probability is taken as the format category of the Web page. Compared with the prior art, the scheme of the apparatus embodiment does not depend on a list or on typical URL features and is applicable to any Web page, which improves the accuracy of the classification result. It is also simple to implement and easy to popularize. In addition, the two levels of format classification are independent of each other and can each serve needs of a different granularity; the scheme is highly extensible, the second-level classifications are independent of one another, and a third level or more can be added if needed.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (12)
1. A method for classifying web pages by format, characterized in that, when any Web page needs to be classified, the following processing is performed:
obtaining information in the Web page that reflects its page layout features;
determining, from the obtained information, the probability that the Web page belongs to each of N preset format categories, N being a positive integer greater than 1;
taking the format category with the highest probability as the format category of the Web page;
wherein obtaining the information in the Web page that reflects its page layout features comprises:
building a Document Object Model (DOM) tree of the Web page;
extracting content source information and structural feature information in the Web page from the DOM tree;
and wherein determining, from the obtained information, the probability that the Web page belongs to each of N preset format categories comprises:
generating a text vector from the content source information;
calculating, from the text vector and using a pre-generated logistic regression model, the tendency degree of the Web page toward each format category;
calculating, from the structural feature information and the tendency degrees and using a pre-generated naive Bayes model, the probability that the Web page belongs to each format category.
2. The method according to claim 1, characterized in that the content source information includes tags and short texts, and the structural feature information includes the Uniform Resource Locator (URL), the secondary navigation and the title.
3. The method according to claim 1, characterized in that generating a text vector from the content source information comprises:
segmenting the content source information into words;
generating an M-dimensional text vector, where M equals the number of feature words in a pre-generated text dictionary, each component of the text vector corresponds to one feature word in the text dictionary, and the text dictionary records feature words that reflect the page layout features of each format category;
for each segmentation result, determining whether it is identical to a feature word in the text dictionary and, if so, setting the component of the text vector corresponding to that feature word to 1, the component otherwise being 0.
4. The method according to claim 1, characterized in that, before taking the format category with the highest probability as the format category of the Web page, the method further comprises:
determining whether the highest probability exceeds a predetermined threshold;
if so, taking the format category with the highest probability as the format category of the Web page;
otherwise, determining the format category of the Web page using a list plus typical URL features.
5. The method according to any one of claims 1 to 4, characterized in that, after the format category of the Web page is determined, the method further comprises:
determining the subcategory of the Web page from at least one discriminating feature suited to the format category of the Web page;
wherein the discriminating features include the text link ratio, URL features or specific blocks;
and the format category of the Web page includes Z subcategories, Z being a positive integer greater than 1.
6. The method according to claim 5, characterized in that
the format categories include blog page, novel page or forum page;
wherein the subcategories of the blog page include blog content page or blog list page;
the subcategories of the novel page include novel list page, novel content page or novel lobby page;
the subcategories of the forum page include forum post page or forum list page.
7. An apparatus for classifying web pages by format, characterized by comprising:
a first processing module, configured to perform the following processing when any Web page needs to be classified: obtain information in the Web page that reflects its page layout features, and send it to a second processing module;
the second processing module, configured to determine, from the obtained information, the probability that the Web page belongs to each of N preset format categories, N being a positive integer greater than 1, and to take the format category with the highest probability as the format category of the Web page;
wherein the first processing module includes:
a first processing unit, configured to build a Document Object Model (DOM) tree of the Web page;
a second processing unit, configured to extract content source information and structural feature information in the Web page from the DOM tree;
and the second processing module includes:
a third processing unit, configured to generate a text vector from the content source information; to calculate, from the text vector and using a pre-generated logistic regression model, the tendency degree of the Web page toward each format category; and to calculate, from the structural feature information and the tendency degrees and using a pre-generated naive Bayes model, the probability that the Web page belongs to each format category;
a fourth processing unit, configured to take the format category with the highest probability as the format category of the Web page.
8. The apparatus according to claim 7, characterized in that the content source information includes tags and short texts, and the structural feature information includes the Uniform Resource Locator (URL), the secondary navigation and the title.
9. The apparatus according to claim 7, characterized in that
the third processing unit segments the content source information into words and generates an M-dimensional text vector, where M equals the number of feature words in a pre-generated text dictionary, each component of the text vector corresponds to one feature word in the text dictionary, and the text dictionary records feature words that reflect the page layout features of each format category; and, for each segmentation result, determines whether it is identical to a feature word in the text dictionary and, if so, sets the component of the text vector corresponding to that feature word to 1, the component otherwise being 0.
10. The apparatus according to claim 7, characterized in that
the fourth processing unit is further configured to determine whether the highest probability exceeds a predetermined threshold; if so, to take the format category with the highest probability as the format category of the Web page; and otherwise, to determine the format category of the Web page using a list plus typical URL features.
11. The apparatus according to any one of claims 7 to 10, characterized in that the apparatus further comprises:
a third processing module, configured to determine, after the second processing module has determined the format category of the Web page, the subcategory of the Web page from at least one discriminating feature suited to the format category of the Web page; wherein the discriminating features include the text link ratio, URL features or specific blocks, and the format category of the Web page includes Z subcategories, Z being a positive integer greater than 1.
12. The apparatus according to claim 11, characterized in that
the format categories include blog page, novel page or forum page;
wherein the subcategories of the blog page include blog content page or blog list page;
the subcategories of the novel page include novel list page, novel content page or novel lobby page;
the subcategories of the forum page include forum post page or forum list page.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210127531.8A CN103377243B (en) | 2012-04-27 | 2012-04-27 | A kind of method and apparatus that format classification is carried out to webpage |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103377243A (en) | 2013-10-30 |
CN103377243B (en) | 2017-09-08 |
Family
ID=49462369
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210127531.8A Active CN103377243B (en) | 2012-04-27 | 2012-04-27 | A kind of method and apparatus that format classification is carried out to webpage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103377243B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104834640A (en) * | 2014-02-10 | 2015-08-12 | 腾讯科技(深圳)有限公司 | Webpage identification method and apparatus |
CN106203454B (en) * | 2016-07-25 | 2019-05-21 | 重庆中科云从科技有限公司 | The method and device of certificate format analysis |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101593200A (en) * | 2009-06-19 | 2009-12-02 | 淮海工学院 | Chinese Web page classification method based on the keyword frequency analysis |
CN101872347A (en) * | 2009-04-22 | 2010-10-27 | 富士通株式会社 | Method and device for judging type of webpage |
CN102411587A (en) * | 2010-09-21 | 2012-04-11 | 腾讯科技(深圳)有限公司 | Webpage classification method and device |
Non-Patent Citations (4)
Title |
---|
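A web page type recognition method based on comprehensive features; Chen Han (陈翰) et al.; Journal of Information Engineering University (《信息工程大学学报》); 20111231; p. 739 paras. 5-7, p. 740 paras. 3-6 and Figs. 2-3, p. 741 para. 7 and Fig. 4, p. 742 para. 6 * |
Research on a naive-Bayes-based multi-class classifier for Chinese maritime texts; Yuan Wensheng (袁文生) et al.; Computer and Modernization (《计算机与现代化》); 20110531; 150-153 * |
Coordinated classification of Web pages using the Naive Bayes method; Fan Yan (范焱) et al.; Journal of Software (《软件学报》); 20010930; 1386-1392 * |
Bayesian text classification combined with Chinese word segmentation; Wei Xiaoning (魏晓宁); Journal of Suzhou Vocational University (《苏州市职业大学学报》); 20080331; p. 104 col. 2 paras. 1-2, p. 105 col. 2 paras. 1-5 * |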
Also Published As
Publication number | Publication date |
---|---|
CN103377243A (en) | 2013-10-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101661513B (en) | Detection method of network focus and public sentiment | |
CN105045778B (en) | A kind of Chinese homonym mistake auto-collation | |
CN110598000A (en) | Relationship extraction and knowledge graph construction method based on deep learning model | |
CN103678310B (en) | The sorting technique and device of Web page subject | |
CN109145216A (en) | Network public-opinion monitoring method, device and storage medium | |
CN103810251B (en) | Method and device for extracting text | |
CN102253937B (en) | Method and related device for acquiring information of interest in webpages | |
CN103034626A (en) | Emotion analyzing system and method | |
CN101887443B (en) | Method and device for classifying texts | |
CN103744905A (en) | Junk mail judgment method and device | |
El-Halees | Mining opinions in user-generated contents to improve course evaluation | |
CN102682120B (en) | Method and device for acquiring essential article commented on network | |
CN103874994A (en) | Method and apparatus for automatically summarizing the contents of electronic documents | |
CN103294781A (en) | Method and equipment used for processing page data | |
CN110457579B (en) | Webpage denoising method and system based on cooperative work of template and classifier | |
CN104199845B (en) | Line Evaluation based on agent model discusses sensibility classification method | |
CN108334493A (en) | A kind of topic knowledge point extraction method based on neural network | |
CN101661468B (en) | Method for extracting post metadata from forum post list pages | |
CN105183715A (en) | Word distribution and document feature based automatic classification method for spam comments | |
WO2013178193A2 (en) | Text content extraction method and device | |
US20120221545A1 (en) | Isolating desired content, metadata, or both from social media | |
CN103377243B (en) | A kind of method and apparatus that format classification is carried out to webpage | |
US20140281878A1 (en) | Aligning Annotation of Fields of Documents | |
CN107145591A (en) | A kind of effective content metadata extracting method of webpage based on title | |
CN111241270A (en) | Resume processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 2022-11-17. Address after: 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6, Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518133. Patentee after: Shenzhen Yayue Technology Co.,Ltd. Address before: Room 403, East Block 2, SEG Science and Technology Park, Zhenxing Road, Futian District, Shenzhen, Guangdong 518044. Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. |