CN103309862A - Webpage type recognition method and system - Google Patents

Webpage type recognition method and system

Info

Publication number
CN103309862A
Authority
CN
China
Prior art keywords
type
webpage
news
content
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012100580243A
Other languages
Chinese (zh)
Other versions
CN103309862B (en)
Inventor
蔡兵
彭默
徐羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201210058024.3A
Publication of CN103309862A
Application granted
Publication of CN103309862B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a webpage type recognition method and system. The method comprises the following steps: calculating a content type alignment value of a webpage according to the text content of the webpage; extracting the webpage structural characteristics of the webpage; and utilizing the content type alignment value and the webpage structural characteristics to recognize the type of the webpage. With the method and system, the webpage is classified by jointly considering the text content dimension and the webpage structure dimension, so that the classification accuracy is higher. Moreover, through data filtering, noise in the webpage that is unrelated to the type being recognized, such as tags, links and advertisements, can be effectively eliminated, so that the classification effect is better.

Description

Webpage type recognition method and system
Technical field
Embodiments of the present invention relate to the technical field of Internet applications and, more specifically, to a webpage type recognition method and system.
Background
With the rapid development of computer and network technology, the Internet plays an increasingly important role in daily life, study, and work. According to the latest Internet development survey report published by CNNIC, the number of Internet users in China has reached 513 million; there were 60 billion Chinese webpages in 2010, and at least 1 trillion webpages worldwide.
The webpages on the Internet contain vast and varied information, and classifying them accurately to support downstream work is a serious challenge. For example, in web advertising, showing advertisements relevant to the webpage type greatly improves the user click-through rate. In addition, with the development of the mobile Internet over the last two years, demand for mobile reading has surged, and news is undoubtedly one of the types users care about most. If news webpages can be identified, cleaner data can be supplied to mobile reading applications, and page extraction can also benefit.
At present, the prior art usually identifies text content with a naive Bayes text classification method, which mainly comprises labeling training samples, using the words of the text as features, and estimating the class of the text statistically.
First, the prior art classifies mainly, and indeed only, according to webpage content, so classification accuracy is not high. Second, compared with webpages on the Internet, the data sources used for text classification are too simple to be practical.
Summary of the invention
Embodiments of the present invention propose a webpage type recognition method to improve webpage classification accuracy.
Embodiments of the present invention also propose a webpage type recognition system to improve webpage classification accuracy.
The specific scheme of the embodiments of the present invention is as follows:
A webpage type recognition method, comprising:
calculating a content type propensity value of a webpage according to the text content of the webpage;
extracting webpage structure features of the webpage;
identifying the type of the webpage using the content type propensity value and the webpage structure features.
A webpage type recognition system, comprising a content type propensity value calculation unit, a structure feature extraction unit, and a type identification unit, wherein:
the content type propensity value calculation unit is configured to calculate a content type propensity value of a webpage according to the text content of the webpage;
the structure feature extraction unit is configured to extract webpage structure features of the webpage;
the type identification unit is configured to identify the type of the webpage using the content type propensity value and the webpage structure features.
As the above technical solution shows, in embodiments of the present invention the content type propensity value of a webpage is calculated according to its text content, the webpage structure features are extracted, and the type of the webpage is then identified using both. A webpage is thus first classified along two dimensions, one based on text content and the other based on webpage structure, and the two results are finally combined to determine its category. Because both the text content dimension and the webpage structure dimension are considered, classification accuracy is higher.
Description of drawings
Fig. 1 is a flowchart of the webpage type recognition method according to an embodiment of the present invention;
Fig. 2 is an exemplary flowchart of the webpage type recognition method according to an embodiment of the present invention;
Fig. 3 is a structural diagram of the webpage type recognition system according to an embodiment of the present invention.
Detailed description of embodiments
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
In embodiments of the present invention, a webpage is classified along two dimensions: one based on text content, the other based on webpage structure. The classification results of the two dimensions are then combined to determine the category of the webpage.
Fig. 1 is a flowchart of the webpage type recognition method according to an embodiment of the present invention.
As shown in Fig. 1, the method comprises:
Step 101: calculate the content type propensity value of the webpage according to the text content of the webpage.
Here, the webpage type is preliminarily classified along the text content dimension. This classification mainly uses a statistical machine learning classification algorithm which, from training samples and features, calculates the probability that a page is of a particular type (such as the news type).
Specifically, a dictionary can first be used to segment the text content of the webpage into words, and the weight of each word feature is calculated to form a feature vector. The content type propensity value of this feature vector is then calculated by a preset webpage content classifier; the calculated value can serve as the probability that the webpage is of the type represented by that classifier.
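The dictionary-based segmentation described above can be sketched as follows. The description does not say which segmentation algorithm is used; forward maximum matching is assumed here purely for illustration, and the function name, the `max_word_len` parameter, and the sample dictionary are all hypothetical.

```python
def segment(text, dictionary, max_word_len=4):
    """Forward maximum matching (an assumed algorithm, not stated in the source):
    at each position, take the longest substring found in the dictionary."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            i += 1  # no dictionary word starts here; skip one character
        else:
            tokens.append(match)
            i += len(match)
    return tokens


# Hypothetical feature dictionary.
FEATURE_WORDS = {"新闻", "记者", "报道", "新闻网"}
tokens = segment("新闻网记者报道", FEATURE_WORDS)
```

Only words present in the dictionary are emitted, so the dictionary doubles as the feature set: characters that belong to no dictionary word contribute nothing to the feature vector.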
Besides text, a webpage usually contains much other irrelevant content. Experiments show that using only the complete sentences in the webpage as the classification data source effectively removes noise such as tags, links, and advertisements, and improves classification. Therefore, in one embodiment, before the dictionary is used to segment the text content of the webpage, sentences whose length is less than a predetermined value can be filtered out of the text content to strengthen the classification effect.
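A minimal sketch of this sentence filter is given below. The set of sentence delimiters and the default `min_length` of 10 characters are assumptions; the description only says that sentences shorter than a predetermined value are removed.

```python
import re

# Sentence-ending punctuation, Chinese and Western (an assumed delimiter set).
_SENTENCE_END = re.compile(r"[。！？!?.\n]+")


def keep_long_sentences(text, min_length=10):
    """Split text into sentences and keep those of at least min_length characters."""
    sentences = (s.strip() for s in _SENTENCE_END.split(text))
    return [s for s in sentences if len(s) >= min_length]


page_text = "Home. Login. The city council approved the new transit budget on Monday. More"
kept = keep_long_sentences(page_text)
```

Short fragments such as navigation labels and link captions rarely reach the length threshold, which is why this simple rule removes much of the page noise. Splitting on every period is crude (it also breaks decimal numbers), so a production filter would likely need a more careful sentence splitter.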
Moreover, to reduce the cost of manually labeling data, various websites (for example, some news websites) can be used as entry points for crawling data, and after a simple manual review a large amount (for example, thousands) of news data is obtained. Words are then used as classification features, and dimensionality is reduced with algorithms such as feature selection.
In another embodiment, the classifier can use the logistic regression classification algorithm to calculate the content type propensity value of the feature vector. Logistic regression is a linear classifier that computes quickly and is well suited to real-time classification scenarios.
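As a rough illustration of how a trained logistic regression model turns a feature vector into a propensity value, consider the sketch below. The weights and bias are assumed to come from training, which is not shown; the sparse dictionary representation and the scaling to 0-100 (the range used in the examples later in this description) are illustrative choices.

```python
import math


def propensity(features, weights, bias=0.0):
    """Score a sparse feature vector {term: value} with a logistic model.

    Returns a value between 0 and 100. The weights are assumed to have been
    learned beforehand from labeled samples.
    """
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp from overflowing
    return 100.0 / (1.0 + math.exp(-z))


# Hypothetical learned weights: positive for news-like terms, negative otherwise.
score = propensity({"reporter": 0.5, "forum": 0.1}, {"reporter": 2.0, "forum": -1.0})
```

Scoring is a single weighted sum followed by one exponential, which is consistent with the remark that logistic regression suits real-time classification.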
In one embodiment, the term frequency-inverse document frequency (TF-IDF) weighting algorithm can be used to calculate the weights of the word features.
TF-IDF is a weighting technique commonly used in information retrieval and data mining to assess how important a word is to one document in a document collection or corpus. Under TF-IDF, the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus.
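The weighting can be sketched as follows, assuming one common TF-IDF variant: term frequency normalized by document length, multiplied by the natural logarithm of the inverse document frequency. The description does not specify which variant is used, so the exact formula here is an assumption.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


vectors = tf_idf([["news", "report", "news"], ["forum", "report"]])
```

In this tiny corpus "report" appears in every document, so its weight is zero, while "news" is frequent in the first document and absent from the second, so it receives the largest weight there.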
Various forms of TF-IDF weighting are often used by search engines as a measure or ranking of the relevance between a document and a user query. Besides TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.
Step 102: extract the webpage structure features of the webpage.
Here, the webpage type is preliminarily classified along the webpage structure dimension. Specifically, a Document Object Model (DOM) tree can first be built for the webpage, and webpage structure features are then extracted by traversing the DOM tree to serve as the basis for structural classification.
According to the W3C DOM specification, the DOM is an interface independent of browser, platform, and language that gives access to the standard components of a page. The DOM resolved the conflict between Netscape's JavaScript and Microsoft's JScript, giving web designers and developers a standard way to access the data, scripts, and presentation-layer objects of a site. The DOM is a set of nodes or pieces of information organized in a hierarchy. This hierarchy lets developers navigate the tree to find specific information. Analyzing this structure usually requires loading the whole document and constructing the hierarchy before any work can be done. Because it is based on a hierarchy of information, the DOM is said to be tree-based or object-based.
For example, the webpage structure features extracted by traversing the DOM tree can include:
1) URL features. For example, if the URL ends with index.html or similar, the page can basically be judged to be an index page. If the URL contains "content" or a date, the page is more likely to be a content page.
2) Text-link ratio. The ratio of the length of plain text (Pure Text) in the webpage to the length of link text (Anchor).
3) Maximum text length. The length of the longest single passage of text in the webpage, used as a length threshold for content pages.
4) Longest continuous text ratio. The proportion of the webpage's total text length accounted for by its most concentrated text. In general, the text of a content page is concentrated in one place, whereas on pages such as topic pages the text, although long in total, is relatively dispersed.
5) Secondary navigation information.
6) Webpage title, etc.
Although some specific webpage structure features are listed in detail above, those skilled in the art will appreciate that the webpage structure features actually adopted are not limited to these, and neither is the protection scope of the embodiments of the present invention.
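Several of the features listed above can be approximated with a short sketch built on Python's standard `html.parser`. This is only an illustration under simplifying assumptions: each text run between tags is treated as one "passage", total text length is taken as plain text plus link text, the date pattern in the URL is a guess, and secondary navigation is not covered. The class name, function name, and feature keys are hypothetical.

```python
import re
from html.parser import HTMLParser


class StructureFeatures(HTMLParser):
    """Collects link text, non-link text, and the title while streaming the HTML."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.stack = []         # currently open tags
        self.plain_blocks = []  # text runs outside <a>
        self.anchor_len = 0     # total characters of link text
        self.title = ""

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or self.SKIP & set(self.stack):
            return
        if "title" in self.stack:
            self.title += text
        elif "a" in self.stack:
            self.anchor_len += len(text)
        else:
            self.plain_blocks.append(text)


def extract_features(html, url=""):
    parser = StructureFeatures()
    parser.feed(html)
    parser.close()
    plain_len = sum(len(t) for t in parser.plain_blocks)
    longest = max((len(t) for t in parser.plain_blocks), default=0)
    total_len = plain_len + parser.anchor_len
    return {
        "url_is_index": url.rstrip("/").endswith(("index.html", "index.htm")),
        "url_has_date": bool(re.search(r"/\d{4}[-/]?\d{2}[-/]?\d{2}", url)),
        # None when the page has no link text at all.
        "text_link_ratio": plain_len / parser.anchor_len if parser.anchor_len else None,
        "max_text_length": longest,
        "longest_text_ratio": longest / total_len if total_len else 0.0,
        "title": parser.title,
    }


sample = ("<html><head><title>City news</title><script>var x = 1;</script></head>"
          "<body><a href='/'>Home</a><p>Council approves budget</p><p>Short</p></body></html>")
features = extract_features(sample, "http://example.com/2012/03/07/a.html")
```

A content page would be expected to show a high text-link ratio and a high longest-text ratio, while an index page dominated by links would show the opposite.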
Step 103: identify the type of the webpage using the content type propensity value and the webpage structure features.
Here, based on the content type propensity value calculated in step 101 and the webpage structure features extracted in step 102, the type of the page is finally determined through judgment criteria built from preset thresholds and combination strategies for the individual features.
For example, when step 101 calculates a news type propensity value of the webpage according to its text content, the judgment criteria can specifically include:
1) When the news type propensity value is greater than a preset first news type threshold, the type of the webpage is directly judged to be news.
For example, suppose the news type propensity value ranges from 0 to 100, the calculated news type propensity value is 90, and the first news type threshold is 85. Because the calculated value is greater than the first threshold, the webpage can be considered highly related to news, and its type can be directly judged to be news without considering the webpage structure features.
2) When the news type propensity value is greater than a preset second news type threshold and the webpage structure features contain news category information, the type of the webpage is judged to be news, where the first news type threshold is greater than the second news type threshold.
For example, suppose the news type propensity value ranges from 0 to 100, the calculated value is 70, the first news type threshold is 85, and the second news type threshold is 60. Because the calculated value is less than the first threshold, the webpage cannot be directly judged to be of the news type. But because the value is greater than the second threshold, the webpage can be considered related to the news type, so the calculated propensity value and the webpage structure features must be combined to decide whether the webpage is of the news type. In this case, if the webpage structure features also contain news category information (for example, the webpage title contains "news"), the type of the webpage can be judged to be news.
When the calculated news type propensity value is less than the second news type threshold, the webpage can be directly regarded as unrelated to the news type.
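The two-threshold rule above can be sketched as a small function. The default thresholds 85 and 60 and the title check for "news" are taken from the examples in the text; the function name and the choice to treat a value exactly equal to a threshold as not exceeding it are assumptions, since the description does not address equality.

```python
def is_news(propensity, title, first_threshold=85, second_threshold=60):
    """Two-threshold rule: a high score alone suffices; a middling score
    also needs structural evidence (here, "news" in the page title)."""
    if first_threshold <= second_threshold:
        raise ValueError("first_threshold must exceed second_threshold")
    if propensity > first_threshold:
        return True
    if propensity > second_threshold:
        return "news" in title.lower()
    return False


decisions = [is_news(90, ""), is_news(70, "Daily News"), is_news(70, "Forum"), is_news(50, "Daily News")]
```

The four calls mirror the cases discussed above: a score of 90 is accepted outright, a score of 70 is accepted only when the title supports it, and a score of 50 is rejected regardless of structure.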
In embodiments of the present invention, for webpages of the news type, the final recognition accuracy can reach more than 95%, with a recall rate above 80%.
Although embodiments of the present invention are described in detail above using the news type as an example, those skilled in the art will recognize from this teaching that the applicable webpage types are not limited to the news type and can include multiple types such as the knowledge Q&A type, the forum discussion type, or the online transaction type.
In the above method flow, there is no strict requirement on the execution order of step 101 and step 102. In fact, step 101 and step 102 can be executed simultaneously; step 101 can be executed first and then step 102; or step 102 can be executed first and then step 101.
Moreover, after the webpage type has been identified by the above flow, many kinds of applications can build on it.
For example: the advertisement relevance of the webpage can be calculated based on the identified type; personalized news recommendation can be performed for the webpage; structured webpage data can be extracted from the webpage; or data screening for reading applications can be performed for the webpage, and so on.
Based on the above detailed analysis, an exemplary flow of the present invention is described below, taking as an example the determination of whether a webpage is of the news type.
Fig. 2 is an exemplary flowchart of the webpage type recognition method according to an embodiment of the present invention.
As shown in Fig. 2, the operations on a webpage fall into two branches. The left branch comprises steps 201, 202, and 203; the right branch comprises steps 204 and 205. The two branches converge at step 206. The left branch comprises:
Step 201: perform data filtering. To prevent webpage noise, only the longer sentences in the webpage are extracted as text; here, sentences whose length is less than a predetermined value can be filtered out of the text content to strengthen the classification effect.
Step 202: segment the text using the feature set dictionary, then calculate the weight of each word feature (using the feature set and a feature weighting method such as TF-IDF) to form a feature vector.
Step 203: feed the feature vector to the classifier to obtain an output value (ranging from 0 to 100), namely the news content type propensity value, which expresses the degree to which the content tends to be news. The classifier can be obtained in advance from training samples and features by the logistic regression algorithm.
The right branch comprises:
Step 204: build the DOM tree. This includes building the DOM tree from the HTML tags of the webpage, including information such as the attributes of each tag.
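Step 204 can be sketched with the standard `html.parser` module, as below. This is a simplified stand-in for a real DOM implementation: it records tags, attributes, and text, closes the nearest matching open element on an end tag, and treats a small assumed set of void elements as childless. `Node`, `TreeBuilder`, `build_dom`, and `walk` are hypothetical names.

```python
from html.parser import HTMLParser

# Elements that never have children (a partial, assumed list).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])  # tag attributes, e.g. {"id": "main"}
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.open_nodes = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.open_nodes[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.open_nodes.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element, if any.
        for i in range(len(self.open_nodes) - 1, 0, -1):
            if self.open_nodes[i].tag == tag:
                del self.open_nodes[i:]
                break

    def handle_data(self, data):
        self.open_nodes[-1].text += data


def build_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node, depth=0):
    """Depth-first traversal yielding (node, depth) pairs."""
    yield node, depth
    for child in node.children:
        yield from walk(child, depth + 1)


root = build_dom('<div id="main"><p class="lead">Hi</p><br><a href="/x">go</a></div>')
tags = [node.tag for node, _ in walk(root)]
```

Structural features can then be computed during a single traversal with `walk`, for example by summing the text length under `a` nodes versus all other nodes.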
Step 205: extract structural features, such as secondary navigation and text-link ratio, based on the DOM tree.
The left and right branches converge at step 206: combined judgment. Using the output of step 203 and the output of step 205, a preset strategy is applied to make the best determination of whether the page is a news content page.
Based on the above detailed discussion, embodiments of the present invention also propose a webpage type recognition system.
Fig. 3 is a structural diagram of the webpage type recognition system according to an embodiment of the present invention.
As shown in Fig. 3, the system comprises a content type propensity value calculation unit 301, a structure feature extraction unit 302, and a type identification unit 303.
The content type propensity value calculation unit 301 is configured to calculate the content type propensity value of a webpage according to the text content of the webpage;
the structure feature extraction unit 302 is configured to extract the webpage structure features of the webpage;
the type identification unit 303 is configured to identify the type of the webpage using the content type propensity value and the webpage structure features.
In one embodiment, the system further comprises a type processing unit (not shown in the figure). The type processing unit is configured to perform at least one of the following steps: calculating the advertisement relevance of the webpage based on the identified webpage type; performing personalized news recommendation for the webpage based on the identified webpage type; extracting structured webpage data from the webpage based on the identified webpage type; or performing data screening for reading applications for the webpage based on the identified webpage type.
Specifically, the content type propensity value calculation unit 301 is configured to segment the text content of the webpage using a dictionary, calculate the weights of the word features to form a feature vector, and calculate the content type propensity value of this feature vector with a preset webpage content classifier.
Preferably, the content type propensity value calculation unit 301 is further configured to filter out of the text content, before the dictionary is used to segment it, the sentences whose length is less than a predetermined value.
Specifically, the structure feature extraction unit 302 is configured to build the Document Object Model (DOM) tree of the webpage and to extract the webpage structure features from the DOM tree.
In one embodiment, the content type propensity value calculation unit 301 is configured to calculate a news type propensity value of the webpage according to its text content. In this case the type identification unit 303 is configured to perform at least one of the following steps: when the news type propensity value is greater than a preset first news type threshold, directly judging the type of the webpage to be news; or when the news type propensity value is greater than a preset second news type threshold and the webpage structure features contain news category information, judging the type of the webpage to be news; where the first news type threshold is greater than the second news type threshold.
Similarly, the webpage types to which the webpage type recognition system in embodiments of the present invention applies are not limited to the news type and can include the knowledge Q&A type, the forum discussion type, the online transaction type, and so on.
In summary, in embodiments of the present invention the content type propensity value of a webpage is calculated according to its text content, the webpage structure features are extracted, and the type of the webpage is then identified using both. A webpage is thus classified along two dimensions, one based on text content and the other based on webpage structure, and the two results are finally combined to determine its category. Because both dimensions are considered, classification accuracy is higher.
Moreover, in embodiments of the present invention, data filtering effectively removes noise in the webpage that is unrelated to the type being identified, such as tags, links, and advertisements, so the classification effect is better.
The above are only preferred embodiments of the present invention and are not intended to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (17)

1. A webpage type recognition method, characterized in that the method comprises:
calculating a content type propensity value of a webpage according to the text content of the webpage;
extracting webpage structure features of the webpage;
identifying the type of the webpage using the content type propensity value and the webpage structure features.
2. The webpage type recognition method according to claim 1, characterized in that the method further comprises at least one of the following steps:
calculating the advertisement relevance of the webpage based on the identified webpage type;
performing personalized news recommendation for the webpage based on the identified webpage type;
extracting structured webpage data from the webpage based on the identified webpage type; or
performing data screening for reading applications for the webpage based on the identified webpage type.
3. The webpage type recognition method according to claim 1, characterized in that calculating the content type propensity value of the webpage according to the text content of the webpage specifically comprises:
segmenting the text content of the webpage using a dictionary, and calculating the weights of the word features to form a feature vector;
calculating the content type propensity value of the feature vector with a preset webpage content classifier.
4. The webpage type recognition method according to claim 3, characterized in that, before the text content of the webpage is segmented using the dictionary, the method further comprises: filtering out of the text content the sentences whose length is less than a predetermined value.
5. The webpage type recognition method according to claim 3, characterized in that calculating the weights of the word features comprises: calculating the weights of the word features using the term frequency-inverse document frequency (TF-IDF) weighting algorithm.
6. The webpage type recognition method according to claim 3, characterized in that, in the method:
the webpage content classifier calculates the content type propensity value of the feature vector using the logistic regression classification algorithm.
7. The webpage type recognition method according to claim 1, characterized in that extracting the webpage structure features of the webpage specifically comprises:
building the Document Object Model (DOM) tree of the webpage;
extracting the webpage structure features from the DOM tree.
8. The webpage type recognition method according to claim 7, characterized in that the webpage structure features comprise at least one of the following:
secondary navigation information;
text-link ratio;
uniform resource locator (URL);
webpage title;
maximum text length; or
longest continuous text ratio.
9. The webpage type recognition method according to claim 1, characterized in that:
calculating the content type propensity value of the webpage according to the text content of the webpage is specifically: calculating a news type propensity value of the webpage according to the text content of the webpage; wherein:
identifying the type of the webpage using the news type propensity value and the webpage structure features specifically comprises at least one of the following steps:
when the news type propensity value is greater than a preset first news type threshold, directly judging the type of the webpage to be news; or
when the news type propensity value is greater than a preset second news type threshold and the webpage structure features contain news category information, judging the type of the webpage to be news;
wherein the first news type threshold is greater than the second news type threshold.
10. The webpage type recognition method according to claim 1, characterized in that the type of the webpage comprises the news type, the knowledge Q&A type, the forum discussion type, or the online transaction type.
11. A webpage type recognition system, characterized in that the system comprises a content type propensity value calculation unit, a structure feature extraction unit, and a type identification unit, wherein:
the content type propensity value calculation unit is configured to calculate a content type propensity value of a webpage according to the text content of the webpage;
the structure feature extraction unit is configured to extract webpage structure features of the webpage;
the type identification unit is configured to identify the type of the webpage using the content type propensity value and the webpage structure features.
12. The webpage type recognition system according to claim 11, characterized in that the system further comprises a type processing unit configured to perform at least one of the following steps:
calculating the advertisement relevance of the webpage based on the identified webpage type;
performing personalized news recommendation for the webpage based on the identified webpage type;
extracting structured webpage data from the webpage based on the identified webpage type; or
performing data screening for reading applications for the webpage based on the identified webpage type.
13. type of webpage recognition system according to claim 11 is characterized in that,
Described content type propensity value computing unit is used for utilizing dictionary that the content of text of webpage is carried out participle, and the weight of calculating participle feature is to form proper vector; And according to the content type propensity value of this proper vector of web page contents classifier calculated that sets in advance.
14. type of webpage recognition system according to claim 11 is characterized in that,
Described content type propensity value computing unit was further used for before the content of text that utilizes dictionary to webpage carries out participle, and the whole sentence of elimination length is less than the sentence of predetermined value from described content of text.
15. type of webpage recognition system according to claim 11 is characterized in that,
Described architectural feature extraction unit is used for setting up the DOM Document Object Model dom tree of this webpage, and extracts the structure of web page feature from described dom tree.
16. The webpage type recognition system according to claim 11, characterized in that
The content type propensity value computing unit is configured to calculate a news type propensity value of the webpage according to the text content of the webpage;
The type identification unit is configured to perform at least one of the following steps:
When the news type propensity value is greater than a preset news type first threshold, directly determining that the type of the webpage is news; or
When the news type propensity value is greater than a preset news type second threshold, and the webpage structure features include news category information, determining that the type of the webpage is news;
Wherein the news type first threshold is greater than the news type second threshold.
17. The webpage type recognition system according to claim 11, characterized in that the type of the webpage comprises a news type, a knowledge Q&A type, a forum discussion type, or an online transaction type.
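The processing chain recited in claims 13, 14 and 16 can be illustrated with a minimal Python sketch. It is not the patented implementation: the threshold values, the minimum sentence length, the tokenizer (a simple word split filtered by the dictionary), the relative term-frequency weighting, and the logistic scoring function are all assumptions chosen for illustration, and every function name is hypothetical. The claims only fix the order of the steps and the requirement that the first threshold exceed the second.

```python
import math
import re
from collections import Counter

# Hypothetical constants; the claims only require FIRST > SECOND.
MIN_SENTENCE_LENGTH = 10      # "predetermined value" of claim 14, in characters
NEWS_FIRST_THRESHOLD = 0.9
NEWS_SECOND_THRESHOLD = 0.6


def drop_short_sentences(text, min_length=MIN_SENTENCE_LENGTH):
    """Claim 14: discard whole sentences shorter than a predetermined length."""
    sentences = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    return [s.strip() for s in sentences if len(s.strip()) >= min_length]


def feature_vector(sentences, dictionary):
    """Claim 13: segment with a dictionary and weight each term.

    Stand-ins: segmentation is a word split filtered by the dictionary,
    and the weight is the relative term frequency.
    """
    tokens = [
        word
        for sentence in sentences
        for word in re.findall(r"\w+", sentence.lower())
        if word in dictionary
    ]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def news_propensity(vector, weights, bias=0.0):
    """Claim 13: score the vector with a preset classifier (assumed logistic)."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in vector.items())
    return 1.0 / (1.0 + math.exp(-z))


def is_news(propensity, has_news_structure,
            first=NEWS_FIRST_THRESHOLD, second=NEWS_SECOND_THRESHOLD):
    """Claim 16: two-threshold decision combining content and structure."""
    if first <= second:
        raise ValueError("first threshold must be greater than second threshold")
    if propensity > first:
        return True  # content alone is decisive
    # Otherwise the structure features must also carry news category information.
    return propensity > second and has_news_structure
```

The design point of the two thresholds is that a very high content score is accepted on its own, while a moderately high score is accepted only when the page structure corroborates it. In this sketch `has_news_structure` is passed in as a boolean; in the claimed system it would come from the structure features extracted from the DOM tree.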
CN201210058024.3A 2012-03-07 2012-03-07 Webpage type recognition method and system Active CN103309862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210058024.3A CN103309862B (en) 2012-03-07 2012-03-07 Webpage type recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210058024.3A CN103309862B (en) 2012-03-07 2012-03-07 Webpage type recognition method and system

Publications (2)

Publication Number Publication Date
CN103309862A (en) 2013-09-18
CN103309862B (en) 2017-05-17

Family

ID=49135101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210058024.3A Active CN103309862B (en) 2012-03-07 2012-03-07 Webpage type recognition method and system

Country Status (1)

Country Link
CN (1) CN103309862B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178714A (en) * 2006-12-20 2008-05-14 腾讯科技(深圳)有限公司 Web page classification method and device
CN101251855A (en) * 2008-03-27 2008-08-27 腾讯科技(深圳)有限公司 Equipment, system and method for cleaning internet web page
CN101609450A (en) * 2009-04-10 2009-12-23 南京邮电大学 Web page classification method based on training set
CN101872347A (en) * 2009-04-22 2010-10-27 富士通株式会社 Method and device for judging type of webpage
CN101727500A (en) * 2010-01-15 2010-06-09 清华大学 Text classification method of Chinese web page based on steam clustering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAOGUANG QI 等: ""Web page classification: Features and algorithms"", 《ACM COMPUTING SURVEYS (CSUR)》 *
刘欣: ""基于结构信息的中文网页自动分类技术研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544310B (en) * 2013-11-04 2017-08-08 北京中搜云商网络技术有限公司 A kind of information classification approach for the shopping guide's class webpage realized based on grader
CN103544310A (en) * 2013-11-04 2014-01-29 北京中搜网络技术股份有限公司 Shopping guide webpage information classifying method achieved based on classifier
CN104021180A (en) * 2014-06-09 2014-09-03 南京航空航天大学 Combined software defect report classification method
CN104021180B (en) * 2014-06-09 2017-10-24 南京航空航天大学 A kind of modular software defect report sorting technique
WO2015196740A1 (en) * 2014-06-25 2015-12-30 华南理工大学 Information forecast and acquisition method based on webpage link parameter analysis
CN105373570A (en) * 2014-09-02 2016-03-02 中兴通讯股份有限公司 Browser history management method and terminal
CN105373570B (en) * 2014-09-02 2020-09-15 中兴通讯股份有限公司 Management method and terminal for browser history records
CN106557517A (en) * 2015-09-29 2017-04-05 百度在线网络技术(北京)有限公司 The sort management method and device of website
CN105302884A (en) * 2015-10-19 2016-02-03 天津海量信息技术有限公司 Deep learning-based webpage mode recognition method and visual structure learning method
CN105302884B (en) * 2015-10-19 2019-02-19 天津海量信息技术股份有限公司 Webpage mode identification method and visual structure learning method based on deep learning
WO2018053863A1 (en) * 2016-09-26 2018-03-29 Microsoft Technology Licensing, Llc Identifying video pages
CN108475275A (en) * 2016-09-26 2018-08-31 微软技术许可有限责任公司 Identify video page
CN108255891B (en) * 2016-12-29 2020-08-28 北京国双科技有限公司 Method and device for judging webpage type
CN108255891A (en) * 2016-12-29 2018-07-06 北京国双科技有限公司 A kind of method and device for differentiating type of webpage
CN108345599A (en) * 2017-01-23 2018-07-31 阿里巴巴集团控股有限公司 Type of webpage determines method, apparatus and computer-readable medium
CN108345599B (en) * 2017-01-23 2021-12-14 阿里巴巴集团控股有限公司 Webpage type determination method and device and computer readable medium
CN108881138A (en) * 2017-10-26 2018-11-23 新华三信息安全技术有限公司 A kind of web-page requests recognition methods and device
CN108881138B (en) * 2017-10-26 2020-06-26 新华三信息安全技术有限公司 Webpage request identification method and device
EP3703329A4 (en) * 2017-10-26 2020-12-02 New H3C Security Technologies Co., Ltd. Webpage request identification
WO2019080860A1 (en) * 2017-10-26 2019-05-02 新华三信息安全技术有限公司 Webpage request identification
CN110110075A (en) * 2017-12-25 2019-08-09 中国电信股份有限公司 Web page classification method, device and computer readable storage medium
CN110427755A (en) * 2018-10-16 2019-11-08 新华三信息安全技术有限公司 A kind of method and device identifying script file
CN110781925A (en) * 2019-09-29 2020-02-11 支付宝(杭州)信息技术有限公司 Software page classification method and device, electronic equipment and storage medium
CN110781925B (en) * 2019-09-29 2023-03-10 支付宝(杭州)信息技术有限公司 Software page classification method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN103309862B (en) 2017-05-17

Similar Documents

Publication Publication Date Title
CN103309862B (en) Webpage type recognition method and system
CN107193959B (en) Pure text-oriented enterprise entity classification method
CN102332028B (en) Webpage-oriented unhealthy Web content identifying method
TWI424325B (en) Systems and methods for organizing collective social intelligence information using an organic object data model
CN102054015B (en) System and method of organizing community intelligent information by using organic matter data model
CN103870973B (en) Information push, searching method and the device of keyword extraction based on electronic information
CN103914478B (en) Webpage training method and system, webpage Forecasting Methodology and system
CN102929937B (en) Based on the data processing method of the commodity classification of text subject model
Nagamma et al. An improved sentiment analysis of online movie reviews based on clustering for box-office prediction
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN107590219A (en) Webpage personage subject correlation message extracting method
CN109522562B (en) Webpage knowledge extraction method based on text image fusion recognition
US20200004792A1 (en) Automated website data collection method
CN103853834B (en) Text structure analysis-based Web document abstract generation method
CN105378731A (en) Correlating corpus/corpora value from answered questions
CN101609450A (en) Web page classification method based on training set
CN103324666A (en) Topic tracing method and device based on micro-blog data
CN109558587B (en) Method for classifying public opinion tendency recognition aiming at category distribution imbalance
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
Rashid et al. Feature level opinion mining of educational student feedback data using sequential pattern mining and association rule mining
CN103593431A (en) Internet public opinion analyzing method and device
CN103559199A (en) Web information extraction method and web information extraction device
Chinsha et al. Aspect based opinion mining from restaurant reviews
CN113312476A (en) Automatic text labeling method and device and terminal
CN104794209A (en) Chinese microblog sentiment classification method and system based on Markov logic network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant