CN103309862B - Webpage type recognition method and system - Google Patents

Webpage type recognition method and system

Info

Publication number
CN103309862B
CN103309862B CN201210058024.3A CN201210058024A CN103309862B CN 103309862 B CN103309862 B CN 103309862B CN 201210058024 A CN201210058024 A CN 201210058024A CN 103309862 B CN103309862 B CN 103309862B
Authority
CN
China
Prior art keywords
webpage
type
news
content
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210058024.3A
Other languages
Chinese (zh)
Other versions
CN103309862A (en
Inventor
蔡兵
彭默
徐羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201210058024.3A
Publication of CN103309862A
Application granted
Publication of CN103309862B
Legal status: Active
Anticipated expiration

Abstract

Embodiments of the invention provide a webpage type recognition method and system. The method comprises the following steps: calculating a content type propensity value of a webpage according to the text content of the webpage; extracting webpage structure features of the webpage; and recognizing the type of the webpage using the content type propensity value and the webpage structure features. With this method and system, a webpage is classified by considering both the text content dimension and the webpage structure dimension, so the classification accuracy is higher. Moreover, data filtering effectively removes noise in the webpage that is unrelated to type recognition, such as tags, links, and advertisements, so the classification effect is better.

Description

A webpage type recognition method and system
Technical field
Embodiments of the present invention relate to the technical field of Internet applications, and more particularly to a webpage type recognition method and system.
Background
With the rapid development of computer and network technology, the Internet plays an increasingly important role in daily life, study, and work. According to the latest Internet development survey report published by CNNIC, the number of Internet users in China has reached 513 million, Chinese webpages numbered 60 billion in 2010, and there are at least 1 trillion webpages worldwide.
The information contained in the vast number of webpages on the Internet is numerous and complex, and accurately classifying these webpages to support follow-up work is a severe challenge. For example, in web advertising, displaying advertisements related to the webpage type can greatly increase the user click-through rate. In addition, with the development of the mobile Internet over the past two years, demand for mobile reading has surged, and news is undoubtedly one of the types users care about most. If news webpages can be recognized, cleaner data can be provided to mobile reading applications, which also helps with page extraction.
At present, the prior art generally classifies text content using the naive Bayes text classification method, which mainly includes: labeling training samples, using the words of the text as features, and estimating the class of the text by statistical methods.
First, the prior art classifies mainly according to webpage content, and classification based on webpage content alone is not very accurate. Second, compared with webpages on the Internet, the data sources used for text classification are overly simple, which makes such methods impractical.
Summary of the invention
Embodiments of the present invention propose a webpage type recognition method to improve the accuracy of webpage classification.
Embodiments of the present invention also propose a webpage type recognition system to improve the accuracy of webpage classification.
The specific scheme of the embodiments of the present invention is as follows:
A webpage type recognition method, the method comprising:
calculating a content type propensity value of a webpage according to the text content of the webpage;
extracting webpage structure features of the webpage;
recognizing the type of the webpage using the content type propensity value and the webpage structure features.
A webpage type recognition system, the system comprising a content type propensity value calculation unit, a structure feature extraction unit, and a type recognition unit, wherein:
the content type propensity value calculation unit is configured to calculate a content type propensity value of a webpage according to the text content of the webpage;
the structure feature extraction unit is configured to extract webpage structure features of the webpage;
the type recognition unit is configured to recognize the type of the webpage using the content type propensity value and the webpage structure features.
As can be seen from the above technical scheme, in embodiments of the present invention, the content type propensity value of a webpage is calculated according to the text content of the webpage; the webpage structure features of the webpage are extracted; and the type of the webpage is then recognized using the content type propensity value and the webpage structure features. Thus, with embodiments of the present invention, the webpage is first classified along two dimensions: one based on text content, the other based on webpage structure. The classification results of the two dimensions are then combined to determine the class of the webpage. Because embodiments of the present invention consider both the text content dimension and the webpage structure dimension when classifying a webpage, the classification accuracy is higher.
Brief description of the drawings
Fig. 1 is a flowchart of a webpage type recognition method according to an embodiment of the present invention;
Fig. 2 is an exemplary flowchart of a webpage type recognition method according to an embodiment of the present invention;
Fig. 3 is a structural diagram of a webpage type recognition system according to an embodiment of the present invention.
Detailed description
To make the objects, technical schemes, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
In embodiments of the present invention, a webpage is classified along two dimensions. One is based on text content; the other is based on webpage structure. The classification results of the two dimensions are then combined to determine the class of the webpage.
Fig. 1 is a flowchart of a webpage type recognition method according to an embodiment of the present invention.
As shown in Fig. 1, the method includes:
Step 101: Calculate the content type propensity value of the webpage according to the text content of the webpage.
This step performs a preliminary classification of the webpage type along the text content dimension. Classification by text content mainly uses a statistical machine learning classification algorithm: from training samples and features, it computes the probability that a page belongs to a particular type (such as the news type).
Specifically, the text content of the webpage may first be segmented into words using a dictionary, and the weights of the word features calculated to form a feature vector. The content type propensity value of the feature vector is then calculated by a preset webpage content classifier, where the calculated content type propensity value can be taken as the probability that the webpage belongs to the webpage type represented by that classifier.
Besides text, a webpage usually contains much other irrelevant content. Experiments show that using only the sentences in the webpage as the classification data source effectively removes noise such as tags, links, and advertisements, which improves classification. Therefore, in one embodiment, before the text content of the webpage is segmented using the dictionary, sentences whose whole-sentence length is less than a predetermined value may be filtered out of the text content to improve classification.
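The sentence-length filter described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `filter_short_sentences`, the threshold of 10 characters, and the set of sentence terminators are all assumptions, since the description only speaks of "a predetermined value".

```python
import re

# Hypothetical threshold; the description only says "a predetermined value".
MIN_SENTENCE_LENGTH = 10

def filter_short_sentences(text, min_length=MIN_SENTENCE_LENGTH):
    """Split text into sentences and keep those with at least min_length characters."""
    # Split on common Chinese and Western sentence terminators and on newlines.
    parts = re.split(r"[。！？!?.\n]+", text)
    sentences = [part.strip() for part in parts]
    # Short fragments such as menu labels and link text are dropped as noise.
    return [s for s in sentences if len(s) >= min_length]

page_text = "Home\nLogin\nThe central bank announced a new interest rate policy today. Share"
kept = filter_short_sentences(page_text)
```

Splitting on every period is a simplification: it also breaks decimal numbers and abbreviations, so a production filter would likely need a more careful sentence splitter.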
Moreover, to reduce the cost of manually labeling data, various websites (for example, some news websites) may be used as entry points for crawling data, and a large amount of news data (for example, thousands of items) can be obtained after simple manual review. Words are then used as classification features, and the dimensionality is reduced with algorithms such as feature selection.
In another embodiment, the classifier may use the logistic regression classification algorithm to calculate the content type propensity value of the feature vector. Logistic regression is a linear classifier that computes quickly, which makes it well suited to real-time classification scenarios.
In one embodiment, the weights of the word features may be calculated using the term frequency-inverse document frequency (TF-IDF) weighting algorithm.
TF-IDF is a weighting technique commonly used in information retrieval and data mining to assess how important a word is to one document in a document collection or corpus. Under TF-IDF weighting, the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency in the corpus.
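As an illustration of the weighting just described, the sketch below computes one common TF-IDF variant over already-segmented documents. The function name `tf_idf_vectors` and the exact formula (relative term frequency multiplied by the natural logarithm of the inverse document frequency) are assumptions; the description does not fix a particular variant.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        vector = {}
        for term, count in counts.items():
            tf = count / total                        # grows with occurrences in this document
            idf = math.log(n_docs / doc_freq[term])   # shrinks as the term spreads over the corpus
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors

docs = [["news", "report", "news"], ["forum", "post", "report"]]
vectors = tf_idf_vectors(docs)
```

In this toy corpus "report" occurs in every document, so its weight is zero, while "news" occurs twice in one document only and receives the largest weight there.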
Various forms of TF-IDF weighting are often used by search engines as a measure or ranking of the relevance between a document and a user query. In addition to TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.
Step 102: Extract the webpage structure features of the webpage.
This step performs a preliminary classification of the webpage type along the webpage structure dimension. Specifically, a Document Object Model (DOM) tree may first be built for the webpage, and webpage structure features are then extracted by traversing the DOM tree to serve as the basis for structural classification.
According to the W3C DOM specification, the DOM is an interface independent of browser, platform, and language that gives users access to other standard components of a page. The DOM resolved the conflict between Netscape's JavaScript and Microsoft's JScript and gave web designers and developers a standard way to access the data, scripts, and presentation-layer objects in their sites. A DOM is a collection of nodes, or pieces of information, organized in a hierarchy. This hierarchy lets developers navigate the tree to find specific information. Analyzing the structure usually requires loading the whole document and constructing the hierarchy before any work can be done. Because it is based on a hierarchy of information, the DOM is said to be tree-based or object-based.
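The hierarchical node structure described above can be illustrated with a simplified tree built by Python's standard `html.parser` module. This is only a sketch under stated assumptions: the `Node` and `DomBuilder` classes and the `build_dom` and `walk` helpers are hypothetical names, and the result is a much-reduced stand-in for a real W3C DOM implementation.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they never become the "current" node.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])  # tag attributes, e.g. {"id": "main"}
        self.parent = parent
        self.children = []
        self.text = ""                  # text found directly inside this node

class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data

def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def walk(node, depth=0):
    """Yield (depth, node) pairs in document order."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)

root = build_dom('<html><body><div id="main"><p>Hello</p><a href="/x">link</a></div></body></html>')
tags = [node.tag for _, node in walk(root)]
```

As the description notes, the whole document is loaded and the hierarchy built before any traversal starts; `walk` then visits every node so that structural features can be read off the tree.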
For example, the webpage structure features extracted by traversing the DOM tree may include:
1) URL features. For example, if the URL ends with index.html or the like, the page can largely be judged to be an index page. If the URL contains "content" or a date, the page is more likely to be a content page.
2) Text-link ratio. The ratio of the length of the plain text (Pure Text) to the length of the link text (Anchor) in the webpage.
3) Maximum text length. The length of the longest passage of text in the webpage, used as a length measure for content pages.
4) Longest continuous text ratio. The ratio of the length of the concentrated text to the total text length of the webpage. In general, the text of a content page is concentrated in one block, whereas the text of pages such as topic pages, although long, is relatively scattered.
5) Secondary navigation information.
6) Webpage title, and so on.
Although some specific webpage structure features are enumerated above, those skilled in the art will appreciate that the webpage structure features actually used are not limited to these, and neither is the protection scope of the embodiments of the present invention.
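Several of the features enumerated above can be approximated with the Python standard library alone, as in the following sketch. All names here (`TextStats`, `extract_structure_features`, the dictionary keys) and the regular expressions are assumptions made for illustration. In particular, the longest single text run stands in for both the "maximum text length" and the "concentrated text" of the longest continuous text ratio, which is a simplification; secondary navigation is not covered.

```python
import re
from html.parser import HTMLParser

class TextStats(HTMLParser):
    """Collect text lengths, separating link (anchor) text from plain text."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.link_depth = 0
        self.skip_depth = 0
        self.in_title = False
        self.plain_len = 0    # total length of text outside links
        self.anchor_len = 0   # total length of text inside <a> elements
        self.blocks = []      # lengths of individual plain-text runs
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        if self.in_title:
            self.title += text
        elif self.link_depth:
            self.anchor_len += len(text)
        else:
            self.plain_len += len(text)
            self.blocks.append(len(text))

def extract_structure_features(url, html):
    stats = TextStats()
    stats.feed(html)
    stats.close()
    longest = max(stats.blocks, default=0)
    return {
        # URL ending in index.html or similar suggests an index page.
        "url_is_index": bool(re.search(r"/index\.\w+$", url)),
        # "content" or a date in the URL suggests a content page.
        "url_has_content_or_date": bool(
            re.search(r"content|\d{4}[-/]?\d{2}[-/]?\d{2}", url)),
        "text_link_ratio": stats.plain_len / max(stats.anchor_len, 1),
        "max_text_length": longest,
        "longest_text_ratio": longest / stats.plain_len if stats.plain_len else 0.0,
        "title": stats.title,
    }

sample = ('<html><head><title>Daily News</title></head><body>'
          '<a href="/">Home</a><p>' + "x" * 80 + '</p><p>short text</p>'
          '<script>var a = 1;</script></body></html>')
features = extract_structure_features(
    "http://example.com/news/2012/03/07/content_1.html", sample)
```

A content page with one long article body yields a high text-link ratio and a longest-text ratio close to 1, whereas a link-heavy index page yields low values for both.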
Step 103: Recognize the type of the webpage using the content type propensity value and the webpage structure features.
Here, based on the content type propensity value calculated in step 101 and the webpage structure features extracted in step 102, the threshold and combination strategy for each feature can be determined by various preset judgment criteria, and the type of the page is finally obtained.
For example, when the news type propensity value of the webpage is calculated from the text content of the webpage in step 101, the judgment criteria may specifically include:
1) When the news type propensity value is greater than a preset first news type threshold, the type of the webpage is directly judged to be news.
For example, assume that the news type propensity value ranges from 0 to 100, the calculated news type propensity value is 90, and the first news type threshold is 85. Because the calculated news type propensity value is greater than the first news type threshold, the webpage can be considered highly related to news, and the type of the webpage can be directly judged to be news without considering the webpage structure features.
2) When the news type propensity value is greater than a preset second news type threshold and the webpage structure features contain news-category information, the type of the webpage is judged to be news, where the first news type threshold is greater than the second news type threshold.
For example, assume that the news type propensity value ranges from 0 to 100, the calculated news type propensity value is 70, the first news type threshold is 85, and the second news type threshold is 60. Because the calculated news type propensity value is less than the first news type threshold, the webpage cannot be directly deemed to be of the news type. However, because the calculated news type propensity value is greater than the second news type threshold, the webpage can be considered related to the news type, so the calculated news type propensity value and the webpage structure features must be combined to determine whether the webpage is of the news type. In this case, if the webpage structure features also contain news-category information (for example, the webpage title contains "news"), the type of the webpage can be judged to be news.
When the calculated news type propensity value is less than the second news type threshold, the webpage can be directly deemed unrelated to the news type.
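The two-threshold criteria above can be written as a short decision function. The threshold values 85 and 60 are the example values from the description; the function name `is_news` is hypothetical, and treating a page that lies between the thresholds but lacks news-category structure information as not news is an assumption, since the description does not state that case explicitly.

```python
FIRST_THRESHOLD = 85    # example value from the description
SECOND_THRESHOLD = 60   # example value from the description

def is_news(propensity, has_news_structure_info,
            first_threshold=FIRST_THRESHOLD,
            second_threshold=SECOND_THRESHOLD):
    """Combine the news type propensity value with a structural signal."""
    if first_threshold <= second_threshold:
        raise ValueError("the first threshold must exceed the second threshold")
    if propensity > first_threshold:
        # Criterion 1: highly related to news, structure is not consulted.
        return True
    if propensity > second_threshold:
        # Criterion 2: related to news, structure features must confirm it.
        return has_news_structure_info
    # At or below the second threshold: treated as unrelated to news.
    return False

direct = is_news(90, False)
combined = is_news(70, True)
```

With the example values, a score of 90 is accepted outright, a score of 70 is accepted only when the structure features contain news-category information, and a score of 50 is rejected regardless of structure.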
In embodiments of the present invention, the final recognition accuracy for news-type webpages can reach over 95%, with a recall of more than 80%.
Although embodiments of the present invention are described in detail above using the news type as an example, those skilled in the art will appreciate that, based on the above detailed teachings, the webpage types to which embodiments of the present invention apply are not limited to the news type, and may include multiple types such as the knowledge question-and-answer type, the forum discussion type, or the online transaction type.
In the above method flow, there is no strict requirement on the execution order of step 101 and step 102. In practice, step 101 and step 102 may be performed simultaneously, step 101 may be performed before step 102, or step 102 may be performed before step 101.
Moreover, after the webpage type has been recognized by the above flow, a variety of applications can be performed using the recognized webpage type.
For example: the advertisement relevance of the webpage may be calculated based on the recognized webpage type; personalized news recommendation may be performed for the webpage based on the recognized webpage type; structured webpage data may be extracted from the webpage based on the recognized webpage type; or data screening for reading applications may be performed for the webpage based on the recognized webpage type, and so on.
Based on the above detailed analysis, an exemplary flow of the present invention is described below, taking as an example the determination of whether a webpage is of the news type.
Fig. 2 is an exemplary flowchart of a webpage type recognition method according to an embodiment of the present invention.
As shown in Fig. 2, the processing of a webpage has two branches. The left branch includes steps 201, 202, and 203; the right branch includes steps 204 and 205. The two branches converge at step 206. The left branch includes:
Step 201: Perform data filtering. To keep out webpage noise, only some of the longer sentences in the webpage are extracted as the text. Here, sentences whose whole-sentence length is less than a predetermined value can be filtered out of the text content to improve classification.
Step 202: Segment the text into words using a feature set dictionary, then calculate the weight of each word feature (using the feature set and a feature weighting method such as TF-IDF) to form a feature vector.
Step 203: Use the feature vector as the input to the classifier to obtain an output value (ranging from 0 to 100), namely the news content type propensity value, which represents the degree to which the content tends to be news. The classifier can be obtained in advance from training samples and features using the logistic regression algorithm.
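How a trained logistic regression classifier could turn a feature vector into a 0-100 score is sketched below. The function name `news_propensity`, the sparse dictionary representation, and the example weights are hypothetical; multiplying the logistic probability by 100 is an assumption about how the 0-100 range is obtained, and training the weights is outside the scope of this sketch.

```python
import math

def news_propensity(features, weights, bias=0.0):
    """Score a sparse {term: value} feature vector on a 0-100 scale."""
    # Linear part of the classifier: weighted sum of the feature values.
    z = bias + sum(weights.get(term, 0.0) * value
                   for term, value in features.items())
    # Numerically stable logistic (sigmoid) function.
    if z >= 0:
        probability = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        probability = e / (1.0 + e)
    return 100.0 * probability

# Hypothetical weights; real values would come from training on labeled pages.
example_weights = {"reporter": 2.0, "login": -1.5}
score = news_propensity({"reporter": 1.2, "login": 0.1}, example_weights)
```

Scoring is a single pass over the nonzero features, which is consistent with the description's remark that logistic regression suits real-time classification.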
The right branch includes:
Step 204: Build the DOM tree. This includes building the DOM tree from the HTML tags of the webpage, including information such as the attributes of each tag.
Step 205: Extract structural features, such as secondary navigation and text-link ratio, based on the DOM tree.
The left and right branches converge at step 206: combined judgment. Using the output of step 203 and the output of step 205, a preset strategy is applied to make the best determination of whether the page is a news content page.
Based on the above detailed description, embodiments of the present invention also propose a webpage type recognition system.
Fig. 3 is a structural diagram of a webpage type recognition system according to an embodiment of the present invention.
As shown in Fig. 3, the system includes: a content type propensity value calculation unit 301, a structure feature extraction unit 302, and a type recognition unit 303.
The content type propensity value calculation unit 301 is configured to calculate the content type propensity value of a webpage according to the text content of the webpage;
the structure feature extraction unit 302 is configured to extract the webpage structure features of the webpage;
the type recognition unit 303 is configured to recognize the type of the webpage using the content type propensity value and the webpage structure features.
In one embodiment, the system further includes a type processing unit (not shown in the figure). The type processing unit is configured to perform at least one of the following steps: calculating the advertisement relevance of the webpage based on the recognized webpage type; performing personalized news recommendation for the webpage based on the recognized webpage type; extracting structured webpage data from the webpage based on the recognized webpage type; or performing data screening for reading applications for the webpage based on the recognized webpage type.
Specifically, the content type propensity value calculation unit 301 is configured to segment the text content of the webpage into words using a dictionary, calculate the weights of the word features to form a feature vector, and calculate the content type propensity value of the feature vector using a preset webpage content classifier.
Preferably, the content type propensity value calculation unit 301 is further configured to filter out, before the text content of the webpage is segmented using the dictionary, sentences whose whole-sentence length is less than a predetermined value from the text content.
Specifically, the structure feature extraction unit 302 is configured to build the Document Object Model (DOM) tree of the webpage and extract webpage structure features from the DOM tree.
In one embodiment, the content type propensity value calculation unit 301 is configured to calculate the news type propensity value of the webpage according to the text content of the webpage. In this case, the type recognition unit 303 is configured to perform at least one of the following steps: when the news type propensity value is greater than a preset first news type threshold, directly judging the type of the webpage to be news; or when the news type propensity value is greater than a preset second news type threshold and the webpage structure features contain news-category information, judging the type of the webpage to be news; where the first news type threshold is greater than the second news type threshold.
Similarly, the webpage types to which the webpage type recognition system in embodiments of the present invention applies are not limited to the news type, and may include the knowledge question-and-answer type, the forum discussion type, the online transaction type, and so on.
In summary, in embodiments of the present invention, the content type propensity value of a webpage is calculated according to the text content of the webpage; the webpage structure features of the webpage are extracted; and the type of the webpage is then recognized using the content type propensity value and the webpage structure features. Thus, with embodiments of the present invention, a webpage is classified along two dimensions: one based on text content, the other based on webpage structure. The classification results of the two dimensions are then combined to determine the class of the webpage. Because embodiments of the present invention consider both the text content dimension and the webpage structure dimension when classifying a webpage, the classification accuracy is higher.
Moreover, in embodiments of the present invention, data filtering effectively removes noise in the webpage that is unrelated to type recognition, such as tags, links, and advertisements, so the classification effect is better.
The above are merely preferred embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (15)

1. A webpage type recognition method, characterized in that the method comprises:
calculating a content type propensity value of a webpage according to the text content of the webpage;
extracting webpage structure features of the webpage;
recognizing the type of the webpage using the content type propensity value and the webpage structure features;
wherein calculating the content type propensity value of the webpage according to the text content of the webpage is specifically: calculating a news type propensity value of the webpage according to the text content of the webpage; and wherein:
recognizing the type of the webpage using the news type propensity value and the webpage structure features specifically comprises at least one of the following steps:
when the news type propensity value is greater than a preset first news type threshold, directly judging the type of the webpage to be news; or
when the news type propensity value is greater than a preset second news type threshold and the webpage structure features contain news-category information, judging the type of the webpage to be news;
wherein the first news type threshold is greater than the second news type threshold.
2. The webpage type recognition method according to claim 1, characterized in that the method further comprises at least one of the following steps:
calculating the advertisement relevance of the webpage based on the recognized webpage type;
performing personalized news recommendation for the webpage based on the recognized webpage type;
extracting structured webpage data from the webpage based on the recognized webpage type; or
performing data screening for reading applications for the webpage based on the recognized webpage type.
3. The webpage type recognition method according to claim 1, characterized in that calculating the content type propensity value of the webpage according to the text content of the webpage specifically comprises:
segmenting the text content of the webpage into words using a dictionary, and calculating the weights of the word features to form a feature vector;
calculating the content type propensity value of the feature vector using a preset webpage content classifier.
4. The webpage type recognition method according to claim 3, characterized in that, before the text content of the webpage is segmented using the dictionary, the method further comprises: filtering out, from the text content, sentences whose whole-sentence length is less than a predetermined value.
5. The webpage type recognition method according to claim 3, characterized in that calculating the weights of the word features is: calculating the weights of the word features using the term frequency-inverse document frequency (TF-IDF) weighting algorithm.
6. The webpage type recognition method according to claim 3, characterized in that, in the method:
the webpage content classifier calculates the content type propensity value of the feature vector using the logistic regression classification algorithm.
7. The webpage type recognition method according to claim 1, characterized in that extracting the webpage structure features of the webpage specifically comprises:
building the Document Object Model (DOM) tree of the webpage;
extracting webpage structure features from the DOM tree.
8. The webpage type recognition method according to claim 7, characterized in that the webpage structure features include at least one of the following:
secondary navigation information;
text-link ratio;
uniform resource locator (URL);
webpage title;
maximum text length; or
longest continuous text ratio.
9. The webpage type recognition method according to claim 1, characterized in that the type of the webpage includes the news type, the knowledge question-and-answer type, the forum discussion type, or the online transaction type.
10. a kind of type of webpage identifying system, it is characterised in that the system includes content type propensity value computing unit, structure Feature extraction unit and type identification unit, wherein:
Content type propensity value computing unit, for calculating the content type propensity value of the webpage according to the content of text of webpage;
Architectural feature extraction unit, for extracting the structure of web page feature of the webpage;
Type identification unit, for using the content type propensity value and webpage described in the structure of web page feature identification class Type;
The content of text according to webpage calculates the content type propensity value of the webpage and is specially:According to the content of text of webpage Calculate the news type propensity value of the webpage;Wherein:
Using news type propensity value and the type of structure of web page feature identification webpage, at least in following steps is specifically included It is individual:
When the news type propensity value is more than the news type first threshold for pre-setting, the class of the webpage is directly judged Type is news;Or
When the news type propensity value is more than the news type Second Threshold for pre-setting, and wrap in the structure of web page feature During category information containing news, judge the type of the webpage as news;
Wherein described news type first threshold is more than news type Second Threshold.
11. type of webpage identifying systems according to claim 10, it is characterised in that the system is further included at type Reason unit, the type of process unit is used to perform at least one of following steps:
Based on the type of webpage for being recognized, the advertisement degree of association of the webpage is calculated;
Based on the type of webpage for being recognized, perform Personalize News for the webpage and recommend;
Based on the type of webpage for being recognized, Web page structural data are extracted from the webpage;Or
Based on the type of webpage for being recognized, the data screening for reading class application is performed for the webpage.
12. type of webpage identifying systems according to claim 10, it is characterised in that
The content type propensity value computing unit, divides for participle being carried out to the content of text of webpage using dictionary, and being calculated The weight of word feature is forming characteristic vector;And according to the content of the web page contents classifier calculated this feature vector for pre-setting Type propensity value.
13. type of webpage identifying systems according to claim 10, it is characterised in that
The content type propensity value computing unit, be further used for using dictionary the content of text of webpage is carried out participle it Before, sentence of the whole sentence length less than predetermined value is filtered off from the content of text.
14. The webpage type recognition system according to claim 10, characterized in that
The structural feature extraction unit is configured to build a Document Object Model (DOM) tree of the webpage and to extract the webpage structural features from the DOM tree.
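As a rough illustration of building a DOM-like tree and reading structural features from it, the sketch below uses Python's standard `html.parser`. The claim does not say which features are extracted; the per-tag counts and maximum nesting depth computed here, as well as the names `Node`, `DomBuilder`, and `structural_features`, are assumptions made for the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []

class DomBuilder(HTMLParser):
    """Builds a simplified element tree (text nodes and attributes are ignored)."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into non-void elements

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

def structural_features(html):
    """Return per-tag counts and the maximum element depth of the page."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    counts, max_depth = {}, 0
    stack = [(builder.root, 0)]
    while stack:
        node, depth = stack.pop()
        if node is not builder.root:
            counts[node.tag] = counts.get(node.tag, 0) + 1
            max_depth = max(max_depth, depth)
        stack.extend((child, depth + 1) for child in node.children)
    return {"tag_counts": counts, "max_depth": max_depth}
```

A production system would more likely rely on a full HTML5 parser; the hand-rolled tree is used here only to keep the example self-contained.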
15. The webpage type recognition system according to claim 10, characterized in that the type of the webpage includes a news type, a knowledge question-and-answer type, a forum discussion type, or an online transaction type.
CN201210058024.3A 2012-03-07 2012-03-07 Webpage type recognition method and system Active CN103309862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210058024.3A CN103309862B (en) 2012-03-07 2012-03-07 Webpage type recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210058024.3A CN103309862B (en) 2012-03-07 2012-03-07 Webpage type recognition method and system

Publications (2)

Publication Number Publication Date
CN103309862A CN103309862A (en) 2013-09-18
CN103309862B CN103309862B (en) 2017-05-17

Family

ID=49135101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210058024.3A Active CN103309862B (en) 2012-03-07 2012-03-07 Webpage type recognition method and system

Country Status (1)

Country Link
CN (1) CN103309862B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544310B (en) * 2013-11-04 2017-08-08 北京中搜云商网络技术有限公司 A kind of information classification approach for the shopping guide's class webpage realized based on grader
CN104021180B (en) * 2014-06-09 2017-10-24 南京航空航天大学 A kind of modular software defect report sorting technique
CN104090931A (en) * 2014-06-25 2014-10-08 华南理工大学 Information prediction and acquisition method based on webpage link parameter analysis
CN105373570B (en) * 2014-09-02 2020-09-15 中兴通讯股份有限公司 Management method and terminal for browser history records
CN106557517A (en) * 2015-09-29 2017-04-05 百度在线网络技术(北京)有限公司 The sort management method and device of website
CN105302884B (en) * 2015-10-19 2019-02-19 天津海量信息技术股份有限公司 Webpage mode identification method and visual structure learning method based on deep learning
WO2018053863A1 (en) * 2016-09-26 2018-03-29 Microsoft Technology Licensing, Llc Identifying video pages
CN108255891B (en) * 2016-12-29 2020-08-28 北京国双科技有限公司 Method and device for judging webpage type
CN108345599B (en) * 2017-01-23 2021-12-14 阿里巴巴集团控股有限公司 Webpage type determination method and device and computer readable medium
CN108881138B (en) * 2017-10-26 2020-06-26 新华三信息安全技术有限公司 Webpage request identification method and device
CN110110075A (en) * 2017-12-25 2019-08-09 中国电信股份有限公司 Web page classification method, device and computer readable storage medium
CN110427755A (en) * 2018-10-16 2019-11-08 新华三信息安全技术有限公司 A kind of method and device identifying script file
CN110781925B (en) * 2019-09-29 2023-03-10 支付宝(杭州)信息技术有限公司 Software page classification method and device, electronic equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN101178714A (en) * 2006-12-20 2008-05-14 腾讯科技(深圳)有限公司 Web page classification method and device
CN101251855A (en) * 2008-03-27 2008-08-27 腾讯科技(深圳)有限公司 Equipment, system and method for cleaning internet web page
CN101609450A (en) * 2009-04-10 2009-12-23 南京邮电大学 Web page classification method based on training set
CN101727500A (en) * 2010-01-15 2010-06-09 清华大学 Text classification method of Chinese web page based on steam clustering
CN101872347A (en) * 2009-04-22 2010-10-27 富士通株式会社 Method and device for judging type of webpage

Non-Patent Citations (2)

Title
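"Web page classification: Features and algorithms"; Xiaoguang Qi et al.; ACM Computing Surveys (CSUR); 2009-02-28; Vol. 41, No. 2; pp. 1-31 *
"Research on Automatic Classification of Chinese Web Pages Based on Structural Information"; Liu Xin; China Master's Theses Full-text Database, Information Science and Technology; 2011-06-15 (No. 6); Chapter 3, Figs. 3.4 and 3.5 *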

Also Published As

Publication number Publication date
CN103309862A (en) 2013-09-18

Similar Documents

Publication Publication Date Title
CN103309862B (en) Webpage type recognition method and system
CN107193959B (en) Pure text-oriented enterprise entity classification method
CN103870973B (en) Information push, searching method and the device of keyword extraction based on electronic information
CN102332028B (en) Webpage-oriented unhealthy Web content identifying method
KR101741509B1 (en) Device and method for analyzing corporate reputation by data mining of news, recording medium for performing the method
Anderka et al. Predicting quality flaws in user-generated content: the case of wikipedia
CN102929937B (en) Based on the data processing method of the commodity classification of text subject model
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
CN101609450A (en) Web page classification method based on training set
WO2009096523A1 (en) Information analysis device, search system, information analysis method, and information analysis program
CN103914478A (en) Webpage training method and system and webpage prediction method and system
Barakat et al. Applying deep learning models to twitter data to detect airport service quality
CN103324666A (en) Topic tracing method and device based on micro-blog data
Wang et al. Customer-driven product design selection using web based user-generated content
US20200004792A1 (en) Automated website data collection method
US11893537B2 (en) Linguistic analysis of seed documents and peer groups
CN111160019B (en) Public opinion monitoring method, device and system
CN107885883A (en) A kind of macroeconomy field sentiment analysis method and system based on Social Media
CN106933800A (en) A kind of event sentence abstracting method of financial field
CN110287314A (en) Long text credibility evaluation method and system based on Unsupervised clustering
CN109710725A (en) A kind of Chinese table column label restoration methods and system based on text classification
Sivakumar Effectual web content mining using noise removal from web pages
Verberne et al. Automatic thematic classification of election manifestos
CN114997288A (en) Design resource association method
Vavpetič et al. Semantic data mining of financial news articles

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant