CN103309862B - Webpage type recognition method and system - Google Patents
- Publication number
- CN103309862B
- Authority
- CN
- China
- Prior art keywords
- webpage
- type
- news
- content
- web page
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
An embodiment of the invention provides a webpage type recognition method and system. The method comprises: calculating a content type propensity value of a webpage according to the text content of the webpage; extracting webpage structure features of the webpage; and recognizing the type of the webpage using the content type propensity value and the webpage structure features. Because the webpage is classified along both the text content dimension and the webpage structure dimension, classification accuracy is higher. In addition, data filtering effectively removes noise in the webpage that is unrelated to type recognition, such as tags, links and advertisements, which further improves the classification result.
Description
Technical field
Embodiments of the present invention relate to the technical field of Internet applications, and more particularly to a webpage type recognition method and system.
Background
With the rapid development of computer and network technology, the Internet plays an increasingly important role in daily life, study and work. According to the latest Internet development survey report published by CNNIC, the number of Internet users in China has reached 513 million; in 2010 there were 60 billion Chinese webpages, and at least 1 trillion webpages worldwide.
The webpages on the Internet contain a large and varied body of information, and classifying them accurately so that subsequent work can build on the result is a serious challenge. For example, in web advertising, showing advertisements related to the webpage type greatly increases the user click-through rate. In addition, with the development of the mobile Internet over the past two years, demand for mobile reading has surged, and news is undoubtedly one of the types users care about most. If news webpages can be recognized, cleaner data can be supplied to mobile reading applications, and page extraction can benefit as well.
At present, the prior art generally recognizes text content using a naive Bayes text classification method, which mainly includes labeling training samples, using the words of the text as features, and estimating the class of a text by counting.
First, the prior art classifies mainly according to webpage content, and classification based on webpage content alone is not very accurate. Second, compared with webpages on the Internet, the data sources used for text classification are too simple to be practical.
Summary of the invention
Embodiments of the present invention propose a webpage type recognition method to improve webpage classification accuracy.
Embodiments of the present invention also propose a webpage type recognition system to improve webpage classification accuracy.
The specific schemes of the embodiments of the present invention are as follows:
A webpage type recognition method, comprising:
calculating a content type propensity value of a webpage according to the text content of the webpage;
extracting webpage structure features of the webpage;
recognizing the type of the webpage using the content type propensity value and the webpage structure features.
A webpage type recognition system, comprising a content type propensity value computing unit, a structure feature extraction unit and a type recognition unit, wherein:
the content type propensity value computing unit is configured to calculate a content type propensity value of a webpage according to the text content of the webpage;
the structure feature extraction unit is configured to extract webpage structure features of the webpage;
the type recognition unit is configured to recognize the type of the webpage using the content type propensity value and the webpage structure features.
As can be seen from the above technical scheme, in embodiments of the present invention the content type propensity value of a webpage is calculated according to its text content, the webpage structure features are extracted, and the type of the webpage is then recognized using the content type propensity value and the webpage structure features. In other words, the webpage is first classified along two dimensions, one based on text content and the other based on webpage structure, and the class of the webpage is then determined by combining the classification results of the two dimensions. Because embodiments of the present invention consider both the text content dimension and the webpage structure dimension when classifying a webpage, the classification is more accurate.
Brief description of the drawings
Fig. 1 is a flowchart of a webpage type recognition method according to an embodiment of the present invention;
Fig. 2 is an exemplary flowchart of a webpage type recognition method according to an embodiment of the present invention;
Fig. 3 is a structural diagram of a webpage type recognition system according to an embodiment of the present invention.
Detailed description
To make the objects, technical schemes and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
In embodiments of the present invention, a webpage is classified along two dimensions. One is based on text content and the other is based on webpage structure. The class of the webpage is then determined by combining the classification results of the two dimensions.
Fig. 1 is a flowchart of a webpage type recognition method according to an embodiment of the present invention.
As shown in Fig. 1, the method includes:
Step 101: Calculate the content type propensity value of the webpage according to the text content of the webpage.
This step performs a preliminary classification of the webpage type along the text content dimension. Classification by text content mainly relies on a statistical machine learning classification algorithm: using training samples and features, it calculates the probability that a given page belongs to a particular type (for example, the news type).
Specifically, the text content of the webpage can first be segmented into words using a dictionary, and the weights of the word features calculated to form a feature vector. A preset webpage content classifier then calculates the content type propensity value of this feature vector; the calculated value can be taken as the probability that the page belongs to the webpage type represented by that classifier.
Besides text, a webpage usually contains much other irrelevant content. Experiments show that using only the whole sentences in a webpage as the classification data source effectively removes noise such as tags, links and advertisements, and improves the classification result. Therefore, in one embodiment, before the text content of the webpage is segmented using the dictionary, sentences whose whole-sentence length is less than a predetermined value can be filtered out of the text content to improve the classification result.
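The sentence filter described above can be sketched as follows. This is only an illustration under stated assumptions: the threshold of 20 characters, the set of sentence terminators and the function name `filter_short_sentences` are hypothetical, since the description only speaks of "a predetermined value".

```python
import re

# Hypothetical threshold; the description only says "a predetermined value".
MIN_SENTENCE_CHARS = 20

# Assumed sentence terminators (Chinese and Western punctuation, newlines).
_TERMINATORS = re.compile(r"[。！？.!?\n]+")


def filter_short_sentences(text, min_chars=MIN_SENTENCE_CHARS):
    """Return the sentences of `text` that are at least `min_chars` long."""
    sentences = (s.strip() for s in _TERMINATORS.split(text))
    return [s for s in sentences if len(s) >= min_chars]
```

Short fragments such as menu labels or "Share" buttons fall below the threshold and are dropped, while body sentences survive.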
Moreover, to reduce the cost of manually labeling data, various websites (for example, some news websites) can be used as entry points for crawling data; after a simple manual review, a large amount of news data (for example, thousands of items) is obtained. Words are then used as classification features, and dimensionality is reduced with algorithms such as feature selection.
In another embodiment, the classifier can use the logistic regression classification algorithm to calculate the content type propensity value of the feature vector. Logistic regression is a linear classifier that computes quickly, which makes it well suited to real-time classification scenarios.
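As a rough illustration of how a trained logistic regression model could turn a feature vector into a propensity value, consider the sketch below. The names `propensity_score`, `weights` and `bias` are hypothetical, the weights are assumed to have been learned beforehand, and the scaling to 0-100 follows the value range that the description gives for step 203.

```python
import math


def propensity_score(features, weights, bias=0.0):
    """Score a sparse feature vector {term: value} on a 0-100 scale.

    `weights` and `bias` are assumed to come from a previously trained
    logistic regression model; they are not specified in the description.
    """
    z = bias + sum(weights.get(term, 0.0) * value
                   for term, value in features.items())
    z = max(-700.0, min(700.0, z))  # keep math.exp within float range
    return 100.0 / (1.0 + math.exp(-z))
```

Scoring is a single weighted sum followed by the sigmoid function, which is why this kind of classifier is cheap enough for real-time use.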
In one embodiment, the weights of the word features can be calculated using the term frequency-inverse document frequency (TF-IDF) weighting algorithm.
TF-IDF is a common weighting technique for information retrieval and data mining, used to assess how important a word is to one document in a document set or corpus. Under TF-IDF weighting, the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to how frequently it appears across the corpus.
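A minimal sketch of one common TF-IDF variant is given below, assuming the text has already been segmented into word tokens. The exact formula used by the embodiment is not stated in the description; here the weight is taken as (term count / document length) × log(N / document frequency), and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Return {term: weight} for one tokenized document.

    `corpus` is a list of tokenized documents. Terms that never occur in
    the corpus are treated as having document frequency 1 (an assumption
    made here to avoid division by zero).
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        tf = count / total
        df = sum(1 for doc in corpus if term in doc)
        weights[term] = tf * math.log(n_docs / max(df, 1))
    return weights
```

A word that appears in most documents of the corpus receives a small weight, while a word concentrated in few documents receives a larger one.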
Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the relevance between a document and a user query. In addition to TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.
Step 102: Extract the webpage structure features of the webpage.
This step performs a preliminary classification of the webpage type along the webpage structure dimension. Specifically, a Document Object Model (DOM) tree can first be built for the webpage, and webpage structure features are then extracted by traversing the DOM tree to serve as the basis for structural classification.
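Building and traversing a simple DOM-like tree can be sketched with Python's standard `html.parser` module, as shown below. This is not a full W3C DOM implementation; the `Node`, `DomBuilder`, `build_dom` and `walk` names are hypothetical, and the handling of malformed markup is deliberately simplistic.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                       "input", "link", "meta", "source", "track", "wbr"})


class Node:
    """One element of the simplified tree."""
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""  # text found directly inside this element


class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self._current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent

    def handle_data(self, data):
        self._current.text += data


def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node):
    """Yield every node in document (pre-order) order."""
    yield node
    for child in node.children:
        yield from walk(child)
```

Structure features can then be computed in a single pass over `walk(build_dom(html))`, for example by summing text lengths per tag.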
According to the W3C DOM specification, the DOM is an interface independent of browser, platform and language that gives users access to the standard components of a page. The DOM resolved the conflict between Netscape's JavaScript and Microsoft's JScript, giving web designers and developers a standard way to access the data, scripts and presentation-layer objects of a site. A DOM is a collection of nodes, or pieces of information, organized in a hierarchy. This hierarchy allows developers to navigate the tree in search of specific information. Analyzing the structure generally requires loading the whole document and constructing the hierarchy before any work can be done. Because it is based on a hierarchy of information, the DOM is regarded as tree-based or object-based.
For example, the webpage structure features extracted by traversing the DOM tree can include:
1) URL features. For example, if the URL ends with index.html, the page can basically be judged to be an index page. If the URL contains "content" or a date, the page is more likely to be a content page.
2) Text-to-link ratio. The ratio of the plain text (Pure Text) length to the link text (Anchor) length in the webpage.
3) Maximum text length. The length of the longest text passage in the webpage, used as a length measure for content pages.
4) Longest continuous text ratio. The length of the concentrated text as a proportion of the total text length of the webpage. In general, the text of a content page is concentrated in one block, whereas a page such as a special-topic page may contain a lot of text overall, but that text is scattered.
5) Secondary navigation information.
6) Webpage title, etc.
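Several of the features listed above can be approximated with the standard library alone, as in the sketch below. The feature names, the date pattern in the URL check and the treatment of each text run as one "passage" are assumptions made for illustration; secondary navigation detection is not covered.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class TextStats(HTMLParser):
    """Collect plain-text lengths, anchor-text length and the page title."""
    def __init__(self):
        super().__init__()
        self._stack = []      # currently open tags
        self.plain_len = 0    # characters of text outside links
        self.anchor_len = 0   # characters of link (anchor) text
        self.blocks = []      # length of each plain-text run
        self.title = ""

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop up to and including the matching open tag.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or "script" in self._stack or "style" in self._stack:
            return
        if "title" in self._stack:
            self.title += text
        elif "a" in self._stack:
            self.anchor_len += len(text)
        else:
            self.plain_len += len(text)
            self.blocks.append(len(text))


def url_features(url):
    path = urlparse(url).path.lower()
    # Assumed date pattern such as 2012-03-07, 2012/03/07 or 20120307.
    has_date = re.search(r"20\d{2}[-/_]?\d{2}[-/_]?\d{2}", path) is not None
    return {
        "looks_like_index": path.endswith(("index.html", "index.htm")),
        "has_content_hint": "content" in path or has_date,
    }


def structural_features(url, html):
    stats = TextStats()
    stats.feed(html)
    stats.close()
    longest = max(stats.blocks, default=0)
    features = url_features(url)
    features.update({
        "text_link_ratio": stats.plain_len / max(stats.anchor_len, 1),
        "max_text_len": longest,
        "longest_text_share": longest / stats.plain_len if stats.plain_len else 0.0,
        "title": stats.title,
    })
    return features
```

A content page would typically show a high text-to-link ratio and a large longest-text share, while an index page would show the opposite.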
Although some specific webpage structure features are listed in detail above, those skilled in the art will appreciate that the webpage structure features actually adopted are not limited to these, nor is the protection scope of the embodiments of the present invention.
Step 103: Recognize the type of the webpage using the content type propensity value and the webpage structure features.
Here, based on the content type propensity value calculated in step 101 and the webpage structure features extracted in step 102, the threshold and combination strategy for each feature can be determined through various preset judgment criteria, and the type of the page is finally obtained.
For example, when step 101 calculates a news type propensity value for the webpage according to its text content, the judgment criteria can specifically include:
1) When the news type propensity value is greater than a preset first news type threshold, the type of the webpage is directly judged to be news.
For example, assume the news type propensity value ranges from 0 to 100, the calculated news type propensity value is 90, and the first news type threshold is 85. Because the calculated value is greater than the first news type threshold, the webpage can be considered highly related to news, and its type can be judged to be news directly, without considering the webpage structure features.
2) When the news type propensity value is greater than a preset second news type threshold and the webpage structure features contain news-related information, the type of the webpage is judged to be news, where the first news type threshold is greater than the second news type threshold.
For example, assume the news type propensity value ranges from 0 to 100, the calculated news type propensity value is 70, the first news type threshold is 85 and the second news type threshold is 60. Because the calculated value is less than the first news type threshold, the webpage cannot be directly deemed to be of the news type. However, because the calculated value is greater than the second news type threshold, the webpage can be considered related to the news type, so the calculated news type propensity value must be combined with the webpage structure features to determine whether the webpage is of the news type. In this case, if the webpage structure features also contain news-related information (for example, the webpage title contains "news"), the type of the webpage can be judged to be news.
When the calculated news type propensity value is less than the second news type threshold, the webpage can be directly deemed unrelated to the news type.
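The two-threshold criterion can be written down compactly as in the sketch below. The default thresholds 85 and 60 are the example values from the description; the function name `is_news` is hypothetical, and treating a score between the thresholds without news-related structure information as "not news" is an assumption, since the description does not spell out that case.

```python
def is_news(score, structure_has_news_info, first=85, second=60):
    """Decide whether a page is news from its 0-100 propensity score.

    `structure_has_news_info` says whether the structure features contain
    news-related information (for example, "news" in the page title).
    """
    if first <= second:
        raise ValueError("the first threshold must exceed the second")
    if score > first:
        return True  # highly related: structure features are not consulted
    if score > second and structure_has_news_info:
        return True  # related, and confirmed by the page structure
    return False
```

With the example values, a score of 90 is accepted outright, a score of 70 is accepted only when the structure features agree, and a score of 50 is rejected regardless of structure.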
In embodiments of the present invention, for webpages of the news type, the final recognition accuracy can exceed 95%, with a recall rate above 80%.
Although embodiments of the present invention are described in detail above using the news type as an example, those skilled in the art will appreciate that, based on the above teaching, the webpage types to which embodiments of the present invention apply are not limited to the news type, and can include multiple types such as the knowledge question-and-answer type, the forum discussion type or the online transaction type.
In the above method flow, there is no strict requirement on the execution order of step 101 and step 102. In fact, step 101 and step 102 can be executed simultaneously, step 101 can be executed before step 102, or step 101 can be executed after step 102.
Moreover, after the webpage type has been recognized through the above flow, various applications can be performed based on the recognized type.
For example, based on the recognized webpage type, the advertisement relevance of the webpage can be calculated; personalized news recommendation can be performed for the webpage; structured webpage data can be extracted from the webpage; or data screening for reading applications can be performed for the webpage, and so on.
Based on the above detailed analysis, an exemplary flow of the present invention is described below, taking the task of determining whether a webpage is of the news type as an example.
Fig. 2 is an exemplary flowchart of a webpage type recognition method according to an embodiment of the present invention.
As shown in Fig. 2, the processing of a webpage has two branches. The left branch includes steps 201, 202 and 203; the right branch includes steps 204 and 205. The two branches converge at step 206. The left branch includes:
Step 201: Perform data filtering. To suppress webpage noise, only the longer sentences in the webpage are extracted as text; that is, sentences whose whole-sentence length is less than a predetermined value are filtered out of the text content to improve the classification result.
Step 202: Segment the text into words using a feature-set dictionary, then calculate the weight of each word feature (using the feature set and a feature weight calculation method such as TF-IDF) to form a feature vector.
Step 203: Feed the feature vector into the classifier to obtain an output value (ranging from 0 to 100 points), namely the news content type propensity value, which represents the degree to which the content tends to be news. The classifier can be obtained in advance from training samples and features using the logistic regression algorithm.
The right branch includes:
Step 204: Build the DOM tree. The DOM tree is built from the HTML tags of the webpage and includes information such as the attributes of each tag.
Step 205: Extract structure-type features, such as secondary navigation and text-to-link ratio, based on the DOM tree.
The left and right branches converge at step 206: combined judgment. Using the output of step 203 and the output of step 205, a preset strategy is applied to make the best judgment as to whether the page is a news content page.
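The two-branch flow of Fig. 2 can be outlined as below. This is only an orchestration sketch: `classify_page` is a hypothetical name, and the three callables passed in stand for the content branch (steps 201 to 203), the structure branch (steps 204 and 205) and the combined judgment (step 206). Running the branches on a thread pool reflects the statement that the content and structure steps need not run in a fixed order.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_page(url, html, score_content, extract_structure, decide):
    """Run both branches independently, then combine their outputs.

    score_content(html)          -> propensity value (steps 201-203)
    extract_structure(url, html) -> structure features (steps 204-205)
    decide(score, features)      -> final judgment (step 206)
    All three callables are supplied by the caller.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        content_future = pool.submit(score_content, html)
        structure_future = pool.submit(extract_structure, url, html)
        return decide(content_future.result(), structure_future.result())
```

Because the branches share no state, they could equally be run one after the other in either order with the same result.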
Based on the above detailed description, embodiments of the present invention also propose a webpage type recognition system.
Fig. 3 is a structural diagram of a webpage type recognition system according to an embodiment of the present invention.
As shown in Fig. 3, the system includes a content type propensity value computing unit 301, a structure feature extraction unit 302 and a type recognition unit 303.
The content type propensity value computing unit 301 is configured to calculate the content type propensity value of a webpage according to the text content of the webpage;
the structure feature extraction unit 302 is configured to extract the webpage structure features of the webpage;
the type recognition unit 303 is configured to recognize the type of the webpage using the content type propensity value and the webpage structure features.
In one embodiment, the system further includes a type processing unit (not shown in the figure). The type processing unit is configured to perform at least one of the following: calculating the advertisement relevance of the webpage based on the recognized webpage type; performing personalized news recommendation for the webpage based on the recognized webpage type; extracting structured webpage data from the webpage based on the recognized webpage type; or performing data screening for reading applications for the webpage based on the recognized webpage type.
Specifically, the content type propensity value computing unit 301 is configured to segment the text content of the webpage into words using a dictionary, calculate the weights of the word features to form a feature vector, and calculate the content type propensity value of the feature vector with a preset webpage content classifier.
Preferably, the content type propensity value computing unit 301 is further configured to filter out of the text content, before the text content of the webpage is segmented using the dictionary, sentences whose whole-sentence length is less than a predetermined value.
Specifically, the structure feature extraction unit 302 is configured to build the Document Object Model (DOM) tree of the webpage and extract the webpage structure features from the DOM tree.
In one embodiment, the content type propensity value computing unit 301 is configured to calculate the news type propensity value of the webpage according to the text content of the webpage. In this case, the type recognition unit 303 is configured to perform at least one of the following: when the news type propensity value is greater than a preset first news type threshold, directly judging the type of the webpage to be news; or when the news type propensity value is greater than a preset second news type threshold and the webpage structure features contain news-related information, judging the type of the webpage to be news. The first news type threshold is greater than the second news type threshold.
Similarly, the webpage types to which the webpage type recognition system in embodiments of the present invention applies are not limited to the news type, and can include the knowledge question-and-answer type, the forum discussion type, the online transaction type, and so on.
In summary, in embodiments of the present invention the content type propensity value of a webpage is calculated according to its text content, the webpage structure features are extracted, and the type of the webpage is then recognized using the content type propensity value and the webpage structure features. The webpage is thus classified along two dimensions, one based on text content and the other based on webpage structure, and the class of the webpage is determined by combining the classification results of the two dimensions. Because embodiments of the present invention consider both the text content dimension and the webpage structure dimension when classifying a webpage, the classification is more accurate.
Moreover, in embodiments of the present invention, data filtering effectively removes noise in the webpage that is unrelated to type recognition, such as tags, links and advertisements, which further improves the classification result.
The above are only preferred embodiments of the present invention and are not intended to limit its protection scope. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (15)
1. A webpage type recognition method, characterized in that the method comprises:
calculating a content type propensity value of a webpage according to the text content of the webpage;
extracting webpage structure features of the webpage;
recognizing the type of the webpage using the content type propensity value and the webpage structure features;
wherein calculating the content type propensity value of the webpage according to the text content of the webpage is specifically: calculating a news type propensity value of the webpage according to the text content of the webpage; and wherein:
recognizing the type of the webpage using the news type propensity value and the webpage structure features specifically comprises at least one of the following steps:
when the news type propensity value is greater than a preset first news type threshold, directly judging the type of the webpage to be news; or
when the news type propensity value is greater than a preset second news type threshold and the webpage structure features contain news-related information, judging the type of the webpage to be news;
wherein the first news type threshold is greater than the second news type threshold.
2. The webpage type recognition method according to claim 1, characterized in that the method further comprises at least one of the following steps:
calculating the advertisement relevance of the webpage based on the recognized webpage type;
performing personalized news recommendation for the webpage based on the recognized webpage type;
extracting structured webpage data from the webpage based on the recognized webpage type; or
performing data screening for reading applications for the webpage based on the recognized webpage type.
3. The webpage type recognition method according to claim 1, characterized in that calculating the content type propensity value of the webpage according to the text content of the webpage specifically comprises:
segmenting the text content of the webpage into words using a dictionary, and calculating the weights of the word features to form a feature vector;
calculating the content type propensity value of the feature vector with a preset webpage content classifier.
4. The webpage type recognition method according to claim 3, characterized in that, before the text content of the webpage is segmented using the dictionary, the method further comprises: filtering out of the text content sentences whose whole-sentence length is less than a predetermined value.
5. The webpage type recognition method according to claim 3, characterized in that calculating the weights of the word features is: calculating the weights of the word features using the term frequency-inverse document frequency (TF-IDF) weighting algorithm.
6. The webpage type recognition method according to claim 3, characterized in that, in the method:
the webpage content classifier calculates the content type propensity value of the feature vector using the logistic regression classification algorithm.
7. The webpage type recognition method according to claim 1, characterized in that extracting the webpage structure features of the webpage specifically comprises:
building the Document Object Model (DOM) tree of the webpage;
extracting the webpage structure features from the DOM tree.
8. The webpage type recognition method according to claim 7, characterized in that the webpage structure features include at least one of the following:
secondary navigation information;
text-to-link ratio;
Uniform Resource Locator (URL);
webpage title;
maximum text length; or
longest continuous text ratio.
9. The webpage type recognition method according to claim 1, characterized in that the type of the webpage includes the news type, the knowledge question-and-answer type, the forum discussion type or the online transaction type.
10. A webpage type recognition system, characterized in that the system comprises a content type propensity value computing unit, a structure feature extraction unit and a type recognition unit, wherein:
the content type propensity value computing unit is configured to calculate a content type propensity value of a webpage according to the text content of the webpage;
the structure feature extraction unit is configured to extract webpage structure features of the webpage;
the type recognition unit is configured to recognize the type of the webpage using the content type propensity value and the webpage structure features;
wherein calculating the content type propensity value of the webpage according to the text content of the webpage is specifically: calculating a news type propensity value of the webpage according to the text content of the webpage; and wherein:
recognizing the type of the webpage using the news type propensity value and the webpage structure features specifically comprises at least one of the following steps:
when the news type propensity value is greater than a preset first news type threshold, directly judging the type of the webpage to be news; or
when the news type propensity value is greater than a preset second news type threshold and the webpage structure features contain news-related information, judging the type of the webpage to be news;
wherein the first news type threshold is greater than the second news type threshold.
11. The webpage type recognition system according to claim 10, characterized in that the system further comprises a type processing unit configured to perform at least one of the following steps:
calculating the advertisement relevance of the webpage based on the recognized webpage type;
performing personalized news recommendation for the webpage based on the recognized webpage type;
extracting structured webpage data from the webpage based on the recognized webpage type; or
performing data screening for reading applications for the webpage based on the recognized webpage type.
12. The webpage type recognition system according to claim 10, characterized in that
the content type propensity value computing unit is configured to segment the text content of the webpage into words using a dictionary, calculate the weights of the word features to form a feature vector, and calculate the content type propensity value of the feature vector with a preset webpage content classifier.
13. The webpage type recognition system according to claim 10, characterized in that
the content type propensity value computing unit is further configured to filter out of the text content, before the text content of the webpage is segmented using a dictionary, sentences whose whole-sentence length is less than a predetermined value.
14. The webpage type recognition system according to claim 10, characterized in that
the structure feature extraction unit is configured to build the Document Object Model (DOM) tree of the webpage and extract the webpage structure features from the DOM tree.
15. The webpage type recognition system according to claim 10, characterized in that the type of the webpage includes the news type, the knowledge question-and-answer type, the forum discussion type or the online transaction type.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210058024.3A CN103309862B (en) | 2012-03-07 | 2012-03-07 | Webpage type recognition method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210058024.3A CN103309862B (en) | 2012-03-07 | 2012-03-07 | Webpage type recognition method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103309862A CN103309862A (en) | 2013-09-18 |
CN103309862B true CN103309862B (en) | 2017-05-17 |
Family ID=49135101
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210058024.3A Active CN103309862B (en) | 2012-03-07 | 2012-03-07 | Webpage type recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103309862B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103544310B (en) * | 2013-11-04 | 2017-08-08 | 北京中搜云商网络技术有限公司 | Classifier-based information classification method for shopping-guide webpages |
CN104021180B (en) * | 2014-06-09 | 2017-10-24 | 南京航空航天大学 | Modular software defect report classification method |
CN104090931A (en) * | 2014-06-25 | 2014-10-08 | 华南理工大学 | Information prediction and acquisition method based on webpage link parameter analysis |
CN105373570B (en) * | 2014-09-02 | 2020-09-15 | 中兴通讯股份有限公司 | Management method and terminal for browser history records |
CN106557517A (en) * | 2015-09-29 | 2017-04-05 | 百度在线网络技术(北京)有限公司 | Website classification management method and device |
CN105302884B (en) * | 2015-10-19 | 2019-02-19 | 天津海量信息技术股份有限公司 | Webpage mode identification method and visual structure learning method based on deep learning |
WO2018053863A1 (en) * | 2016-09-26 | 2018-03-29 | Microsoft Technology Licensing, Llc | Identifying video pages |
CN108255891B (en) * | 2016-12-29 | 2020-08-28 | 北京国双科技有限公司 | Method and device for judging webpage type |
CN108345599B (en) * | 2017-01-23 | 2021-12-14 | 阿里巴巴集团控股有限公司 | Webpage type determination method and device and computer readable medium |
CN108881138B (en) * | 2017-10-26 | 2020-06-26 | 新华三信息安全技术有限公司 | Webpage request identification method and device |
CN110110075A (en) * | 2017-12-25 | 2019-08-09 | 中国电信股份有限公司 | Web page classification method, device and computer readable storage medium |
CN110427755A (en) * | 2018-10-16 | 2019-11-08 | 新华三信息安全技术有限公司 | Method and device for identifying script files |
CN110781925B (en) * | 2019-09-29 | 2023-03-10 | 支付宝(杭州)信息技术有限公司 | Software page classification method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101178714A (en) * | 2006-12-20 | 2008-05-14 | 腾讯科技(深圳)有限公司 | Web page classification method and device |
CN101251855A (en) * | 2008-03-27 | 2008-08-27 | 腾讯科技(深圳)有限公司 | Equipment, system and method for cleaning internet web page |
CN101609450A (en) * | 2009-04-10 | 2009-12-23 | 南京邮电大学 | Web page classification method based on training set |
CN101727500A (en) * | 2010-01-15 | 2010-06-09 | 清华大学 | Text classification method of Chinese web page based on stream clustering |
CN101872347A (en) * | 2009-04-22 | 2010-10-27 | 富士通株式会社 | Method and device for judging type of webpage |
Non-Patent Citations (2)
Title |
---|
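"Web page classification: Features and algorithms"; Xiaoguang Qi et al.; ACM Computing Surveys (CSUR); 2009-02-28; Vol. 41, No. 2; pp. 1-31 * |
"Research on Automatic Classification Technology for Chinese Web Pages Based on Structural Information" (基于结构信息的中文网页自动分类技术研究); Liu Xin (刘欣); China Master's Theses Full-text Database, Information Science and Technology; 2011-06-15; No. 6; Chapter 3, Figs. 3.4 and 3.5 * |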
Also Published As
Publication number | Publication date |
---|---|
CN103309862A (en) | 2013-09-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103309862B (en) | Webpage type recognition method and system | |
CN107193959B (en) | Pure text-oriented enterprise entity classification method | |
CN103870973B | Information push and search method and device based on keyword extraction from electronic information | |
CN102332028B (en) | Webpage-oriented unhealthy Web content identifying method | |
KR101741509B1 (en) | Device and method for analyzing corporate reputation by data mining of news, recording medium for performing the method | |
Anderka et al. | Predicting quality flaws in user-generated content: the case of wikipedia | |
CN102929937B | Data processing method for commodity classification based on a text topic model | |
CN103544255A (en) | Text semantic relativity based network public opinion information analysis method | |
CN101609450A (en) | Web page classification method based on training set | |
WO2009096523A1 (en) | Information analysis device, search system, information analysis method, and information analysis program | |
CN103914478A (en) | Webpage training method and system and webpage prediction method and system | |
Barakat et al. | Applying deep learning models to twitter data to detect airport service quality | |
CN103324666A (en) | Topic tracing method and device based on micro-blog data | |
Wang et al. | Customer-driven product design selection using web based user-generated content | |
US20200004792A1 (en) | Automated website data collection method | |
US11893537B2 (en) | Linguistic analysis of seed documents and peer groups | |
CN111160019B (en) | Public opinion monitoring method, device and system | |
CN107885883A | Social-media-based sentiment analysis method and system for the macroeconomic field | |
CN106933800A | Event sentence extraction method for the financial field | |
CN110287314A (en) | Long text credibility evaluation method and system based on Unsupervised clustering | |
CN109710725A | Chinese table column label recovery method and system based on text classification | |
Sivakumar | Effectual web content mining using noise removal from web pages | |
Verberne et al. | Automatic thematic classification of election manifestos | |
CN114997288A (en) | Design resource association method | |
Vavpetič et al. | Semantic data mining of financial news articles |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||