CN103544210A - System and method for identifying webpage types - Google Patents

Publication number
CN103544210A
Authority
CN
China
Prior art keywords
webpage
feature
url
type
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310391961.5A
Other languages
Chinese (zh)
Other versions
CN103544210B (en)
Inventor
李海燕
王海洋
刘大伟
刘玮
余智华
隋雪青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yantai Zhong Ke Network Technical Institute
Original Assignee
Yantai Zhong Ke Network Technical Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yantai Zhong Ke Network Technical Institute
Priority to CN201310391961.5A
Publication of CN103544210A
Application granted
Publication of CN103544210B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The invention relates to the field of network information retrieval and mining, and in particular to a system and method for identifying webpage types. The method comprises the following steps: heuristic rules are predefined and a heuristic rule list is generated; predetermined features are extracted from training webpages to form a standardized feature vector, which is optimized twice to form a simplified feature set; a classifier and a feature extractor are built, and a classification model is generated by the classifier; based on the URL and source code of a webpage to be identified, rule matching is carried out against the heuristic rule list; if matching succeeds, the webpage type of the webpage to be identified is output; if matching fails, the classifier is used to classify the webpage type of the webpage to be identified. The system and method are flexible and convenient to use, identify webpages quickly and accurately, require no major changes when identifying cross-language webpages, and have high practical value.

Description

A system and method for identifying webpage types
Technical field
The present invention relates to the field of network information retrieval and mining, and in particular to a system and method for identifying webpage types.
Background art
As the amount of information on the network grows, it is sometimes difficult to retrieve the documents a user wants through a search engine, and how a search engine presents its results to the user is drawing increasing attention. Most traditional search systems return a large set of web documents that match the user's query. However, the high recall and low precision of the result documents make it increasingly difficult to find information that is useful to the user. In recent years, researchers have studied methods of arranging documents by subject extensively and have obtained good results. However, although documents can be classified successfully by topic, each topic still contains many different webpage genres or types. For example, the documents classified under the topic "NBA match" include home pages, news pages, picture pages and so on, yet some users only want to see news pages about "NBA match", and others only want to see forum pages about it. Therefore, besides the topic, the genre or type of a document can be regarded as a second view of the document, and it has become another criterion by which search engine users classify webpages.
In addition, classifying webpages by webpage type is also useful in network public opinion monitoring systems. With the development of Internet technology, the network has gradually replaced traditional information media such as newspapers, radio and television and has become an indispensable new medium in people's lives, taking on the role of transmitting and spreading information quickly. Domestic and international events alike are published on the network at very high speed, and netizens use the network to state their opinions, views and suggestions on public events, hot spots and focal issues, thereby forming network public opinion. With its unprecedented speed of propagation, network public opinion has become a gathering place for the expression of public opinion. For government and similar departments, network public opinion matters for monitoring and guiding popular sentiment in a timely manner, and for maintaining the long-term stable development of the country and the harmony and stability of society; negative public opinion needs to be guided and defused promptly and effectively, so as to eliminate threats to social safety and maintain stable social development. At present, the four main carriers of network public opinion are news, forums (bbs), blogs and microblogs (weibo). A network public opinion monitoring system monitors, within a certain range of time and space, the emergence and development of a social event and the set of views and attitudes that the public holds about the event on the network. It mainly uses an acquisition system to collect the massive information on the Internet in real time, then extracts the body content of the webpages, and finally analyses and processes the information intelligently, thereby realizing functions such as identification of public opinion hot spots, topic tracking, mining of sensitive topics, analysis of public opinion trends, public opinion early warning and sentiment classification. For public opinion carriers, the work mainly consists of collecting the body pages of news sites, forums and blogs and extracting information from them. Existing webpage information extraction techniques are varied; however, because the body pages of news sites, forums and blogs each have their own structural characteristics, no single algorithm suits all network public opinion carriers. When processing different types of public opinion carrier, an algorithm suited to each type must therefore be selected, so that the accuracy of information extraction is guaranteed and the monitoring system's requirement for accurate data processing is met. Accurate identification of the carrier type is therefore essential. At present, some public opinion systems identify the webpage type by manual matching. However, as the number of websites grows and webpage entry addresses (url, Uniform Resource Locator) change frequently, manual processing becomes extremely inefficient when millions of websites have to be handled, so automatic identification of webpage type is particularly important.
In recent years, the automatic identification of the genre and type of web documents has drawn increasing attention, and research on it has produced considerable results. Xu Shiming's "An efficient SVM Chinese webpage classifier based on pre-classification" holds that parts such as the webpage title and keywords carry higher weight for the classification result, and proposes pre-classifying pages according to a preset keyword list and the title content; however, the webpage features it uses are not comprehensive enough. Zheng Dequan's "Research on blog webpage classification and identification technology" analyses the features of blog webpages and proposes identifying blog pages by computing similarity from webpage structure and keywords, but it requires manual participation to build a standard blog webpage set, so its effectiveness is relatively low from the perspective of practical engineering application. Zhang Cheng's "Automatic identification of blog webpages based on dom tree structure" proposes an automatic blog page identification algorithm that performs pattern matching on dom (document object model) trees containing timestamps. Lei Dong's "An Examination of Genre Attributes for Web Page Classification" focuses on features such as the text content, form types and functional tags of webpages, and proposes an automatic identification technique for page types such as news and e-commerce. Hu Xuegang's "Research on relevant features for automatic identification of news webpages" proposes combining webpage url features, structural features and high-frequency words in the content as identification features for news pages. Yang Yuhang's "Blog research" proposes identifying blog pages from the date in the page header, keywords and some other features. Tomoyuki's "Automatically Collecting, Monitoring, and Mining Japanese Weblogs" proposes using the sequentially arranged calendar that most blog pages have as the main feature for identifying blog pages. These methods either consider only some aspects of webpage features or, depending on the application background, consider features exclusive to one webpage type. Although all of them achieve good classification results, the webpage types they identify are limited to particular types, so in practical engineering applications they cannot meet the requirement of identifying the type of webpage carriers. Moreover, as network technology develops, the failure of some features may cause the whole identification process to fail; for example, the blog page calendar is absent on some websites.
In addition, patent CN101872347A, entitled "Method and apparatus for judging webpage type", first matches against a url rule list and, if the match fails, extracts the url date feature, the meta, rss feed and atom feed features, text features, link features and anchor text of the webpage, and the number of times repeated patterns occur. That method combines a rule-based identification scheme with a statistical learning method, but the tag features, text features, structural features and so on that it considers are neither comprehensive nor complete, and features such as meta, rss feed and atom feed do not help to distinguish every webpage type. Moreover, for some webpage types the rule identification scheme has no discriminating power. For example, when distinguishing the bbs list page "http://bbs.tianya.cn/list-no04-1.shtml" from the bbs body page "http://bbs.tianya.cn/post-no04-2300663-1.shtml", or the microblog page "http://sd.sina.com.cn/news/shenbian/2013-06-24/11382535.html" from the news page "http://sd.sina.com.cn/news/sdyw/2013-06-24/070827368.html", the site information "bbs.tianya.cn" and "sd.sina.com.cn" does not discriminate in rule matching, and classification is required. Rule matching in that scheme is meaningful only after, for every website, the webpages of every column containing the webpage types to be distinguished have been used as training samples, classified, and used to build the rule list. Once the webpage types to be distinguished are added to or changed, the whole rule list must be retrained and updated. Faced with millions of websites and hundreds of millions of columns in practical engineering, the method has low feasibility and high maintenance cost. Consequently, if the rule matching scheme identifies webpage types when the training set is not broad and comprehensive, it either has no effect or produces irreversible identification errors.
In addition, patent CN103020067A, entitled "A method and apparatus for determining webpage type", proposes forming a feature vector from the n-grams of all queries for which the webpage to be identified was clicked in the search log, and identifying the type from the correlation between vectors. That scheme differs from the present invention in its practical application scenario and background knowledge; different application scenarios need different background knowledge as support. The present invention operates when only the url and the webpage text of the webpage are available, so the technical routes of the two also differ.
From the perspective of practical engineering application, whether a user retrieves webpages of a particular type on a topic of interest through a search engine or performs some operation of interest through a network public opinion monitoring system, the requirements on the accuracy and timeliness of the results are very high. To date, methods for identifying webpage types, or for identifying one particular webpage type, are based either on manually set heuristic rule features or on statistical machine learning, and they can work well in a specific application scenario where the corresponding background knowledge is available. Heuristic rule methods identify pages using manually set rules that have discriminating power; they are somewhat faster, but their precision is often not high enough and they do not generalize well. Classification-based machine learning methods work from statistics over a large number of training samples; when the features and the classifier are chosen appropriately, both precision and speed can reach the desired level. The classification precision of a classifier depends to a large extent on the quality of the features. The features currently used for webpage type identification mainly comprise webpage url features and some basic features of the webpage content itself, such as tag features. Some foreign literature, when processing English webpages, also extracts features by applying natural language analysis to the plain text. Although this works well, methods that rely on natural language processing such as word segmentation, part-of-speech tagging and grammatical analysis are usually language-dependent. For example, word segmentation, part-of-speech, grammatical and semantic analysis differ completely between Chinese and English (English needs no word segmentation), and Chinese is considerably more complex. When such methods are applied to cross-language webpages, a large part of the algorithm has to be changed, and efficiency is comparatively low.
Summary of the invention
The technical problem to be solved by the present invention is to provide a system and method for identifying webpage types that solve the following problems of the prior art: poor results when webpage types are identified from heuristic rules alone, inappropriate feature selection for the classifier, and in particular the large changes required and the low efficiency when identifying cross-language webpages.
The technical solution by which the present invention solves the above technical problem is as follows: a method for identifying webpage types, comprising the following steps:
(1) pre-defining heuristic rules for one or more specific webpage types and generating a heuristic rule list, each heuristic rule corresponding to one webpage type;
(2) selecting training webpages, extracting predefined predetermined features from the training webpages to form a standardized feature vector, optimizing the standardized feature vector twice to form a simplified feature set, and building a classifier and a feature extractor based on the simplified feature set, wherein the classifier uses the simplified feature set to generate a classification model for determining the webpage type of a webpage to be identified, and the feature extractor uses the simplified feature set to fix the set features to be extracted from the webpage to be identified;
(3) performing rule matching against the heuristic rule list based on the uniform resource locator (URL) and source code of the webpage to be identified; if the URL and source code of the webpage to be identified satisfy the condition defined by a heuristic rule, the rule match succeeds and the method proceeds to step (4); otherwise, the method proceeds to step (5);
(4) outputting the webpage type of the webpage to be identified according to the matched rule;
(5) inputting the URL and source code of the webpage to be identified into the feature extractor, which extracts the set features of the webpage to be identified; the classifier classifies the webpage type of the webpage to be identified according to the extracted set features and the classification model, and outputs the webpage type of the webpage to be identified.
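The control flow of steps (3) to (5) can be sketched as follows. This is a minimal illustration, not the patented implementation: the rule format (a compiled URL pattern paired with a type label), the example hosts, the type labels and the names `RULES`, `match_rules` and `identify` are all assumptions made for the example, and the feature extractor and classifier are passed in as plain callables.

```python
import re
from typing import Callable, Optional

# Hypothetical rule list: (compiled URL pattern, webpage type).
# The text allows rules to inspect the source code as well; this
# sketch only looks at the URL to stay short.
RULES = [
    (re.compile(r"^https?://bbs\.example\.com/post-"), "forum_body"),
    (re.compile(r"^https?://blog\.example\.com/"), "blog_body"),
]


def match_rules(url: str, source: str) -> Optional[str]:
    """Step (3): return the type of the first matching rule, else None."""
    for pattern, page_type in RULES:
        if pattern.search(url):
            return page_type
    return None


def identify(url: str, source: str,
             extract: Callable[[str, str], list],
             classify: Callable[[list], str]) -> str:
    """Steps (3)-(5): try the rule list first, fall back to the classifier."""
    matched = match_rules(url, source)
    if matched is not None:
        return matched                      # step (4): rule decides the type
    return classify(extract(url, source))   # step (5): classifier decides
```

Keeping the rule list as data rather than code reflects the flexibility the text claims for heuristic rules: entries can be added or removed without touching the classifier.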
A system for identifying webpage types, comprising:
a rule memory for storing the heuristic rule list;
a rule matcher for performing rule matching against the rule list based on the uniform resource locator (URL) and source code of the webpage to be identified; if the URL and source code of the webpage to be identified satisfy the condition defined by a heuristic rule, the rule match succeeds, and the webpage type of the webpage to be identified is output according to the successfully matched rule;
a feature processor for extracting predefined predetermined features from training webpages to form a standardized feature vector, optimizing the standardized feature vector twice to form a simplified feature set, and building a classifier and a feature extractor based on the simplified feature set;
a feature extractor for extracting the set features from the webpage to be identified according to the simplified feature set when the rule matching performed by the rule matcher does not succeed;
a classifier for generating a classification model and outputting the webpage type of the webpage to be identified according to the classification model and the set features extracted by the feature extractor.
The present invention uses a heuristic rule matcher and a classifier in combination. The heuristic rule matcher uses manually defined heuristic rules to determine the webpage type. For webpages whose type features are clear or that follow a specific rule, rule matching is fast and accurate, and the content of the heuristic rules can be changed at any time for different webpage types, which gives great flexibility. Webpages whose features are not distinctive, or that follow no specific rule, can be identified directly by the machine learning classifier. The present invention covers six types of feature, and specific feature values are defined for each type. All feature values can be extracted in a single traversal of the nodes of the webpage, which ensures extraction speed. Users can define and modify the specific features and feature values used, which gives great flexibility. The definitions of most feature types do not depend on the language, so they suit cross-language environments. Feature optimization in two stages yields a better final feature set, which ensures the quality of the extracted features. Therefore, in terms of recognition speed, recognition accuracy and cross-language applicability, the identification system and method of the present invention can meet practical engineering requirements.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method for identifying webpage types according to the present invention;
Fig. 2 is a schematic flowchart of extracting the predetermined features of training webpages according to the present invention;
Fig. 3 is a schematic flowchart of optimizing the feature vector and building the classifier according to the present invention;
Fig. 4 is a schematic flowchart of classifying a webpage to be identified by the classifier according to the present invention;
Fig. 5 is a schematic structural diagram of the system for identifying webpage types according to the present invention.
Detailed description of the embodiments
The principles and features of the present invention are described below with reference to the accompanying drawings. The examples serve only to explain the present invention and are not intended to limit its scope.
Fig. 1 is a schematic flowchart of the method for identifying webpage types in this embodiment. As shown in Fig. 1, the method comprises the following steps:
Where specific background knowledge is available for a specific application scenario, heuristic rules are pre-defined for one or more specific webpage types and a heuristic rule list is generated. The heuristic rule list is stored in the rule memory, and each heuristic rule corresponds to a unique webpage type. The content of the heuristic rules is defined differently for different webpage types. A rule definition must fully fit the characteristics of its webpage type; if a rule is ambiguous, it is removed. If no webpage type has an unambiguous heuristic rule that fully fits it, the heuristic rule method is skipped and the webpage to be identified is identified directly by the machine learning method.
Training webpages are selected. The feature processor extracts predefined predetermined features from the training webpages to form a standardized feature vector, optimizes the standardized feature vector twice to form a simplified feature set, and builds a classifier and a feature extractor based on the simplified feature set. The classifier uses the simplified feature set to generate a classification model for determining the webpage type of a webpage to be identified. The feature extractor uses the simplified feature set to fix the set features to be extracted from the webpage to be identified.
In this embodiment, the predetermined features are URL features extracted from the URL character strings of the training webpages and of the webpage to be identified, and/or webpage features extracted from the nodes of the dom tree corresponding to the webpage source code; see Fig. 2 for details.
The rule matcher matches the URL and source code of the webpage to be identified against the user-predefined heuristic rule list. If the URL and source code of the webpage to be identified satisfy the condition defined by a heuristic rule, the rule match succeeds, and the webpage type of the webpage to be identified is output according to the matched rule.
If rule matching does not succeed, or if no webpage type has an unambiguous heuristic rule that fully fits it, the URL and source code of the webpage to be identified are input into the feature extractor, which extracts the set features from the URL and source code of the webpage to be identified. The classifier then classifies the webpage type of the webpage to be identified according to the classification model and the set features, and outputs the webpage type of the webpage to be identified.
Fig. 2 is a schematic flowchart showing how the feature processor in this embodiment extracts the predetermined features of training webpages and forms the feature vector. Unless otherwise specified, the feature values of each feature in this embodiment are defined using the identification of news, forum and blog body pages as the example. In practical engineering applications, which features and feature values to use can be set and changed manually according to the actual background knowledge and the webpage types to be identified. The procedure comprises the following steps:
Step S210: select training webpages and extract the URL features from the URL character string of each training webpage. If the URL ends with "/", the URL character string is the string between the leading "http://" and the final "/"; if the URL does not end with "/", the URL character string is everything after the leading "http://".
Preferably, the URL features comprise any one or more of the following:
URL depth value: the number of "/" characters in the URL character string plus 1;
URL full-stop count: the number of "." characters in the part of the URL character string before the first "/";
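The URL character string, the depth value and the full-stop count can be computed with a few string operations. The sketch below is illustrative only: the function names are hypothetical, a single trailing "/" is dropped when present, and stripping an "https://" prefix in addition to "http://" is an assumption that goes beyond the text.

```python
def url_string(url: str) -> str:
    """Return the URL character string: no scheme prefix, no trailing '/'."""
    # The text only mentions "http://"; handling "https://" is an assumption.
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    if url.endswith("/"):
        url = url[:-1]
    return url


def url_depth(url: str) -> int:
    """URL depth value: number of '/' in the URL character string plus 1."""
    return url_string(url).count("/") + 1


def url_dot_count(url: str) -> int:
    """URL full-stop count: number of '.' before the first '/'."""
    host = url_string(url).split("/", 1)[0]
    return host.count(".")
```

For the forum list page quoted in the background section, `url_depth("http://bbs.tianya.cn/list-no04-1.shtml")` is 2 and `url_dot_count` is 2, because the dot in ".shtml" lies after the first "/" and is not counted.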
URL date feature value: indicates whether the URL character string contains a date string that matches a date regular expression. Several date regular expressions commonly found in the urls of body pages are as follows:
{"[0-9]{4}-[0-9]{2}-[0-9]{2}", "[0-9]{4}[0-9]{2}[0-9]{2}", "[0-9]{4}-[0-9]{2}/[0-9]{2}", "[0-9]{4}-[0-9]{2}", "[0-9]{4}[0-9]{2}", "[0-9]{4}/[0-9]{2}-[0-9]{2}", "[0-9]{4}/[0-9]{2}[0-9]{2}", "[0-9]{4}/[0-9]{2}/[0-9]{2}", "[0-9]{4}/[0-9]{2}", "[0-9]{4}_[0-9]{2}/[0-9]{2}"}
The corresponding time templates are, respectively, {YYYY-MM-DD, YYYYMMDD, YYYY-MM/DD, YYYY-MM, YYYYMM, YYYY/MM-DD, YYYY/MMDD, YYYY/MM/DD, YYYY/MM, YYYY_MM/DD}, where YYYY is a four-digit year such as 1999, YY is a two-digit year such as 03, MM is a two-digit month such as 05, and DD is a two-digit day such as 31. Templates and regular expressions can be added or removed as experience dictates. The url character string is searched for date strings that match the above regular expressions. Each matching date string is extracted and checked for validity; if it is not a valid date, the search continues with the next string that matches a date regular expression and the check is repeated. If a valid date string is found, the feature value is set to 1; if none is found, the feature value is set to 0.
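The date feature can be sketched with the ten regular expressions listed above. Several details here are assumptions rather than part of the text: the names `DATE_PATTERNS`, `is_legal_date` and `url_date_feature`, the use of a lookahead so that overlapping candidates inside a long digit run are all examined, the default day of 1 for templates without a day, and the plausible-year window of 1990 to 2100 used as part of the validity check.

```python
import re
from datetime import date

# The ten date templates from the text, with named groups added so the
# year, month and (optional) day can be validated.  Wrapping each one in
# a lookahead lets finditer report overlapping candidates.
DATE_PATTERNS = [re.compile("(?=" + p + ")") for p in (
    r"(?P<y>[0-9]{4})-(?P<m>[0-9]{2})-(?P<d>[0-9]{2})",   # YYYY-MM-DD
    r"(?P<y>[0-9]{4})(?P<m>[0-9]{2})(?P<d>[0-9]{2})",     # YYYYMMDD
    r"(?P<y>[0-9]{4})-(?P<m>[0-9]{2})/(?P<d>[0-9]{2})",   # YYYY-MM/DD
    r"(?P<y>[0-9]{4})-(?P<m>[0-9]{2})",                   # YYYY-MM
    r"(?P<y>[0-9]{4})(?P<m>[0-9]{2})",                    # YYYYMM
    r"(?P<y>[0-9]{4})/(?P<m>[0-9]{2})-(?P<d>[0-9]{2})",   # YYYY/MM-DD
    r"(?P<y>[0-9]{4})/(?P<m>[0-9]{2})(?P<d>[0-9]{2})",    # YYYY/MMDD
    r"(?P<y>[0-9]{4})/(?P<m>[0-9]{2})/(?P<d>[0-9]{2})",   # YYYY/MM/DD
    r"(?P<y>[0-9]{4})/(?P<m>[0-9]{2})",                   # YYYY/MM
    r"(?P<y>[0-9]{4})_(?P<m>[0-9]{2})/(?P<d>[0-9]{2})",   # YYYY_MM/DD
)]


def is_legal_date(year: int, month: int, day: int,
                  min_year: int = 1990, max_year: int = 2100) -> bool:
    """Validity check; the year window is an assumption of this sketch."""
    if not min_year <= year <= max_year:
        return False
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True


def url_date_feature(url_str: str) -> int:
    """Return 1 if the URL string contains a valid date string, else 0."""
    for pattern in DATE_PATTERNS:
        for match in pattern.finditer(url_str):
            groups = match.groupdict()
            day = int(groups["d"]) if groups.get("d") else 1
            if is_legal_date(int(groups["y"]), int(groups["m"]), day):
                return 1
    return 0
```

Without some bound on the year, digit runs such as article identifiers can be read as valid dates in distant centuries, which is why the sketch adds the year window; a deployment would tune or replace that check.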
Frequency of URL type feature words: the URL type feature words are predefined feature words that indicate the webpage type. Taking as the example the feature words for distinguishing four webpage types (news body pages, forum body pages, blog body pages and other page types), the URL type feature words comprise news url type feature words, forum url type feature words, blog url type feature words and fourth-class url type feature words;
Preferably, the news url type feature words comprise: "story", "article", "content", "news" and/or "xinwen"; the forum url type feature words comprise: "detail", "thread-", "viewthread", "read-", "tid", "forum", "luntan", "bbs", "tieba", "guba", "shequ", "tiezi", "huitie", "post" and/or "showtopic"; the blog url type feature words comprise: "blog", "static" and/or "boke"; the fourth-class url type feature words comprise: "node", "main", "list", "index", "more", "category", "item", "default", "brief", "catid", "specia", "data", "club", "group", "rss", "board", "formblogger", "profile", "link", "search", "login", "front", "class", "forum-", "channel", "fid" and/or "tag".
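A frequency count over the four word lists above might look like the following sketch. The text does not say how occurrences are counted, so lower-casing the URL string and counting non-overlapping substring occurrences are assumptions, as are the names `URL_TYPE_WORDS` and `url_type_word_frequencies` and the category keys.

```python
# Word lists copied from the text; the category keys are illustrative.
URL_TYPE_WORDS = {
    "news": ["story", "article", "content", "news", "xinwen"],
    "forum": ["detail", "thread-", "viewthread", "read-", "tid", "forum",
              "luntan", "bbs", "tieba", "guba", "shequ", "tiezi", "huitie",
              "post", "showtopic"],
    "blog": ["blog", "static", "boke"],
    "other": ["node", "main", "list", "index", "more", "category", "item",
              "default", "brief", "catid", "specia", "data", "club", "group",
              "rss", "board", "formblogger", "profile", "link", "search",
              "login", "front", "class", "forum-", "channel", "fid", "tag"],
}


def url_type_word_frequencies(url_str: str) -> dict:
    """Count, per category, how often its feature words occur in the URL."""
    lowered = url_str.lower()  # case-insensitive matching is an assumption
    return {
        category: sum(lowered.count(word) for word in words)
        for category, words in URL_TYPE_WORDS.items()
    }
```

Note that with substring counting some words overlap across categories ("forum" and "forum-", for instance), so a URL containing "forum-" contributes to both the forum and the fourth-class counts.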
Score of URL type feature words: because the position of a url type feature word within the url, that is, its depth, also affects type identification, the present invention scores the url type feature words. The scoring function of the URL type feature words is:
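$$
\mathrm{Score}(i) =
\begin{cases}
\dfrac{\sum_{j=1}^{D} \frac{2 E_{ij}}{D (D+1)} \log\left(\frac{2 E_{ij}}{D (D+1)}\right)}{\sum_{m=1}^{D} \frac{2 m}{D (D+1)} \log\left(\frac{2 m}{D (D+1)}\right)}, & \text{if } D \neq 1 \\[2ex]
1, & \text{if } D = 1
\end{cases}
$$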
where i denotes the i-th url type feature word, D is the total depth of the url, and j denotes the j-th depth level. The rest of this definition appears in the source only as an image (Figure BDA0000376046880000111).
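The scoring function can be evaluated numerically once the values E_ij are known. The text available here does not define E_ij, so the sketch below does not compute it; it takes the D values E_i1 ... E_iD for one feature word as input. The function name is hypothetical, and treating a term with a zero argument as contributing 0 (the usual 0 · log 0 = 0 convention) is an assumption.

```python
import math
from typing import Sequence


def url_type_word_score(e_i: Sequence[float]) -> float:
    """Evaluate Score(i) from the per-depth values E_i1..E_iD.

    len(e_i) is taken as the total URL depth D.  How E_ij is derived
    from the URL is not defined in the surviving text.
    """
    depth = len(e_i)
    if depth == 0:
        raise ValueError("at least one depth level is required")
    if depth == 1:
        return 1.0  # special case given in the formula
    norm = depth * (depth + 1)

    def term(x: float) -> float:
        p = 2.0 * x / norm
        # Assumption: a zero argument contributes nothing (0 * log 0 = 0).
        return p * math.log(p) if p > 0 else 0.0

    numerator = sum(term(e) for e in e_i)
    denominator = sum(term(m) for m in range(1, depth + 1))
    return numerator / denominator
```

The weights 2m / (D(D+1)) for m = 1..D sum to 1, and for D = 1 the single weight equals 1 so its logarithm is 0; that is why the formula needs the separate D = 1 case to avoid a zero denominator.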
Step S220: read the source code of the training webpage and convert it into a dom tree;
Step S230: traverse the nodes of the dom tree in turn and generate the webpage features of the training webpage;
Preferably, the webpage features comprise text high-frequency word features, structural features, tag features, link features and/or grammatical features, the grammatical features comprising punctuation features and sentence features;
The text high-frequency word feature is the frequency with which high-frequency feature words related to the webpage type occur in the text nodes of the dom tree corresponding to the webpage source code. The high-frequency feature words comprise news-type, blog-type and forum-type high-frequency feature words.
Preferably, the news-type high-frequency feature words comprise: "news", "text", "source", "information", "report", "daily paper", "Times", "evening paper", "it is reported", "reporter", "news", "media", "this newspaper", "special topic", "center", "editor", "channel", "important news", "current events", "responsible editor", "relevant report", "report", "source herein", "responsible editor", "keyword", "summary", "key word", "our publication", "submission", "original text source", "reporter", "issuing time", "contribution source", "contribution", "information issue", "issue", "literary composition/", "article", "focus", "gather and edit", "statement online", "copyright notice", "solemnly declare", "disclaimer", "copyright statement", "news", "editor", "story", "stories", "headline", "report", "newspaper", "xinwen", "peer link", "comment warmly", "news seniority among brothers and sisters" and/or "mobile phone is seen news";
Preferably, described blog type high-frequency characteristic word comprises: " blog ", " blog article ", " bloger ", " daily record ", " piece of writing ", " send out comment ", " add concern ", " greeting ", " send out paper slip ", " popularity ", " classification ", " filing ", " file ", " collection ", " reading ", " write message ", " label ", " classification ", " classification ", " blogger ", " subscribe to this blog ", " one piece ", " subscription ", " rich report ", " weblog ", " weblog ", " blog ", " journal ", " diary ", " postedby ", " comments ", " archive ", " boke " and/or " message ",
Preferably, described forum type high-frequency characteristic word comprises: " forum ", " community ", " mhkc ", " post ", " reply ", " quote ", " building ", " new post ", " theme ", " be published in ", " reply in ", " main subsides ", " click ", " model ", " browse ", " elite ", " shielding ", " title ", " gold coin ", " report ", " complaint ", " report ", " money order receipt to be signed and returned to the sender ", " follow-up ", " prestige ", " building-owner ", " reply the date ", " only see this author ", " keeper ", " member ", " sofa ", " stool ", " floor ", " integration ", " grade ", " deliver model ", " short-message sending ", " rank ", " edition owner ", " off-line ", " online ", " plusing good friend ", " add as a friend ", " post ", " bean vermicelli ", " send out personal letter ", " concern ", " register ", " send short messages ", " thigh ", " bbs ", " forum ", " club ", " tieba ", " reply ", " discussion ", " luntan ", " shequ ", " tiezi ", " huitie " and/or " thread ".
The structure feature comprises the number of structure-type feature words contained in the content of the "title" and "meta" tag nodes in the "head" subtree, and the frequency with which the three tags "h1", "h2" and "h3", which indicate font size, occur in the whole DOM tree. Preferably, the structure-type feature words comprise news, blog and forum structure-type feature words. The news structure-type feature words are "news", "news" and/or "xinwen"; the blog structure-type feature words are "blog", "blog" and/or "boke"; the forum structure-type feature words are "club", "bbs", "forum", "thread", "tieba", "forum", "community", "mhkc", "model", "money order receipt to be signed and returned to the sender", "luntan", "shequ", "tieba", "tiezi" and/or "huitie";
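The structure feature described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it uses Python's standard `html.parser` event stream instead of a full DOM tree, reads the text of `title` and the `content` attribute of `meta` (one plausible reading of "content attribute"), and uses a shortened, hypothetical `STRUCT_WORDS` table.

```python
from html.parser import HTMLParser

# Hypothetical, shortened word lists; the full lists are given in the text above.
STRUCT_WORDS = {
    "news": ["news", "xinwen"],
    "blog": ["blog", "boke"],
    "forum": ["club", "bbs", "forum", "thread", "tieba",
              "luntan", "shequ", "tiezi", "huitie"],
}

class StructureScanner(HTMLParser):
    """Collects <title>/<meta content> text inside <head> and counts h1-h3."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_title = False
        self.head_text = []
        self.headings = {"h1": 0, "h2": 0, "h3": 0}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "title" and self.in_head:
            self.in_title = True
        elif tag == "meta" and self.in_head:
            content = dict(attrs).get("content")
            if content:
                self.head_text.append(content.lower())
        elif tag in self.headings:
            self.headings[tag] += 1

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.head_text.append(data.lower())

def structure_features(html):
    """Return per-type word counts in the head plus h1/h2/h3 counts."""
    scanner = StructureScanner()
    scanner.feed(html)
    scanner.close()
    text = " ".join(scanner.head_text)
    feats = {page_type: sum(text.count(w) for w in words)
             for page_type, words in STRUCT_WORDS.items()}
    feats.update(scanner.headings)
    return feats
```

Counting substrings rather than tokens keeps the sketch independent of word segmentation, which matters for Chinese text.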
The tag feature is the percentage of the webpage's total tags accounted for by 50 preferred common tags. The 50 common tags are:
"tbody", "span", "div", "tr", "td", "table", "ul", "li", "p", "a", "b", "font", "i", "em", "big", "strong", "small", "sup", "sub", "u", "br", "hr", "frame", "frameset", "noframes", "iframe", "form", "input", "textarea", "button", "select", "option", "label", "fieldset", "ol", "dl", "dt", "dd", "caption", "th", "thead", "col", "style", "meta", "script", "noscript", "applet", "object", "link" and/or "img";
The link feature is the number or percentage of URLs that contain each class of URL type feature word among the attribute values of URL links, where the attribute values comprise the values of the href attribute of "a" and "link" tags and/or the values of the src attribute of "img" tags;
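A minimal sketch of the tag feature and the link feature, under stated assumptions: each common tag's share of all start tags is reported separately, and the `LINK_WORDS` table is a hypothetical stand-in for the URL type feature words, which are defined elsewhere in this document.

```python
from collections import Counter
from html.parser import HTMLParser

COMMON_TAGS = [
    "tbody", "span", "div", "tr", "td", "table", "ul", "li", "p", "a",
    "b", "font", "i", "em", "big", "strong", "small", "sup", "sub", "u",
    "br", "hr", "frame", "frameset", "noframes", "iframe", "form", "input",
    "textarea", "button", "select", "option", "label", "fieldset", "ol",
    "dl", "dt", "dd", "caption", "th", "thead", "col", "style", "meta",
    "script", "noscript", "applet", "object", "link", "img",
]
# Which attribute carries the URL for each link-bearing tag (from the text above).
URL_ATTRS = {"a": "href", "link": "href", "img": "src"}
# Hypothetical URL type feature words, for illustration only.
LINK_WORDS = {
    "news": ["news", "xinwen"],
    "blog": ["blog", "boke"],
    "forum": ["bbs", "forum", "luntan"],
}

class TagAndLinkScanner(HTMLParser):
    """Counts every start tag and collects href/src values of a, link, img."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        wanted = URL_ATTRS.get(tag)
        if wanted:
            self.urls.extend(v.lower() for k, v in attrs if k == wanted and v)

def tag_and_link_features(html):
    scanner = TagAndLinkScanner()
    scanner.feed(html)
    scanner.close()
    total = sum(scanner.tag_counts.values())
    # Share of all tags taken by each of the 50 common tags.
    tag_share = {t: (scanner.tag_counts[t] / total if total else 0.0)
                 for t in COMMON_TAGS}
    # (count, ratio) of URLs containing a feature word of each type.
    link_feats = {}
    for page_type, words in LINK_WORDS.items():
        hits = sum(1 for u in scanner.urls if any(w in u for w in words))
        link_feats[page_type] = (hits, hits / len(scanner.urls) if scanner.urls else 0.0)
    return tag_share, link_feats
```

Both features come from a single pass over the markup, in line with the emphasis on extraction speed.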
The punctuation feature is the frequency with which Chinese and English punctuation marks occur in the text nodes of the DOM tree. The English punctuation marks comprise: {",", ".", "?", ";", "!", "-", ":"} and {",,", "..", "??", ";;", "!!", "--", "::"}. The Chinese punctuation marks comprise: {",", "。", "?", ";", "!", "……", "、", ":"} and {",,", "。。", "??", ";;", "!!", "…………", "、、", "::"}. Here ",," denotes a run of several consecutive "," rather than exactly two, and likewise for the other doubled marks.
The sentence feature comprises the frequency with which Chinese and English sentence marks occur in the text nodes of the DOM tree, the frequency with which each kind of sentence mark occurs, the total number of sentences and/or the average number of bytes per sentence. The English sentence marks comprise: {",", ".", "?", ";", "!", "-"} and {",,", "..", "??", ";;", "!!", "--"}. The Chinese sentence marks comprise: {",", "。", "?", ";", "!", "……"} and {",,", "。。", "??", ";;", "!!", "…………"}. Because the punctuation feature and the sentence feature involve no language-dependent processing such as word segmentation, part-of-speech tagging or semantic analysis, their time complexity is much lower and they can be used across languages.
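The punctuation and sentence features can be sketched with regular expressions alone. The following is an illustrative sketch, assuming that a "sentence" is any non-empty span between sentence marks and that byte length is measured in UTF-8; neither assumption is stated in the text.

```python
import re

# English and Chinese punctuation marks listed in the text above.
PUNCT_MARKS = [",", ".", "?", ";", "!", "-", ":",
               ",", "。", "?", ";", "!", "……", "、", ":"]
# Characters treated as sentence marks ("…" covers the "……" ellipsis).
SENTENCE_MARKS = ",.?;!-,。?;!…"
_SPLIT = re.compile("[" + re.escape(SENTENCE_MARKS) + "]+")

def punctuation_features(text):
    """Count single occurrences and runs (e.g. "!!") of each mark separately."""
    feats = {}
    for mark in PUNCT_MARKS:
        single = repeated = 0
        for m in re.finditer("(?:" + re.escape(mark) + ")+", text):
            if m.group() == mark:
                single += 1
            else:
                repeated += 1
        feats[mark] = single
        feats[mark * 2] = repeated  # key such as ",," stands for a run of any length
    return feats

def sentence_features(text):
    """Return (number of sentences, average UTF-8 byte length per sentence)."""
    sentences = [s.strip() for s in _SPLIT.split(text) if s.strip()]
    total = len(sentences)
    if total == 0:
        return 0, 0.0
    avg_bytes = sum(len(s.encode("utf-8")) for s in sentences) / total
    return total, avg_bytes
```

No tokenizer or language model is involved, which is why these features carry over between languages.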
Users may add, delete or modify the feature words of the above features according to actual needs. In addition, when the text high-frequency word feature and/or syntax feature is generated, the traversed text nodes exclude invisible nodes and/or text nodes whose parent node is one of the following tag nodes: "script", "style", "object", "iframe", "textarea", "noscript", "noembed", "marquee", "frame", "frameset", "noframes", "form", "input", "button", "select", "option", "label", "fieldset", "applet", "optgroup", "legend", "isindex" and/or "param".
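A minimal sketch of the text-node filtering and high-frequency word counting described above. It tracks the open-tag stack with Python's `html.parser` to find each text node's parent; detection of invisible nodes (for example via CSS) is not attempted, and the `VOID_TAGS` set and raw substring counting are assumptions made for this sketch.

```python
from html.parser import HTMLParser

# Parent tags whose text nodes are skipped (list taken from the text above).
SKIPPED_PARENTS = {
    "script", "style", "object", "iframe", "textarea", "noscript", "noembed",
    "marquee", "frame", "frameset", "noframes", "form", "input", "button",
    "select", "option", "label", "fieldset", "applet", "optgroup", "legend",
    "isindex", "param",
}
# Tags without a closing tag; they never become a parent on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "frame", "hr", "img",
             "input", "isindex", "link", "meta", "param", "source", "track", "wbr"}

class VisibleText(HTMLParser):
    """Collects text nodes whose direct parent is not in SKIPPED_PARENTS."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag (tolerates bad nesting).
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        parent = self.stack[-1] if self.stack else None
        if parent not in SKIPPED_PARENTS and data.strip():
            self.chunks.append(data.strip())

def keyword_frequencies(html, words):
    """Return the raw occurrence count of each feature word in the kept text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return {w: text.count(w.lower()) for w in words}
```

Only the direct parent is checked, following the wording above; text nested deeper inside, say, a `form` is still counted.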
Step S240: repeat the above process to extract URL features and web page features from all training webpages, obtaining the following feature vectors: URL feature vector (R), high-frequency word text vector (C), tag vector (T), link vector (L), structure vector (U) and/or syntax vector (N).
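Three of the URL features that feed the vector R are defined in the claims of this document: the URL depth, the period count and the date flag. A minimal sketch follows. The date pattern `DATE_RE` is a hypothetical choice, since the document does not give the date regular expression, and stripping any scheme via "://" is a small generalization of the "http://" prefix.

```python
import re
from datetime import date

# Hypothetical date pattern: YYYY-MM-DD, YYYY/MM/DD or YYYYMMDD.
DATE_RE = re.compile(r"(?<!\d)(\d{4})[-/]?(\d{2})[-/]?(\d{2})(?!\d)")

def url_string(url):
    """Strip the scheme and, if present, one trailing "/"."""
    rest = url.split("://", 1)[-1]
    return rest[:-1] if rest.endswith("/") else rest

def url_depth(url):
    """Number of "/" in the URL string plus 1."""
    return url_string(url).count("/") + 1

def url_dot_count(url):
    """Number of "." before the first "/" of the URL string."""
    return url_string(url).split("/", 1)[0].count(".")

def url_date_flag(url):
    """1 if the URL string contains a date literal that is a legal date, else 0."""
    for m in DATE_RE.finditer(url_string(url)):
        year, month, day = (int(g) for g in m.groups())
        try:
            date(year, month, day)
        except ValueError:
            continue  # matches the pattern but is not a real calendar date
        return 1
    return 0
```

For example, `url_depth("http://bbs.example.com/forum/")` treats the URL string as `bbs.example.com/forum` and yields a depth of 2.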
None of the type features used in this embodiment requires manual intervention, matching of multiple DOM subtrees, or deep mining of plain text, so feature extraction is fast. Moreover, apart from the text high-frequency word feature, extraction of the features is essentially language-independent; to process webpages in another language, only the definitions of the high-frequency words need to be modified, which makes the approach suitable for cross-language webpage identification.
Fig. 3 is a schematic flowchart of optimizing the above six feature vectors and building the classifier in this embodiment. This example uses two optimization passes to ensure that classification accuracy is sufficiently high.
Step S310: standardize the URL feature vector and the web page feature vectors, i.e. the feature vectors R, C, T, L, U, N, separately to obtain the standardized feature vectors. In this embodiment, the standardization formula is defined as

$$\mathrm{Std}(ij) = \frac{f_{ij} - f_i^{avg}}{f_i^{max} - f_i^{min}}$$

where Std(ij) is the result after standardizing the j-th value of the i-th feature, $f_{ij}$ is the j-th value of the i-th feature before standardization, $f_i^{avg}$ is the mean of the i-th feature, $f_i^{max}$ is the maximum of the i-th feature, and $f_i^{min}$ is the minimum of the i-th feature.
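The standardization step can be sketched as below. The function names are illustrative, and mapping a constant feature (maximum equal to minimum) to 0.0 is an assumption added to avoid division by zero; the text does not say how that case is handled.

```python
def fit_standardizer(values):
    """Compute the stored parameters (mean, max, min) of one feature."""
    return {
        "avg": sum(values) / len(values),
        "max": max(values),
        "min": min(values),
    }

def standardize(value, params):
    """Apply Std = (f - avg) / (max - min) with previously stored parameters."""
    spread = params["max"] - params["min"]
    if spread == 0:
        return 0.0  # assumption: a constant feature carries no information
    return (value - params["avg"]) / spread

# Parameters are fitted on training pages and reused for pages to be identified.
params = fit_standardizer([1, 2, 3, 6])
scaled = [standardize(v, params) for v in [1, 2, 3, 6]]
```

Keeping the three parameters per feature is what lets the feature extractor standardize a new page without access to the training data.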
Step S320: perform a first optimization on each standardized feature vector R, C, T, L, U, N, removing the redundant features and noise features that impair classification accuracy, to obtain the preferred feature vectors R′, C′, T′, L′, U′, N′;
Step S330: combine the preferred feature vectors R′, C′, T′, L′, U′, N′ into a feature set F, perform a second feature optimization on the feature set as a whole to generate a reduced feature set S, build the classifier according to the reduced feature set S, and generate the feature extractor. The feature extractor is configured with the set features to be extracted from a webpage to be identified, namely all features of the reduced feature set S, together with the standardization parameter values of each feature. Preferably, the standardization parameter values comprise the maximum, minimum and mean of each feature in the reduced feature set.
Step S340: input the reduced feature set S into the classifier to obtain the classification model.
Steps S320 and S330 optimize the standardized feature vectors. Feature optimization limits the influence of irrelevant or redundant features by finding a relevant feature subset in the original feature space. Reducing the number of irrelevant and redundant features and keeping the features with good discriminative power greatly reduces classification time and often yields more accurate results; that is, removing the redundant or noise features that impair classification accuracy improves it. In the field of information retrieval, the popular text-based feature selection methods currently include information gain (IG), mutual information (MI), chi-square feature selection and relative frequency difference (RFD). These methods assign each feature a relevance score computed by statistical estimation. Their advantage is speed, but they ignore classifier performance when selecting relevant features. Given the high accuracy required in practical engineering applications, this embodiment preferably uses support vector machine recursive feature elimination (SVM_RFE), which has relatively high accuracy. The first optimization yields each preferred feature vector and ensures the quality of each type of feature vector; the second optimizes the set as a whole and further ensures the effect, i.e. the classification accuracy, of the feature sets once combined.
The classifier may use a support vector machine (SVM) model, naive Bayes (Naïve Bayes, NB), a decision tree (C4.5), etc. In combination with the SVM_RFE feature optimization method, this embodiment uses an SVM as the classifier.
The SVM_RFE feature selection method and the SVM classifier used in this embodiment are for illustration only; other feature selection methods and classifiers are equally applicable.
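The two-pass optimization can be illustrated with a small, dependency-free sketch. It is not the embodiment's implementation: a linear SVM trained by full-batch subgradient descent on the hinge loss stands in for a production SVM solver, labels are binary (+1/-1) rather than the three webpage types, and the names and hyperparameters are hypothetical.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Full-batch subgradient descent on the L2-regularised hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample violates the margin: hinge term is active
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def svm_rfe(X, y, keep):
    """Recursively drop the feature with the smallest squared SVM weight."""
    active = list(range(len(X[0])))
    while len(active) > keep:
        Xa = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(Xa, y)
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        del active[worst]
    return active

def two_stage_selection(groups, y, keep_per_group, keep_total):
    """groups maps a vector name (e.g. "R", "T") to its matrix, one row per page."""
    merged = []
    for name, X in groups.items():  # first pass: optimize each vector separately
        for j in svm_rfe(X, y, min(keep_per_group, len(X[0]))):
            merged.append((name, j))
    F = [[groups[name][i][j] for name, j in merged] for i in range(len(y))]
    kept = svm_rfe(F, y, min(keep_total, len(merged)))  # second pass: whole set
    return [merged[k] for k in kept]
```

Ranking by squared weight and retraining after each removal is what distinguishes recursive elimination from the one-shot statistical scores (IG, MI, chi-square) mentioned above.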
Fig. 4 is a schematic flowchart of classifying a webpage to be identified with the classifier in this embodiment, comprising the following steps:
Step S410: input the URL and source code of the webpage to be identified into the feature extractor;
Step S420: the feature extractor extracts the set features from the webpage to be identified, standardizes them according to the standardization parameters to obtain a standardized feature set, and inputs that set into the classifier;
Step S430: the classifier classifies the webpage to be identified by webpage type according to the standardized feature set and the classification model, and outputs the webpage type of the webpage to be identified.
Fig. 5 is a schematic structural diagram of the system for identifying webpage types in this embodiment. As shown in Fig. 5, the system comprises a rule memory, a rule matcher, a feature processor, a feature extractor and/or a classifier:
Rule memory, for storing the heuristic rule list;
Rule matcher, for performing rule matching in the rule list based on the uniform resource locator (URL) and source code of the webpage to be identified; if the URL and source code of the webpage to be identified satisfy the condition defined by a heuristic rule, the rule matching succeeds, and the webpage type of the webpage to be identified is output according to the successfully matched rule;
Feature processor, for extracting predefined predetermined features from the training webpages to form standardized feature vectors, optimizing the standardized feature vectors twice to form a reduced feature set, and building the classifier and the feature extractor based on the reduced feature set;
Feature extractor, for extracting the set features from the webpage to be identified according to the reduced feature set when the rule matching performed by the rule matcher is unsuccessful. The feature extractor is also used to standardize the set features extracted from the webpage to be identified according to the standardization parameter values of each feature stored in the feature storage unit.
Classifier, for generating the classification model and outputting the webpage type of the webpage to be identified according to the classification model and the set features of the webpage to be identified extracted by the feature extractor.
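How the rule matcher and the classifier cooperate can be sketched as a simple fallback. All names here are hypothetical; the extractor and classifier are passed in as plain callables, and the example rule is invented for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A heuristic rule: a condition on (url, source) and the webpage type it implies.
Rule = Tuple[Callable[[str, str], bool], str]

def match_rules(url: str, source: str, rules: Sequence[Rule]) -> Optional[str]:
    """Return the type of the first rule whose condition holds, else None."""
    for condition, page_type in rules:
        if condition(url, source):
            return page_type
    return None

def identify_page_type(
    url: str,
    source: str,
    rules: Sequence[Rule],
    extract: Callable[[str, str], List[float]],
    classify: Callable[[List[float]], str],
) -> str:
    """Try the heuristic rules first; fall back to extractor plus classifier."""
    matched = match_rules(url, source, rules)
    if matched is not None:
        return matched
    return classify(extract(url, source))

# Hypothetical rule: a host starting with "bbs." indicates a forum page.
example_rules: List[Rule] = [
    (lambda url, source: url.startswith("http://bbs."), "forum"),
]
```

Because the rules are just a list, they can be loaded or unloaded without touching the trained classifier.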
Preferably, the feature processor further comprises a URL feature extraction unit, a web page feature extraction unit, a feature optimization unit, a feature storage unit and/or a construction unit:
URL feature extraction unit, for extracting the URL features from the URL string of a training webpage and generating the URL feature vector;
Web page feature extraction unit, for reading the source code of the training webpage and converting it into a DOM tree, traversing the nodes of the DOM tree in order, extracting the web page features of the training webpage and generating the web page feature vectors, where the web page features comprise the text high-frequency word feature, structure feature, tag feature, link feature and/or syntax feature;
Feature optimization unit, for standardizing the URL feature vector and the web page feature vectors of the training webpages to obtain standardized feature vectors, and optimizing the standardized feature vectors twice to generate the reduced feature set;
Feature storage unit, for storing all features of the reduced feature set and the standardization parameter values of each feature.
Construction unit, for building the feature extractor and the classifier according to the reduced feature set.
Preferably, the feature optimization unit comprises a first-stage feature optimization unit and a second-stage feature optimization unit. The first-stage optimization unit performs the first feature optimization on each standardized feature vector using the SVM_RFE method, removing the redundant features and noise features that impair classification accuracy, to obtain the preferred feature vectors. The second-stage optimization unit performs the second feature optimization on the feature set formed by the preferred feature vectors as a whole, generating the reduced feature set.
The system and method for identifying webpage types of the present invention are flexible both in their individual steps and in the use of their individual units. The heuristic rule component can be loaded or unloaded at any time as required. The predefined features involve no complex processing, which ensures extraction speed, and extraction of most features is largely language-independent, so little needs to change for cross-language use. After feature extraction, two consecutive feature optimization passes ensure both the quality of each feature set and the effect of all feature sets once combined, further ensuring identification accuracy. In terms of recognition speed, recognition accuracy and cross-language use alike, the system can meet practical engineering requirements.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (16)

1. A method for identifying webpage types, comprising the following steps:
(1) predefining heuristic rules for one or more specific webpage types and generating a heuristic rule list, each heuristic rule corresponding to one webpage type;
(2) selecting training webpages, extracting predefined predetermined features from the training webpages to form standardized feature vectors, optimizing the standardized feature vectors twice to form a reduced feature set, and building a classifier and a feature extractor based on the reduced feature set, wherein the classifier generates, from the reduced feature set, a classification model for determining the webpage type of a webpage to be identified, and the feature extractor is configured, according to the reduced feature set, with the set features to be extracted from the webpage to be identified;
(3) performing rule matching in the heuristic rule list based on the uniform resource locator (URL) and source code of the webpage to be identified; if the URL and source code of the webpage to be identified satisfy the condition defined by a heuristic rule, the rule matching succeeds and the method proceeds to step (4); otherwise, the method proceeds to step (5);
(4) outputting the webpage type of the webpage to be identified according to the matched rule;
(5) inputting the URL and source code of the webpage to be identified into the feature extractor, the feature extractor extracting the set features of the webpage to be identified, and the classifier classifying the webpage to be identified by webpage type according to the extracted set features and the classification model and outputting the webpage type of the webpage to be identified.
2. The method according to claim 1, characterized in that: the predetermined features comprise URL features extracted from the URL string of a webpage and/or web page features extracted from the nodes of the document object model (DOM) tree corresponding to the webpage source code, the webpage being a training webpage or a webpage to be identified.
3. The method according to claim 2, characterized in that: if the URL ends with "/", the URL string is the string between the leading "http://" and the trailing "/" of the URL; if the URL does not end with "/", the URL string is the whole string following the leading "http://" of the URL.
4. The method according to claim 3, characterized in that: the URL features comprise any one or more of the following:
a URL depth value, being the number of "/" in the URL string plus 1;
a URL period count, being the number of "." in the portion of the URL string before the first "/";
a URL date feature value, indicating whether the URL string contains a date literal that matches a date regular expression; if such a date literal exists and the date it represents is legal, the date feature value is set to "1"; otherwise the date feature value is set to "0";
the frequency of URL type feature words, the URL type feature words being predefined feature words that indicate webpage type; and/or
the score of URL type feature words, the scoring function of the URL type feature words being:
Figure FDA0000376046870000021
where i denotes the i-th URL type feature word, D is the total depth of the URL, and j denotes the j-th depth level,
5. The method according to claim 4, characterized in that: the URL type feature words are type feature words used to determine webpage type.
6. The method according to claim 2, characterized in that: the web page features comprise a text high-frequency word feature, structure feature, tag feature, link feature and/or syntax feature, the syntax feature comprising a punctuation feature and a sentence feature;
the text high-frequency word feature is the frequency with which high-frequency feature words related to webpage type occur in the text nodes of the document object model (DOM) tree corresponding to the webpage source code, the high-frequency feature words being the text high-frequency feature words used to determine webpage type;
the structure feature comprises the number of structure-type feature words contained in the content of the "title" and "meta" tag nodes in the "head" subtree, and the frequency with which the three tags "h1", "h2" and "h3", which indicate font size, occur in the whole DOM tree, the structure-type feature words being the structure-type feature words used to determine webpage type;
the tag feature is the percentage of the webpage's total tags accounted for by 50 preset common tags;
the link feature is the number or percentage of URLs that contain each class of URL type feature word among the attribute values of URL links, the attribute values comprising the values of the href attribute of "a" and "link" tags and/or the values of the src attribute of "img" tags;
the punctuation feature is the frequency with which Chinese and English punctuation marks occur in the text nodes of the DOM tree;
the sentence feature comprises the frequency with which Chinese and English sentence marks occur in the text nodes of the DOM tree, the frequency with which each kind of sentence mark occurs, the total number of sentences and/or the average number of bytes per sentence.
7. The method according to any one of claims 1 to 6, wherein extracting the predetermined features of the training webpages and forming the feature vectors comprises the following steps:
selecting a training webpage and extracting the URL features from the URL string of the training webpage;
reading the source code of the training webpage and converting it into a DOM tree;
traversing the nodes of the DOM tree in order and generating the web page features of the training webpage, the web page features comprising the text high-frequency word feature, structure feature, tag feature, link feature and/or syntax feature;
repeating the above process to extract URL features and web page features from all training webpages, obtaining the following feature vectors: URL feature vector, high-frequency word text vector, tag vector, link vector, structure vector and/or syntax vector.
8. The method according to claim 7, characterized in that: when the text high-frequency word feature and/or syntax feature is generated, the traversed text nodes exclude invisible nodes and/or text nodes whose parent node is one of the following tag nodes: "script", "style", "object", "iframe", "textarea", "noscript", "noembed", "marquee", "frame", "frameset", "noframes", "form", "input", "button", "select", "option", "label", "fieldset", "applet", "optgroup", "legend", "isindex" and/or "param".
9. The method according to claim 7, wherein optimizing the feature vectors and building the classifier comprises the following steps:
standardizing the URL feature vector and the web page feature vectors separately to obtain the standardized feature vectors;
performing a first feature optimization on each standardized feature vector using support vector machine recursive feature elimination (SVM_RFE), removing the redundant features and noise features that impair classification accuracy, to obtain the preferred feature vectors;
combining the preferred feature vectors into a feature set, performing a second feature optimization on the feature set as a whole to generate the reduced feature set, and building the classifier and the feature extractor according to the reduced feature set, the feature extractor being configured with the set features to be extracted from a webpage to be identified and the standardization parameter values of each feature, the set features being all features of the reduced feature set;
inputting the reduced feature set into the classifier to obtain the classification model.
10. The method according to claim 9, characterized in that: the standardization formula is

$$\mathrm{Std}(ij) = \frac{f_{ij} - f_i^{avg}}{f_i^{max} - f_i^{min}}$$

where Std(ij) is the result after standardizing the j-th value of the i-th feature, $f_{ij}$ is the j-th value of the i-th feature before standardization, $f_i^{avg}$ is the mean of the i-th feature, $f_i^{max}$ is the maximum of the i-th feature, and $f_i^{min}$ is the minimum of the i-th feature.
11. The method according to claim 9, characterized in that: the standardization parameter values comprise the maximum, minimum and mean of each feature in the reduced feature set.
12. The method according to claim 9, characterized in that: classifying the webpage to be identified with the classifier comprises the following steps:
inputting the URL and source code of the webpage to be identified into the feature extractor;
the feature extractor extracting the set features from the webpage to be identified, standardizing the set features according to the standardization parameters to obtain a standardized feature set, and inputting the standardized feature set into the classifier;
the classifier classifying the webpage to be identified by webpage type according to the standardized feature set and the classification model, and outputting the webpage type of the webpage to be identified.
13. A system for identifying webpage types, comprising:
a rule memory, for storing the heuristic rule list;
a rule matcher, for performing rule matching in the rule list based on the uniform resource locator (URL) and source code of a webpage to be identified; if the URL and source code of the webpage to be identified satisfy the condition defined by a heuristic rule, the rule matching succeeds, and the webpage type of the webpage to be identified is output according to the successfully matched rule;
a feature processor, for extracting predefined predetermined features from training webpages to form standardized feature vectors, optimizing the standardized feature vectors twice to form a reduced feature set, and building a classifier and a feature extractor based on the reduced feature set;
a feature extractor, for extracting the set features from the webpage to be identified according to the reduced feature set when the rule matching performed by the rule matcher is unsuccessful;
a classifier, for generating a classification model and outputting the webpage type of the webpage to be identified according to the classification model and the set features extracted by the feature extractor.
14. The system according to claim 13, characterized in that: the feature processor further comprises:
a URL feature extraction unit, for extracting the URL features from the URL string of a training webpage and generating the URL feature vector;
a web page feature extraction unit, for reading the source code of the training webpage and converting it into a DOM tree, traversing the nodes of the DOM tree in order, extracting the web page features of the training webpage and generating the web page feature vectors, the web page features comprising the text high-frequency word feature, structure feature, tag feature, link feature and/or syntax feature;
a feature optimization unit, for standardizing the URL feature vector and the web page feature vectors of the training webpages to obtain standardized feature vectors, and optimizing the standardized feature vectors twice to generate the reduced feature set;
a feature storage unit, for storing all features of the reduced feature set and the standardization parameter values of each feature;
a construction unit, for building the feature extractor and the classifier according to the reduced feature set.
15. The system according to claim 14, characterized in that: the feature optimization unit comprises a first-stage feature optimization unit and a second-stage feature optimization unit; the first-stage optimization unit performs the first feature optimization on each standardized feature vector using the SVM_RFE method, removing the redundant features and noise features that impair classification accuracy, to obtain the preferred feature vectors; the second-stage optimization unit performs the second feature optimization on the feature set formed by the preferred feature vectors as a whole, generating the reduced feature set.
16. The system according to claim 14, characterized in that: the feature extractor is also used to standardize the set features extracted from the webpage to be identified according to the standardization parameter values of each feature stored in the feature storage unit.
CN201310391961.5A 2013-09-02 2013-09-02 System and method for identifying webpage types Active CN103544210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310391961.5A CN103544210B (en) 2013-09-02 2013-09-02 System and method for identifying webpage types

Publications (2)

Publication Number Publication Date
CN103544210A true CN103544210A (en) 2014-01-29
CN103544210B CN103544210B (en) 2017-01-18

Family

ID=49967663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310391961.5A Active CN103544210B (en) 2013-09-02 2013-09-02 System and method for identifying webpage types

Country Status (1)

Country Link
CN (1) CN103544210B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090164502A1 (en) * 2007-12-24 2009-06-25 Anirban Dasgupta Systems and methods of universal resource locator normalization
CN101872347A (en) * 2009-04-22 2010-10-27 富士通株式会社 Method and device for judging type of webpage
CN102411587B (en) * 2010-09-21 2013-08-21 腾讯科技(深圳)有限公司 Webpage classification method and device
CN102567337A (en) * 2010-12-15 2012-07-11 盛乐信息技术(上海)有限公司 Method and system for quickly recognizing webpage types through links
CN103020067A (en) * 2011-09-21 2013-04-03 北京百度网讯科技有限公司 Method and device for determining webpage type

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
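Chen Han et al.: "A webpage type identification method based on comprehensive features" (一种基于综合特征的网页类型识别方法), Journal of Information Engineering University (《信息工程大学学报》) *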

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834685A (en) * 2015-04-17 2015-08-12 百度国际科技(深圳)有限公司 Method and device for processing comment message block in comment-like webpage
CN105302884A (en) * 2015-10-19 2016-02-03 天津海量信息技术有限公司 Deep learning-based webpage mode recognition method and visual structure learning method
CN105302884B (en) * 2015-10-19 2019-02-19 天津海量信息技术股份有限公司 Webpage mode identification method and visual structure learning method based on deep learning
CN106874302B (en) * 2015-12-14 2019-12-24 北京国双科技有限公司 Setting rate determination method and device
CN106874302A (en) * 2015-12-14 2017-06-20 北京国双科技有限公司 Setting rate determination method and device
CN105678352A (en) * 2015-12-31 2016-06-15 电子科技大学 Long distance high speed data transmission system based on ultrahigh frequency RFID
CN106211165A (en) * 2016-06-14 2016-12-07 北京奇虎科技有限公司 Method and device for detecting foreign language harassment short message and corresponding client
CN106211165B (en) * 2016-06-14 2020-04-21 北京奇虎科技有限公司 Method and device for detecting foreign language harassment short message and corresponding client
CN106205288A (en) * 2016-09-23 2016-12-07 长沙军鸽软件有限公司 Implementation method for training a robot
CN106528655A (en) * 2016-10-18 2017-03-22 百度在线网络技术(北京)有限公司 Text subject recognition method and device
CN106779992A (en) * 2016-11-28 2017-05-31 畅捷通信息技术股份有限公司 Method and apparatus for generating financial records and an electronic account book from short messages
CN106790593A (en) * 2016-12-28 2017-05-31 北京奇虎科技有限公司 Page processing method and device
CN108733405A (en) * 2017-04-13 2018-11-02 富士通株式会社 Method and apparatus for training a distributed webpage representation model
CN107341183A (en) * 2017-05-31 2017-11-10 中国科学院信息工程研究所 Website classification method based on comprehensive characteristics of darknet websites
CN107545179A (en) * 2017-07-11 2018-01-05 宁波大学 Junk web page identification method
CN107545179B (en) * 2017-07-11 2020-06-19 宁波大学 Junk web page identification method
CN110020331A (en) * 2017-07-20 2019-07-16 北京国双科技有限公司 Webpage type identification method and device
CN107577783A (en) * 2017-09-15 2018-01-12 电子科技大学 Automatic webpage type identification method based on Web structural feature mining
CN107818132A (en) * 2017-09-21 2018-03-20 中国科学院信息工程研究所 Webpage agent discovery method based on machine learning
CN107832774A (en) * 2017-10-09 2018-03-23 无线生活(杭州)信息科技有限公司 Page exception detection method and device
CN107957872A (en) * 2017-10-11 2018-04-24 中国互联网络信息中心 Full website source code acquisition method, and illegal website detection method and system
CN112182578A (en) * 2017-10-24 2021-01-05 创新先进技术有限公司 Model training method, URL detection method and device
WO2019080660A1 (en) * 2017-10-24 2019-05-02 阿里巴巴集团控股有限公司 Model training method, method and device for testing url
TWI696090B (en) * 2017-10-24 2020-06-11 香港商阿里巴巴集團服務有限公司 Model training method, method and device for detecting URL
CN107992741A (en) * 2017-10-24 2018-05-04 阿里巴巴集团控股有限公司 Model training method, URL detection method and device
CN110110075A (en) * 2017-12-25 2019-08-09 中国电信股份有限公司 Web page classification method, device and computer readable storage medium
CN111989699A (en) * 2018-01-29 2020-11-24 微软技术许可有限责任公司 Calendar-aware resource retrieval
CN108519986B (en) * 2018-02-24 2022-01-28 创新先进技术有限公司 Webpage generation method, device and equipment
CN108519986A (en) * 2018-02-24 2018-09-11 阿里巴巴集团控股有限公司 Webpage generation method, device and equipment
CN108304890A (en) * 2018-03-16 2018-07-20 科大讯飞股份有限公司 Method and device for generating a classification model
CN108829898B (en) * 2018-06-29 2020-11-20 无码科技(杭州)有限公司 HTML content page release time extraction method and system
CN109067708A (en) * 2018-06-29 2018-12-21 北京奇虎科技有限公司 Webpage backdoor detection method, device, equipment and storage medium
CN108829898A (en) * 2018-06-29 2018-11-16 无码科技(杭州)有限公司 HTML content page release time extraction method and system
CN109284384A (en) * 2018-10-10 2019-01-29 拉扎斯网络科技(上海)有限公司 Text analysis method, apparatus, electronic device and readable storage medium
CN109740146A (en) * 2018-12-10 2019-05-10 厦门市美亚柏科信息股份有限公司 Public opinion monitoring method, terminal and storage medium
CN109740146B (en) * 2018-12-10 2023-02-03 厦门市美亚柏科信息股份有限公司 Public opinion monitoring method, terminal and storage medium
CN109726347A (en) * 2018-12-29 2019-05-07 杭州迪普科技股份有限公司 Network request automatic classification method and relevant device
CN111488511A (en) * 2019-01-25 2020-08-04 深信服科技股份有限公司 Website theme extraction method and system, electronic equipment and storage medium
CN111488511B (en) * 2019-01-25 2024-04-09 深信服科技股份有限公司 Website theme extraction method and system, electronic equipment and storage medium
CN110046746A (en) * 2019-03-18 2019-07-23 北京牡丹电子集团有限责任公司数字电视技术中心 Scheduling method for a network public opinion device based on reinforcement learning
CN110442807A (en) * 2019-08-05 2019-11-12 腾讯科技(深圳)有限公司 Webpage type identification method, device, server and storage medium
CN112100530A (en) * 2020-08-03 2020-12-18 百度在线网络技术(北京)有限公司 Webpage classification method and device, electronic equipment and storage medium
CN112100530B (en) * 2020-08-03 2023-12-22 百度在线网络技术(北京)有限公司 Webpage classification method and device, electronic equipment and storage medium
CN112115357A (en) * 2020-09-11 2020-12-22 华中师范大学 Online course forum interaction mode identification method and system
CN112199148A (en) * 2020-10-15 2021-01-08 Tcl通讯(宁波)有限公司 Information processing method and device, storage medium and terminal
CN112287272A (en) * 2020-10-27 2021-01-29 中国科学院计算技术研究所 Method, system and storage medium for classifying website list pages
CN112287272B (en) * 2020-10-27 2023-05-23 中国科学院计算技术研究所 Method, system and storage medium for classifying website list pages
CN115374334B (en) * 2022-10-26 2023-01-06 墨责(北京)科技传播有限公司 Text page acquisition method of webpage acquisition page based on machine learning
CN115374334A (en) * 2022-10-26 2022-11-22 墨责(北京)科技传播有限公司 Text page acquisition method of webpage acquisition page based on machine learning
CN116596386A (en) * 2023-05-20 2023-08-15 中咨海外咨询有限公司 Feasibility analysis and evaluation method for engineering construction project
CN116596386B (en) * 2023-05-20 2023-10-10 中咨海外咨询有限公司 Feasibility analysis and evaluation method for engineering construction project

Also Published As

Publication number Publication date
CN103544210B (en) 2017-01-18

Similar Documents

Publication Publication Date Title
CN103544210B (en) System and method for identifying webpage types
Chakraborty et al. Stop clickbait: Detecting and preventing clickbaits in online news media
US7565350B2 (en) Identifying a web page as belonging to a blog
US8321396B2 (en) Automatically extracting by-line information
US20150066934A1 (en) Automatic classification of segmented portions of web pages
Geçkil et al. A clickbait detection method on news sites
CN102890702A (en) Internet forum-oriented opinion leader mining method
Daas Natural language processing
CN104965823A (en) Big data based opinion extraction method
Reinanda et al. Document filtering for long-tail entities
JP4911599B2 (en) Reputation information extraction device and reputation information extraction method
CN109165373B (en) Data processing method and device
CN105183765A (en) Big data-based topic extraction method
Papadakis et al. Graph vs. bag representation models for the topic classification of web documents
CN104346382A (en) Text analysis system and method employing language query
Scharkow Content analysis, automatic
Bsoul et al. Building an optimal dataset for arabic fake news detection
CN115017302A (en) Public opinion monitoring method and public opinion monitoring system
Zhang et al. Event-based summarization for scientific literature in chinese
Zhao et al. Modeling Chinese microblogs with five Ws for topic hashtags extraction
Jena et al. Data extraction and web page categorization using text mining
CN113157857A (en) Hot topic detection method, device and equipment for news
Griazev et al. Web mining taxonomy
Selvadurai A natural language processing based web mining system for social media analysis
Arora et al. Web-based news straining and summarization using machine learning enabled communication techniques for large-scale 5G networks

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant