CN103544210A - System and method for identifying webpage types

Info

Publication number: CN103544210A
Application number: CN201310391961.5A
Authority: CN
Inventors: 李海燕; 王海洋; 刘大伟; 刘玮; 余智华; 隋雪青
Original assignee: Yantai Zhong Ke Network Technical Institute
Current assignee: Yantai Zhong Ke Network Technical Institute
Priority date: 2013-09-02
Filing date: 2013-09-02
Publication date: 2014-01-29
Anticipated expiration: 2033-09-02
Also published as: CN103544210B

Abstract

The invention relates to the field of network information retrieval and mining, and in particular to a system and method for identifying webpage types. The method comprises the following steps: heuristic rules are predefined and a heuristic rule list is generated; predetermined features are extracted from training webpages to form a standardized feature vector, which is optimized twice to form a reduced feature set; a classifier and a feature extractor are built, and the classifier generates a classification model; rule matching against the heuristic rule list is performed based on the URL and source code of a webpage to be identified; if matching succeeds, the webpage type of the webpage to be identified is output; if matching fails, the classifier classifies the webpage to be identified by type. The system and method are flexible and convenient to use, identify webpage types quickly and accurately, require no major changes when applied to webpages in different languages, and are efficient and of high practical value.

Description

System and method for identifying webpage types

Technical field

The present invention relates to the field of network information retrieval and mining, and in particular to a system and method for identifying webpage types.

Background art

As the amount of information on the network grows, it is sometimes difficult to retrieve the documents a user wants through a search engine, and how to present search results to the user is attracting increasing attention. Most traditional search systems return a large set of web documents that match the user's query. However, the high recall and low precision of the result set make it increasingly difficult for users to find useful information. In recent years, researchers have studied methods for organizing documents by topic extensively and have achieved good results. However, although documents can be classified successfully by topic, each topic still contains webpages of many different genres or types. For example, the results classified under the topic "NBA games" include home pages, news pages, picture pages, and so on, yet some users only want to see news pages about "NBA games", and others only forum pages about "NBA games". Therefore, besides topic, the genre or type of a document can be regarded as a second view of the document, and it provides another criterion by which search engine users can classify webpages.

In addition, classifying webpages by type is also useful in network public opinion monitoring systems. With the development of Internet technology, the network has gradually replaced traditional media such as newspapers, radio, and television and has become an indispensable new medium in people's lives, taking on the role of transmitting and spreading information quickly. Domestic and international events alike are published on the network at very high speed, and Internet users express their views, suggestions, and opinions on public events, hot topics, and focal issues through the network, thereby forming network public opinion. Network public opinion spreads at unprecedented speed and has become the main venue for expressing public opinion. For government and similar departments, timely monitoring and guidance of network public opinion is important for maintaining long-term national stability and social harmony; negative public opinion needs to be guided and defused promptly and effectively, so as to eliminate threats to social security and maintain stable social development. At present, the four main carriers of network public opinion are news, forums (bbs), blogs, and microblogs (weibo). A network public opinion monitoring system monitors, within a certain range of time and space, the occurrence and development of a social event and the views and attitudes that the public holds about it on the network. It mainly collects massive amounts of information from the Internet in real time through an acquisition system, then extracts the body content of the webpages, and finally analyzes and processes the information intelligently, thereby providing functions such as hot-topic identification, topic tracking, sensitive-topic mining, public opinion trend analysis, public opinion early warning, and sentiment classification. Public opinion data is collected and extracted mainly from the body pages of news sites, forums, and blogs. Many webpage information extraction techniques exist; however, because the body pages of news sites, forums, and blogs each have their own structure, no single algorithm suits all public opinion carriers. Different types of carrier therefore need algorithms suited to their own characteristics, so that extraction is accurate enough for the monitoring system to process the data precisely. Accurate identification of the carrier type is therefore essential. At present, some public opinion systems identify webpage types by manual matching. However, as the number of websites increases, the entry URLs (Uniform Resource Locators) of webpages also change frequently, and manual processing is extremely inefficient when millions of websites are involved, so automatic identification of webpage types becomes particularly important.

In recent years, the automatic identification of web document genre has attracted increasing attention, and research on it has made considerable progress. Xu Shiming's "An efficient SVM Chinese webpage classifier based on pre-classification" holds that parts such as the webpage title and keywords carry higher weight for the classification result, and proposes pre-classifying pages according to a preset keyword list and the title content; however, the webpage features it uses are not comprehensive. Zheng Dequan's "Research on blog webpage classification and recognition technology" analyzes the features of blog pages and proposes identifying blog pages by computing similarity from page structure and keywords, but it requires a standard blog page set to be built manually, so it is relatively ineffective in practical engineering applications. Zhang Cheng's "Automatic identification of blog webpages based on DOM tree structure" proposes a blog page identification algorithm that performs pattern matching on DOM (Document Object Model) trees containing timestamps. Lei Dong's "An Examination of Genre Attributes for Web Page Classification" focuses on features such as the text content, form type, and functional tags of a webpage, and proposes an automatic identification technique for webpage types such as news and e-commerce. Hu Xuegang's "Research on relevant features for automatic identification of news webpages" proposes combining URL features, structural features, and high-frequency words in the content as identification features for news pages. Yang Yuhang's "Blog research" identifies blog pages from the date in the page header, keywords, and some other features. Tomoyuki's "Automatically Collecting, Monitoring, and Mining Japanese Weblogs" proposes using the sequentially arranged calendar found on most blog pages as the main feature for identifying them. These methods either consider only some aspects of a webpage's features or, depending on the application, consider features exclusive to one webpage type. Although they all achieve good classification results, they are limited to identifying particular webpage types and cannot meet the carrier-type identification requirements of practical engineering applications. Moreover, as network technology develops, the failure of some features may cause the whole identification process to fail; for example, the blog pages of some websites have no calendar.

In addition, patent CN101872347A, entitled "Method and apparatus for judging webpage type", first matches against a URL rule list and, if matching fails, extracts the URL date feature, the meta, RSS feed, and Atom feed features, the text features, link features, and anchor text of the webpage, and the number of occurrences of repeated patterns. This method combines rule-based identification with statistical learning, but the tag, text, and structural features it considers are neither comprehensive nor complete, and features such as meta, RSS feed, and Atom feed do not help to distinguish every webpage type. Furthermore, rule-based identification does not help for some webpage types. For example, when distinguishing the bbs list page "http://bbs.tianya.cn/list-no04-1.shtml" from the bbs body page "http://bbs.tianya.cn/post-no04-2300663-1.shtml", or the microblog page "http://sd.sina.com.cn/news/shenbian/2013-06-24/11382535.html" from the news page "http://sd.sina.com.cn/news/sdyw/2013-06-24/070827368.html", the site information "bbs.tianya.cn" and "sd.sina.com.cn" has no discriminating power in rule matching, and classification is required. Rule matching is meaningful only if, for each website, pages from every column containing the webpage types to be distinguished have first been used as training samples, classified, and then used to update the rule list. Once the webpage types to be distinguished increase or change, the whole rule list must be retrained and updated. Faced with the hundreds of millions of columns across millions of websites in real projects, the method has low feasibility and high maintenance cost. Therefore, if rule matching is applied when the training set is not broad and comprehensive, it either has no effect or produces identification errors that cannot be corrected.

In addition, patent CN103020067A, entitled "Method and apparatus for determining webpage type", proposes forming a feature vector from the n-grams of all queries for which the webpage to be identified was clicked, as recorded in the search log, and identifying the type from the correlation between vectors. Unlike the present invention, it targets a different application scenario and relies on different background knowledge; different application scenarios need different background knowledge as auxiliary support. The present invention operates with only the URL and the page text of the webpage, so the two take different technical routes.

From the perspective of practical engineering applications, whether a user retrieves webpages of a particular type on a topic of interest through a search engine or performs some operation of interest through a network public opinion monitoring system, high accuracy and timeliness of the results are required. To date, methods for identifying webpage types, or a particular webpage type, are based either on manually set heuristic rule features or on statistical machine learning, and both can produce good results in a specific application scenario when the corresponding background knowledge is available. Heuristic rule methods identify pages using manually set rules that have discriminating power; although they are somewhat faster, their precision is often not high enough and they do not generalize well. Classification-based machine learning methods rely on statistics over a large number of training samples; with appropriate feature selection and a suitable classifier, their precision and speed can reach the desired level. The accuracy of a classifier depends to a large extent on the quality of the features. The features currently used for webpage type identification mainly include URL features and some basic features of the page content itself, such as tag features. Some foreign literature, when handling English webpages, also extracts features by applying natural language analysis to the plain text. Although this works well, methods that rely on natural language processing, such as word segmentation, part-of-speech tagging, and syntactic analysis, are usually language dependent. For example, word segmentation, part-of-speech tagging, syntactic analysis, and semantic analysis differ completely between Chinese and English (English needs no word segmentation), and Chinese is more complex. When such methods are applied to webpages in other languages, large parts of the algorithm must be changed, and efficiency is comparatively low.

Summary of the invention

The technical problem to be solved by the present invention is to provide a system and method for identifying webpage types that solve the following problems in the prior art: webpage type identification based on heuristic rules performs poorly, classifier features are poorly chosen, and, in particular, large changes are required and efficiency is low when identifying webpages in different languages.

The technical solution of the present invention to the above technical problem is as follows: a method for identifying webpage types, comprising the following steps:

(1) predefining heuristic rules for one or more specific webpage types and generating a heuristic rule list, each heuristic rule corresponding to one webpage type;

(2) selecting training webpages, extracting predefined predetermined features from the training webpages to form a standardized feature vector, optimizing the standardized feature vector twice to form a reduced feature set, and building a classifier and a feature extractor based on the reduced feature set, wherein the classifier uses the reduced feature set to generate a classification model for determining the webpage type of a webpage to be identified, and the feature extractor is configured, according to the reduced feature set, with the set features to be extracted from the webpage to be identified;

(3) performing rule matching against the heuristic rule list based on the uniform resource locator (URL) and source code of the webpage to be identified; if the URL and source code of the webpage to be identified satisfy the condition defined by a heuristic rule, rule matching succeeds and the method proceeds to step (4); otherwise, it proceeds to step (5);

(4) outputting the webpage type of the webpage to be identified according to the matched rule;

(5) inputting the URL and source code of the webpage to be identified into the feature extractor, which extracts the set features of the webpage to be identified; the classifier then classifies the webpage to be identified by type according to the extracted set features and the classification model, and outputs the webpage type of the webpage to be identified.
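
The control flow of steps (3) to (5) can be sketched as follows. This is only an illustrative outline: the names `match_rules` and `identify`, the representation of a rule as a predicate paired with one page type, and the example rule are all assumptions made here, and the feature extractor and classifier are passed in as plain callables rather than implemented.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Assumption: a rule is a predicate over (url, source) plus the one page type it maps to.
Predicate = Callable[[str, str], bool]
Rule = Tuple[Predicate, str]


def match_rules(url: str, source: str, rules: Sequence[Rule]) -> Optional[str]:
    """Steps (3)-(4): return the type of the first satisfied rule, or None."""
    for predicate, page_type in rules:
        if predicate(url, source):
            return page_type
    return None


def identify(url, source, rules, extract_features, classify) -> str:
    """Try the rules first; fall back to the classifier (step (5)) if none matches."""
    page_type = match_rules(url, source, rules)
    if page_type is not None:
        return page_type
    return classify(extract_features(url, source))


# Hypothetical rule, for illustration only; it is not taken from the patent.
EXAMPLE_RULES: List[Rule] = [(lambda url, source: "/thread-" in url, "forum")]
```

Because the rule list is ordinary data, rules can be added or removed without touching the classifier, which matches the flexibility the text attributes to the heuristic stage.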

A system for identifying webpage types, comprising:

a rule memory, configured to store the heuristic rule list;

a rule matcher, configured to perform rule matching against the rule list based on the uniform resource locator (URL) and source code of the webpage to be identified, wherein, if the URL and source code of the webpage to be identified satisfy the condition defined by a heuristic rule, rule matching succeeds and the webpage type of the webpage to be identified is output according to the successfully matched rule;

a feature processor, configured to extract the predefined predetermined features from the training webpages to form a standardized feature vector, optimize the standardized feature vector twice to form a reduced feature set, and build the classifier and the feature extractor based on the reduced feature set;

a feature extractor, configured to extract the set features from the webpage to be identified according to the reduced feature set when the rule matching performed by the rule matcher fails;

a classifier, configured to generate the classification model and to output the webpage type of the webpage to be identified according to the classification model and the set features extracted by the feature extractor.

The present invention uses a heuristic rule matcher and a classifier that complement each other. The heuristic rule matcher determines the webpage type using manually defined heuristic rules; for webpages whose type features are clear or that satisfy a specific rule, rule matching is fast and accurate, and the content of the heuristic rules can be changed at any time for different webpage types, which gives great flexibility. Webpages whose features are not obvious or that satisfy no specific rule can be identified directly by the machine learning classifier. The present invention covers six types of features, and specific feature values are defined for each type. The feature values can be extracted in a single traversal of the nodes of the webpage, which ensures extraction speed. Users can define and modify the specific features and feature values used, which gives great flexibility. The definitions of most feature types are language independent and therefore suit cross-language environments. The two-stage feature optimization yields a better final feature set, which ensures extraction quality. Therefore, in terms of recognition speed, recognition accuracy, and cross-language support, the system and method of the present invention can meet practical engineering needs.

Brief description of the drawings

Fig. 1 is a flow chart of the method for identifying webpage types according to the present invention;

Fig. 2 is a flow chart of extracting the predetermined features of training webpages according to the present invention;

Fig. 3 is a flow chart of optimizing the feature vector and building the classifier according to the present invention;

Fig. 4 is a flow chart of classifying a webpage to be identified using the classifier according to the present invention;

Fig. 5 is a structural diagram of the system for identifying webpage types according to the present invention.

Embodiment

The principles and features of the present invention are described below with reference to the accompanying drawings. The examples serve only to explain the present invention and are not intended to limit its scope.

Fig. 1 is a flow chart of the method for identifying webpage types in this embodiment. As shown in Fig. 1, the method comprises the following steps:

Where specific background knowledge is available for a specific application scenario, heuristic rules are predefined for one or more specific webpage types and a heuristic rule list is generated. The heuristic rule list is stored in the rule memory, and each heuristic rule corresponds to a unique webpage type. The content of the heuristic rules is defined differently for different webpage types. A rule definition must fully match the characteristics of its webpage type; if a rule is ambiguous, it is removed. If no webpage type has an unambiguous heuristic rule that fully matches it, heuristic rule identification is skipped and the webpage to be identified is identified directly by the machine learning method.

Training webpages are selected, and the feature processor extracts the predefined predetermined features from the training webpages to form a standardized feature vector. The standardized feature vector is optimized twice to form a reduced feature set, and a classifier and a feature extractor are built based on the reduced feature set. The classifier uses the reduced feature set to generate a classification model for determining the webpage type of a webpage to be identified, and the feature extractor is configured, according to the reduced feature set, with the set features to be extracted from the webpage to be identified.

In this embodiment, the predetermined features are URL features extracted from the uniform resource locator (URL) strings of the training webpages and the webpage to be identified, and/or webpage features extracted from the nodes of the DOM tree corresponding to the webpage source code; see Fig. 2 for details.

The rule matcher matches the URL and source code of the webpage to be identified against the user-predefined heuristic rule list. If the URL and source code of the webpage to be identified satisfy the condition defined by a heuristic rule, rule matching succeeds and the webpage type of the webpage to be identified is output according to the matched rule.

If rule matching fails, or if no webpage type has an unambiguous heuristic rule that fully matches it, the URL and source code of the webpage to be identified are input into the feature extractor, which extracts the set features from the URL and the webpage source code. The classifier then classifies the webpage to be identified by type according to the classification model and the set features, and outputs the webpage type of the webpage to be identified.

Fig. 2 is a flow chart of how the feature processor in this embodiment extracts the predetermined features of the training webpages and forms the feature vector. Unless otherwise specified, the feature values of each feature in this embodiment are defined using the identification of news, forum, and blog body pages as the example; in practical engineering applications, which features and feature values are used can be set and changed manually according to the actual background knowledge and the webpage types to be identified. The flow comprises the following steps:

Step S210: select training webpages and extract the URL features from the URL string of each training webpage. If the URL ends with "/", the URL string is the string between the leading "http://" and the trailing "/" of the URL; if the URL does not end with "/", the URL string is everything after the leading "http://" of the URL.
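
A minimal sketch of the URL string extraction in step S210, assuming the two cases are "ends with /" and "does not end with /". The function name `url_string` is hypothetical, and only the "http://" prefix named in the text is handled; treating "https://" the same way would be a further assumption.

```python
def url_string(url: str) -> str:
    """Strip the leading 'http://' and, if present, one trailing '/'."""
    prefix = "http://"
    if url.startswith(prefix):
        url = url[len(prefix):]
    if url.endswith("/"):
        url = url[:-1]
    return url
```

For the forum list page quoted in the background section, `url_string("http://bbs.tianya.cn/list-no04-1.shtml")` yields `bbs.tianya.cn/list-no04-1.shtml`.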

Preferably, the URL features comprise any one or more of the following:

URL depth value: the number of "/" characters in the URL string plus 1;

URL dot count: the number of "." characters in the part of the URL string before the first "/";
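
The two counting features above translate almost directly into code. The sketch below assumes its input is the URL string already stripped of "http://" and of any trailing "/"; the function names are hypothetical.

```python
def url_depth(url_str: str) -> int:
    """Number of '/' characters in the URL string plus 1."""
    return url_str.count("/") + 1


def url_dot_count(url_str: str) -> int:
    """Number of '.' characters before the first '/' (the host part)."""
    host = url_str.split("/", 1)[0]
    return host.count(".")
```

For `sd.sina.com.cn/news/sdyw/2013-06-24/070827368.html` this gives a depth of 5 and a dot count of 3; the dot in `.html` is not counted because it comes after the first "/".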

URL date feature value: indicates whether the URL string contains a date string that matches a date regular expression. Several date regular expressions commonly found in the URLs of body pages are as follows:

{"[0-9]{4}-[0-9]{2}-[0-9]{2}", "[0-9]{4}[0-9]{2}[0-9]{2}", "[0-9]{4}-[0-9]{2}/[0-9]{2}", "[0-9]{4}-[0-9]{2}", "[0-9]{4}[0-9]{2}", "[0-9]{4}/[0-9]{2}-[0-9]{2}", "[0-9]{4}/[0-9]{2}[0-9]{2}", "[0-9]{4}/[0-9]{2}/[0-9]{2}", "[0-9]{4}/[0-9]{2}", "[0-9]{4}_[0-9]{2}/[0-9]{2}"}

The corresponding date templates are {YYYY-MM-DD, YYYYMMDD, YYYY-MM/DD, YYYY-MM, YYYYMM, YYYY/MM-DD, YYYY/MMDD, YYYY/MM/DD, YYYY/MM, YYYY_MM/DD}, where YYYY is a four-digit year such as 1999, YY is a two-digit year such as 03, MM is a two-digit month such as 05, and DD is a two-digit day such as 31. Templates and regular expressions can be added or removed according to what is observed in practice. The URL string is searched for a date string that matches one of the above regular expressions. If one is found, it is extracted and checked for validity as a date; if it is not valid, the search continues with the next string that matches a date regular expression, and the process repeats. If a valid date string exists, the feature value is set to 1; if none can be found, it is set to 0.
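
The search-and-validate loop for the date feature might look like the sketch below. Several details are assumptions made here rather than statements from the patent: "valid" is taken to mean a real calendar date as accepted by `datetime.date`, templates without a day are checked with day 1, and lookahead patterns are used so that overlapping candidates are also tried. The function name is hypothetical.

```python
import re
from datetime import date

# The ten date templates listed in the text.
TEMPLATES = ["YYYY-MM-DD", "YYYYMMDD", "YYYY-MM/DD", "YYYY-MM", "YYYYMM",
             "YYYY/MM-DD", "YYYY/MMDD", "YYYY/MM/DD", "YYYY/MM", "YYYY_MM/DD"]


def _compile(template: str) -> "re.Pattern[str]":
    body = (template.replace("YYYY", r"(?P<y>[0-9]{4})")
                    .replace("MM", r"(?P<m>[0-9]{2})")
                    .replace("DD", r"(?P<d>[0-9]{2})"))
    # Zero-width lookahead, so every start position is tried (design choice).
    return re.compile("(?=" + body + ")")


DATE_PATTERNS = [_compile(t) for t in TEMPLATES]


def url_date_feature(url_str: str) -> int:
    """Return 1 if any candidate date string is a valid calendar date, else 0."""
    for pattern in DATE_PATTERNS:
        for m in pattern.finditer(url_str):
            day = int(m.group("d")) if "d" in m.groupdict() else 1
            try:
                date(int(m.group("y")), int(m.group("m")), day)
            except ValueError:
                continue  # not a valid date; keep looking
            return 1
    return 0
```

A production version would probably also restrict the year to a plausible range, since any four digits followed by a valid month currently pass.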

Frequency of URL type feature words: URL type feature words are predefined feature words that indicate the webpage type. Taking as the example the four classes that distinguish news body pages, forum body pages, blog body pages, and other webpage types, the URL type feature words comprise news URL type feature words, forum URL type feature words, blog URL type feature words, and fourth-class URL type feature words;

Preferably, the news URL type feature words comprise: "story", "article", "content", "news" and/or "xinwen"; the forum URL type feature words comprise: "detail", "thread-", "viewthread", "read-", "tid", "forum", "luntan", "bbs", "tieba", "guba", "shequ", "tiezi", "huitie", "post" and/or "showtopic"; the blog URL type feature words comprise: "blog", "static" and/or "boke"; the fourth-class URL type feature words comprise: "node", "main", "list", "index", "more", "category", "item", "default", "brief", "catid", "specia", "data", "club", "group", "rss", "board", "formblogger", "profile", "link", "search", "login", "front", "class", "forum-", "channel", "fid" and/or "tag".
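
As a rough illustration, the frequency feature can be computed by counting substring occurrences per class. The sketch assumes that "frequency" means a raw occurrence count and that matching is case insensitive; the dictionary keys and the function name are hypothetical, while the word lists are copied from the text above.

```python
URL_FEATURE_WORDS = {
    "news": ["story", "article", "content", "news", "xinwen"],
    "forum": ["detail", "thread-", "viewthread", "read-", "tid", "forum",
              "luntan", "bbs", "tieba", "guba", "shequ", "tiezi", "huitie",
              "post", "showtopic"],
    "blog": ["blog", "static", "boke"],
    "other": ["node", "main", "list", "index", "more", "category", "item",
              "default", "brief", "catid", "specia", "data", "club", "group",
              "rss", "board", "formblogger", "profile", "link", "search",
              "login", "front", "class", "forum-", "channel", "fid", "tag"],
}


def url_word_frequencies(url_str: str) -> dict:
    """Count occurrences of each class's feature words in the URL string."""
    lowered = url_str.lower()
    return {cls: sum(lowered.count(word) for word in words)
            for cls, words in URL_FEATURE_WORDS.items()}
```

Note that some words overlap across classes ("forum" and "forum-", for instance), so one substring can contribute to more than one class under this simple counting scheme.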

Score of URL type feature words: the position, that is the depth, of a URL type feature word within the URL also influences type identification, so the present invention scores URL type feature words. The scoring function for URL type feature words is:

$$
\mathrm{Score}(i) =
\begin{cases}
\dfrac{\sum_{j=1}^{D} \frac{2 \times E_{ij}}{D \times (D+1)} \times \log\left(\frac{2 \times E_{ij}}{D \times (D+1)}\right)}{\sum_{m=1}^{D} \frac{2 \times m}{D \times (D+1)} \times \log\left(\frac{2 \times m}{D \times (D+1)}\right)}, & \text{if } D \neq 1 \\
1, & \text{if } D = 1
\end{cases}
$$

where i denotes the i-th URL type feature word, D is the total depth of the URL, and j denotes the j-th depth level,
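
The scoring function above can be evaluated numerically as sketched below. The passage does not define E_ij, so the sketch takes the values E_i1..E_iD as an input list instead of guessing how they are derived. The convention that a zero term contributes 0 (the usual 0·log 0 = 0 limit) is an assumption, and the function name is hypothetical.

```python
import math
from typing import Sequence


def url_word_score(e_i: Sequence[float], depth: int) -> float:
    """Evaluate Score(i) for one feature word, given E_i1..E_iD and D."""
    if depth == 1:
        return 1.0
    norm = depth * (depth + 1)

    def p_log_p(x: float) -> float:
        p = 2 * x / norm
        return p * math.log(p) if p > 0 else 0.0  # assumed 0*log(0) = 0

    numerator = sum(p_log_p(e) for e in e_i)
    denominator = sum(p_log_p(m) for m in range(1, depth + 1))
    return numerator / denominator
```

Two properties follow directly from the formula: the weights 2m/(D(D+1)) in the denominator sum to 1, and the base of the logarithm cancels between numerator and denominator, so the choice of natural logarithm here does not affect the score.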

Step S220: read the source code of the training webpage and convert it into a DOM tree;

Step S230: traverse the nodes of the DOM tree in turn and generate the webpage features of the training webpage;
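
A single pass over the page is enough to gather the raw material for the webpage features. The sketch below substitutes Python's event-based `html.parser.HTMLParser` for an explicit DOM tree, which is a simplification chosen here and not the patent's stated implementation; the class name, the attribute names, and the decision to skip script and style text are likewise assumptions.

```python
from collections import Counter
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collect tag counts, text nodes, and link URLs in one pass."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.text_nodes = []
        self.link_urls = []
        self._in_code = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        attributes = dict(attrs)
        if tag in ("script", "style"):
            self._in_code = True
        if tag in ("a", "link") and attributes.get("href"):
            self.link_urls.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.link_urls.append(attributes["src"])

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._in_code = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self._in_code:
            self.text_nodes.append(text)


def scan_page(html: str) -> PageScanner:
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    return scanner
```

The three collected lists correspond to the inputs of the tag, text, and link features described in the following paragraphs.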

Preferably, the webpage features comprise text high-frequency word features, structural features, tag features, link features, and/or grammatical features, the grammatical features comprising punctuation features and sentence features;

The text high-frequency word feature is the frequency with which high-frequency feature words related to the webpage type occur in the text nodes of the DOM tree corresponding to the webpage source code. The high-frequency feature words comprise news-type, blog-type, and forum-type high-frequency feature words.

Preferably, the news-type high-frequency feature words comprise: "news", "text", "source", "information", "report", "daily paper", "Times", "evening paper", "it is reported", "reporter", "news", "media", "this newspaper", "special topic", "center", "editor", "channel", "important news", "current events", "responsible editor", "relevant report", "report", "source herein", "responsible editor", "keyword", "summary", "key word", "our publication", "submission", "original text source", "reporter", "issuing time", "contribution source", "contribution", "information issue", "issue", "literary composition/", "article", "focus", "gather and edit", "statement online", "copyright notice", "solemnly declare", "disclaimer", "copyright statement", "news", "editor", "story", "stories", "headline", "report", "newspaper", "xinwen", "peer link", "comment warmly", "news seniority among brothers and sisters" and/or "mobile phone is seen news";

Preferably, the blog-type high-frequency feature words comprise: "blog", "blog article", "bloger", "daily record", "piece of writing", "send out comment", "add concern", "greeting", "send out paper slip", "popularity", "classification", "filing", "file", "collection", "reading", "write message", "label", "classification", "classification", "blogger", "subscribe to this blog", "one piece", "subscription", "rich report", "weblog", "weblog", "blog", "journal", "diary", "postedby", "comments", "archive", "boke" and/or "message";

Preferably, the forum-type high-frequency feature words comprise: "forum", "community", "mhkc", "post", "reply", "quote", "building", "new post", "theme", "be published in", "reply in", "main subsides", "click", "model", "browse", "elite", "shielding", "title", "gold coin", "report", "complaint", "report", "money order receipt to be signed and returned to the sender", "follow-up", "prestige", "building-owner", "reply the date", "only see this author", "keeper", "member", "sofa", "stool", "floor", "integration", "grade", "deliver model", "short-message sending", "rank", "edition owner", "off-line", "online", "plusing good friend", "add as a friend", "post", "bean vermicelli", "send out personal letter", "concern", "register", "send short messages", "thigh", "bbs", "forum", "club", "tieba", "reply", "discussion", "luntan", "shequ", "tiezi", "huitie" and/or "thread".

The structural feature is the number of structure-type feature words contained in the content attributes of the "title" and "meta" tag nodes in the "head" subtree, together with the frequency with which the three font-size tags "h1", "h2" and "h3" occur in the whole DOM tree. Preferably, the structure-type feature words comprise news structure-type feature words, blog structure-type feature words and forum structure-type feature words. The news structure-type feature words are "news", "news" and/or "xinwen"; the blog structure-type feature words are "blog", "blog" and/or "boke"; the forum structure-type feature words are "club", "bbs", "forum", "thread", "tieba", "forum", "community", "mhkc", "model", "money order receipt to be signed and returned to the sender", "luntan", "shequ", "tieba", "tiezi" and/or "huitie";
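As an illustration only, the structural feature described above might be computed along the following lines. The sketch assumes that Python's standard `html.parser` module stands in for DOM traversal and that the text of the "title" node is treated like the content attribute of a "meta" node; the names `STRUCTURE_WORDS`, `StructureFeatureParser` and `structure_features` are hypothetical, and the word lists are only a subset of those given above.

```python
from html.parser import HTMLParser

# Illustrative subset of the structure-type feature words listed above.
STRUCTURE_WORDS = {
    "news": ["news", "xinwen"],
    "blog": ["blog", "boke"],
    "forum": ["club", "bbs", "forum", "thread", "tieba", "luntan"],
}


class StructureFeatureParser(HTMLParser):
    """Collects <title> text and <meta content> inside <head>; counts h1-h3."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_title = False
        self.head_text = []
        self.heading_counts = {"h1": 0, "h2": 0, "h3": 0}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "title" and self.in_head:
            self.in_title = True
        elif tag == "meta" and self.in_head:
            content = dict(attrs).get("content")
            if content:
                self.head_text.append(content)
        elif tag in self.heading_counts:
            self.heading_counts[tag] += 1

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.head_text.append(data)


def structure_features(html):
    """Return (feature-word counts per page type, h1/h2/h3 counts)."""
    parser = StructureFeatureParser()
    parser.feed(html)
    text = " ".join(parser.head_text).lower()
    word_counts = {
        page_type: sum(text.count(word) for word in words)
        for page_type, words in STRUCTURE_WORDS.items()
    }
    return word_counts, parser.heading_counts
```

The sketch returns raw counts; whether the heading counts are further divided by the total number of tags to give a frequency is not fixed by the text above and is left open here.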

The tag feature is the percentage of the webpage's total tags accounted for by 50 preferred common tags; the 50 common tags are:

" tbody ", " span ", " div ", " tr ", " td ", " table ", " ul ", " li ", " p ", " a ", " b ", " font ", " i ", " em ", " big ", " strong ", " small ", " sup ", " sub ", " u ", " br ", " hr ", " frame ", " frameset ", " noframes ", " iframe ", " form ", " input ", " textarea ", " button ", " select ", " option ", " label ", " fieldset ", " ol ", " dl ", " dt ", " dd ", " caption ", " th ", " thead ", " col ", " style ", " meta ", " script ", " noscript ", " applet ", " object ", " link " and/or " img ",

The link feature is the number or percentage of URLs that contain each class of URL type feature words among the attribute values of URL links; the attribute values comprise the href attribute values of "a" and "link" tags and/or the src attribute values of "img" tags;
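The tag feature and the link feature described above can be sketched together, since both come from a single pass over the start tags. The following is a minimal sketch under stated assumptions: the tag percentage is computed separately for each common tag, only a subset of the 50 tags is listed, and `COMMON_TAGS`, `URL_TYPE_WORDS` and `tag_and_link_features` are hypothetical names not taken from the disclosure.

```python
from collections import Counter
from html.parser import HTMLParser

# Subset of the 50 common tags, kept short for illustration.
COMMON_TAGS = ["div", "span", "table", "tr", "td", "ul", "li", "p", "a", "img"]

# Hypothetical URL type feature words per page type.
URL_TYPE_WORDS = {
    "news": ["news", "xinwen"],
    "blog": ["blog", "boke"],
    "forum": ["bbs", "forum", "thread"],
}


class TagLinkParser(HTMLParser):
    """Counts every start tag and collects href/src link values."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        attr_map = dict(attrs)
        if tag in ("a", "link") and attr_map.get("href"):
            self.urls.append(attr_map["href"])
        elif tag == "img" and attr_map.get("src"):
            self.urls.append(attr_map["src"])


def tag_and_link_features(html):
    parser = TagLinkParser()
    parser.feed(html)
    total_tags = sum(parser.tag_counts.values())
    # Share of all tags on the page taken by each common tag.
    tag_pct = {
        tag: (parser.tag_counts[tag] / total_tags if total_tags else 0.0)
        for tag in COMMON_TAGS
    }
    # Number of link URLs containing at least one word of each class.
    link_counts = {
        page_type: sum(
            1 for url in parser.urls
            if any(word in url.lower() for word in words)
        )
        for page_type, words in URL_TYPE_WORDS.items()
    }
    total_links = len(parser.urls)
    link_pct = {
        page_type: (count / total_links if total_links else 0.0)
        for page_type, count in link_counts.items()
    }
    return tag_pct, link_counts, link_pct
```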

The punctuation feature is the frequency with which Chinese and English punctuation marks occur in the text nodes of the DOM tree. The English punctuation marks comprise: {",", ".", "?", ";", "!", "-", ":"}; {",,", "..", "??", ";;", "!!", "--", "::"}. The Chinese punctuation marks comprise: {"，", "。", "？", "；", "！", "……", "、", "："}; {"，，", "。。", "？？", "；；", "！！", "…………", "、、", "：："}. Here ",," denotes a run of several "," rather than exactly two, and the same applies to the other doubled marks.

The sentence feature comprises the frequency with which the Chinese and English sentence marks occur in the text nodes of the DOM tree, the frequency with which each kind of sentence mark occurs, the total number of sentences and/or the average number of bytes per sentence. The English sentence marks comprise: {",", ".", "?", ";", "!", "-"}; {",,", "..", "??", ";;", "!!", "--"}. The Chinese sentence marks comprise: {"，", "。", "？", "；", "！", "……"}; {"，，", "。。", "？？", "；；", "！！", "…………"}. Because the punctuation feature and the sentence feature involve no language-dependent processing such as word segmentation, part-of-speech tagging or semantic analysis, their time complexity is much lower and they can be used across languages.
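The punctuation and sentence features lend themselves to a short regular-expression sketch. The version below is an assumption-laden illustration: it returns raw counts rather than frequencies, treats a doubled key such as ",," as "a run of two or more", splits sentences at every listed sentence mark, and measures sentence length in UTF-8 bytes. The names `SENTENCE_UNITS`, `punctuation_features` and `sentence_features` are hypothetical.

```python
import re

# Sentence marks from the lists above; "……" is treated as one unit.
SENTENCE_UNITS = [",", ".", "?", ";", "!", "-",
                  "，", "。", "？", "；", "！", "……"]


def punctuation_features(text, units=SENTENCE_UNITS):
    """Count isolated marks (key: unit) and runs of marks (key: unit * 2)."""
    features = {}
    for unit in units:
        runs = re.findall("(?:" + re.escape(unit) + ")+", text)
        features[unit] = sum(1 for run in runs if len(run) == len(unit))
        features[unit + unit] = sum(1 for run in runs if len(run) > len(unit))
    return features


def sentence_features(text, units=SENTENCE_UNITS):
    """Return (number of sentences, average sentence length in UTF-8 bytes)."""
    pattern = "(?:" + "|".join(re.escape(unit) for unit in units) + ")+"
    sentences = [part.strip() for part in re.split(pattern, text) if part.strip()]
    total = len(sentences)
    if total == 0:
        return 0, 0.0
    avg_bytes = sum(len(s.encode("utf-8")) for s in sentences) / total
    return total, avg_bytes
```

Because nothing here depends on tokenization, the same functions apply unchanged to Chinese and English text, which is the cross-language property the paragraph above points out.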

For the above features, users may add, delete or modify feature words according to actual needs. In addition, when the text high-frequency word feature and/or the grammar feature are generated, the traversed text nodes exclude invisible nodes and/or text nodes whose parent node is one of the following tag nodes: "script", "style", "object", "iframe", "textarea", "noscript", "noembed", "marquee", "frame", "frameset", "noframes", "form", "input", "button", "select", "option", "label", "fieldset", "applet", "optgroup", "legend", "isindex" and/or "param".
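The exclusion rule for text nodes can be sketched as follows. This is only an approximation under stated assumptions: a tag stack maintained by Python's standard `html.parser` stands in for the DOM parent relation, void elements are never pushed onto the stack, and detection of invisible nodes (for example via CSS) is not attempted. `EXCLUDED_PARENTS`, `VOID_TAGS` and `visible_text_nodes` are hypothetical names.

```python
from html.parser import HTMLParser

# Parent tags whose text nodes are skipped, as listed above.
EXCLUDED_PARENTS = {
    "script", "style", "object", "iframe", "textarea", "noscript", "noembed",
    "marquee", "frame", "frameset", "noframes", "form", "input", "button",
    "select", "option", "label", "fieldset", "applet", "optgroup", "legend",
    "isindex", "param",
}

# Elements without an end tag; they can never be the parent of a text node.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "frame", "hr", "img", "input",
    "isindex", "link", "meta", "param", "source", "track", "wbr",
}


class VisibleTextParser(HTMLParser):
    """Keeps text nodes whose direct parent is not an excluded tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate unclosed tags: pop back to the matching open tag, if any.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        parent = self.stack[-1] if self.stack else None
        if data.strip() and parent not in EXCLUDED_PARENTS:
            self.texts.append(data.strip())


def visible_text_nodes(html):
    parser = VisibleTextParser()
    parser.feed(html)
    return parser.texts
```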

Step S240: repeat the above process to extract URL features and webpage features from all training webpages, obtaining the following feature vectors: URL feature vector (R), high-frequency word text vector (C), tag vector (T), link vector (L), structure vector (U) and/or grammar vector (N).

None of the type features adopted in this embodiment requires manual intervention, matching of multiple DOM subtrees, or deep mining of the plain text, which ensures the speed of feature extraction. Moreover, apart from the text high-frequency word feature, the extraction of the features is largely language-independent: to process webpages in another language, only the definitions of the high-frequency words need to be modified, so the method is suitable for cross-language webpage identification.

Fig. 3 is a schematic flowchart of optimizing the above six feature vectors and building the classifier in this embodiment. This example uses two optimization passes to ensure that the classification accuracy is sufficiently high.

Step S310: standardize the URL feature vector and the webpage feature vectors, i.e. the feature vectors R, C, T, L, U and N, separately to obtain the standardized feature vectors. In this embodiment, the standardization formula is defined as

$$\mathrm{Std}_{ij} = \frac{f_{ij} - f_{i,\mathrm{avg}}}{f_{i,\max} - f_{i,\min}}$$

where $\mathrm{Std}_{ij}$ is the result of standardizing the j-th value of the i-th feature, $f_{ij}$ is the j-th value of the i-th feature before standardization, $f_{i,\mathrm{avg}}$ is the mean of the i-th feature, $f_{i,\max}$ is the maximum of the i-th feature, and $f_{i,\min}$ is the minimum of the i-th feature.
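A small worked sketch of this standardization may help. It computes the three parameters (mean, maximum, minimum) from the training values of one feature and then applies the formula to a single value. The function names `fit_params` and `standardize` are hypothetical, and mapping a constant feature (maximum equal to minimum) to 0 is an assumption made here to avoid division by zero; the text does not say how that case is handled.

```python
def fit_params(values):
    """Return (mean, maximum, minimum) of one feature over the training set."""
    return sum(values) / len(values), max(values), min(values)


def standardize(value, params):
    """Apply (value - mean) / (max - min) with the stored parameters."""
    mean, maximum, minimum = params
    if maximum == minimum:
        return 0.0  # assumption: a constant feature carries no information
    return (value - mean) / (maximum - minimum)
```

For the training values 1, 2, 3 and 6, the parameters are mean 3, maximum 6 and minimum 1, so the value 6 is standardized to (6 - 3) / (6 - 1) = 0.6 and the value 1 to -0.4.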

Step S320: perform the first optimization on each of the standardized feature vectors R, C, T, L, U and N, removing the redundant features and noise features that impair classification accuracy, to obtain the preferred feature vectors R', C', T', L', U' and N';

Step S330: combine the preferred feature vectors R', C', T', L', U' and N' into a feature set F, perform the second feature optimization on the feature set as a whole to generate the simplified feature set S, build the classifier according to the simplified feature set S, and generate the feature extractor. The feature extractor is configured with the set features to be extracted from a webpage to be identified, namely all features of the simplified feature set S, and with the standardization parameter values of each feature. Preferably, the standardization parameter values comprise the maximum, minimum and mean of each feature in the simplified feature set.

Step S340: input the simplified feature set S into the classifier to obtain the classification model.

Steps S320 and S330 optimize the standardized feature vectors. Feature optimization limits the influence of irrelevant or redundant features by finding a relevant feature subset in the original feature space. By reducing the number of irrelevant and redundant features and selecting a subset of features with good discriminative ability, the time needed for classification can be greatly reduced, and the classification results are often more accurate as well; that is, removing the redundant features and noise features that impair classification accuracy improves the accuracy of classification. In the field of information retrieval, the popular text-based feature selection methods currently include information gain (Information Gain, IG), mutual information (Mutual Information, MI), chi-square feature selection and relative frequency difference (Relative Frequency Difference, RFD). These methods mainly use statistical estimation to assign a relevance score to each feature; their advantage is speed, but they ignore the performance of the classifier when selecting relevant features. Considering the high requirement on identification accuracy in practical engineering applications, this embodiment preferably uses support vector machine recursive feature elimination (SVM_RFE), which has relatively high accuracy. The first optimization pass yields each preferred feature vector and ensures the quality of each type of feature vector; the second pass optimizes the features as a whole and further ensures the effect of the feature sets once combined, i.e. the classification accuracy.
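The two optimization passes can be illustrated with a deliberately small sketch. Everything below is an assumption for illustration: the linear SVM is a toy trained by full-batch subgradient descent on the hinge loss (in practice a library implementation would normally be used), labels are +1/-1 for a two-class case, and the numbers of features to keep (`keep_per_group`, `keep_total`) are hypothetical parameters, since the text does not state a stopping criterion. The names `train_linear_svm`, `svm_rfe` and `two_stage_selection` are likewise hypothetical.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Toy linear SVM: subgradient descent on the L2-regularised hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w


def svm_rfe(X, y, names, keep):
    """Retrain and drop the feature with the smallest squared weight."""
    cols = list(range(len(names)))
    while len(cols) > keep:
        w = train_linear_svm([[row[c] for c in cols] for row in X], y)
        worst = min(range(len(cols)), key=lambda j: w[j] ** 2)
        del cols[worst]
    return [names[c] for c in cols]


def two_stage_selection(groups, y, keep_per_group, keep_total):
    """First pass per feature vector, second pass over the combined set."""
    survivors, columns = [], []
    for names, X_group in groups.values():
        for name in svm_rfe(X_group, y, names, keep_per_group):
            idx = names.index(name)
            survivors.append(name)
            columns.append([row[idx] for row in X_group])
    X_all = [list(row) for row in zip(*columns)]
    return svm_rfe(X_all, y, survivors, keep_total)
```

In this sketch, `groups` maps a feature-vector name (for example "url" or "tag") to a pair of feature names and a row-per-page matrix of standardized values. The first pass prunes each group on its own, and the second pass prunes the union of the survivors, mirroring steps S320 and S330.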

The classifier may be a support vector machine (Support Vector Machine, SVM) model, naive Bayes (Naive Bayes, NB), a decision tree (C4.5), etc. In combination with the SVM_RFE feature optimization method, this embodiment uses SVM as the classifier.

The SVM_RFE feature selection method and the SVM classifier adopted in this embodiment are for illustration only; other feature selection methods and classifiers are equally applicable.

Fig. 4 is a schematic flowchart of classifying a webpage to be identified with the classifier in this embodiment, comprising the following steps:

Step S410: input the URL and source code of the webpage to be identified into the feature extractor;

Step S420: the feature extractor extracts the set features from the webpage to be identified, standardizes the set features according to the standardization parameters to obtain a standardized feature set, and inputs it into the classifier;

Step S430: the classifier classifies the webpage to be identified by type according to the standardized feature set and the classification model, and outputs the webpage type of the webpage to be identified.

Fig. 5 is a schematic structural diagram of the system for identifying webpage types of this embodiment. As shown in Fig. 5, the system comprises a rule memory, a rule matcher, a feature processor, a feature extractor and/or a classifier:

Rule memory, for storing the heuristic rule list;

Rule matcher, for performing rule matching against the rule list based on the uniform resource locator (URL) and source code of the webpage to be identified; if the URL and source code of the webpage to be identified satisfy the conditions defined by a heuristic rule, the rule matching succeeds, and the webpage type of the webpage to be identified is output according to the successfully matched rule;

Feature processor, for extracting the predefined predetermined features from the training webpages to form standardized feature vectors, optimizing the standardized feature vectors twice to form the simplified feature set, and building the classifier and the feature extractor based on the simplified feature set;

Feature extractor, for extracting the set features from the webpage to be identified according to the simplified feature set when the rule matching performed by the rule matcher fails; the feature extractor is also used for standardizing the extracted set features of the webpage to be identified according to the standardization parameter values of each feature stored in the feature storage unit.

Classifier, for generating the classification model, and for outputting the webpage type of the webpage to be identified according to the classification model and the set features of the webpage to be identified extracted by the feature extractor.
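The interplay of the components listed above (rule matching first, classifier as fallback) can be sketched as follows. This is a minimal sketch, not the disclosed implementation: rules are modelled as (predicate, webpage type) pairs, the classifier is any callable taking a standardized feature vector, and the names `FeatureExtractor` and `identify_page_type` as well as their fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class FeatureExtractor:
    """Holds the simplified feature set and its standardization parameters."""
    selected: List[str]                             # features to extract
    params: Dict[str, Tuple[float, float, float]]   # name -> (mean, max, min)
    extract_raw: Callable[[str, str], Dict[str, float]]

    def extract(self, url: str, html: str) -> List[float]:
        raw = self.extract_raw(url, html)
        vector = []
        for name in self.selected:
            mean, maximum, minimum = self.params[name]
            if maximum == minimum:
                vector.append(0.0)  # assumption: constant feature maps to 0
            else:
                vector.append((raw[name] - mean) / (maximum - minimum))
        return vector


def identify_page_type(url, html, rules, extractor, classify):
    """Return the type from the first matching rule, else ask the classifier."""
    for matches, page_type in rules:
        if matches(url, html):
            return page_type
    return classify(extractor.extract(url, html))
```

A usage example with a single hypothetical rule: `rules = [(lambda url, html: "/bbs/" in url, "forum")]` sends any URL containing "/bbs/" straight to the type "forum", and every other page goes through `extractor.extract` and the classifier. Keeping the rule list as plain data is one way to let rules be loaded or unloaded without touching the classifier.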

Preferably, the feature processor further comprises a URL feature extraction unit, a webpage feature extraction unit, a feature optimization unit, a feature storage unit and/or a construction unit:

URL feature extraction unit, for extracting the URL features from the URL string of a training webpage and generating the URL feature vector;

Webpage feature extraction unit, for reading the source code of the training webpage and converting the source code into a DOM tree, and for traversing the nodes of the DOM tree in turn, extracting the webpage features of the training webpage and generating the webpage feature vectors; the webpage features comprise the text high-frequency word feature, structural feature, tag feature, link feature and/or grammar feature;

Feature optimization unit, for standardizing the URL feature vector and the webpage feature vectors of the training webpages to obtain standardized feature vectors, and for optimizing the standardized feature vectors twice to generate the simplified feature set;

Feature storage unit, for storing all features of the simplified feature set and the standardization parameter values of each feature.

Construction unit, for building the feature extractor and the classifier according to the simplified feature set.

Preferably, the feature optimization unit comprises a first-level feature optimization unit and a second-level feature optimization unit. The first-level optimization unit is used for performing the first feature optimization on each standardized feature vector with the SVM_RFE method, removing the redundant features and noise features that impair classification accuracy, to obtain the preferred feature vectors. The second-level optimization unit is used for performing the second feature optimization on the feature set formed by the preferred feature vectors as a whole, generating the simplified feature set.

The system and method for identifying webpage types of the present invention are flexible both in their individual steps and in the use of their units. The heuristic rule component can be loaded or unloaded at any time according to actual needs. The predefined features involve no complicated processing, which ensures the speed of extraction, and the extraction of most features rarely involves language issues, so little change is needed for cross-language use. After feature extraction, two successive feature optimization passes ensure not only the quality of each feature set but also the effect of all feature sets once combined, which further ensures identification accuracy. Whether in terms of identification speed, identification accuracy or cross-language use, the system can meet practical engineering requirements.

The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims

1. A method for identifying webpage types, comprising the following steps:

2. The method according to claim 1, characterized in that: the predetermined features comprise URL features extracted from the URL string of a webpage and/or webpage features extracted from the nodes of the document object model (DOM) tree corresponding to the webpage source code, and the webpage comprises training webpages and webpages to be identified.

3. The method according to claim 2, characterized in that: if the URL ends with "/", the URL string is the string between the leading "http://" and the trailing "/" of the URL; if the URL does not end with "/", the URL string is the entire string after the leading "http://" of the URL.

4. The method according to claim 3, characterized in that: the URL features comprise any one or more of the following:

URL depth value, which is the number of "/" in the URL string plus 1;

URL dot count, which is the number of "." in the part of the URL string before the first "/";

URL date feature value, which indicates whether the URL string contains a date string matching a date regular expression; if such a date string exists and the date it represents is valid, the date feature value is set to "1"; otherwise, the date feature value is set to "0";

Frequency of URL type feature words, the URL type feature words being predefined feature words that indicate the webpage type; and/or

Score of URL type feature words, the scoring function of the URL type feature words being:

。

5. The method according to claim 4, characterized in that: the URL type feature words are type feature words used for determining the webpage type.

6. The method according to claim 2, characterized in that: the webpage features comprise a text high-frequency word feature, a structural feature, a tag feature, a link feature and/or a grammar feature, and the grammar feature comprises a punctuation feature and a sentence feature;

The text high-frequency word feature is the frequency with which high-frequency feature words related to the webpage type occur in the text nodes of the document object model (DOM) tree corresponding to the webpage source code, the high-frequency feature words being the text high-frequency feature words used for determining each webpage type;

The structural feature is the number of structure-type feature words contained in the content attributes of the "title" and "meta" tag nodes in the "head" subtree, together with the frequency with which the three font-size tags "h1", "h2" and "h3" occur in the whole DOM tree, the structure-type feature words being the structure-type feature words used for determining each webpage type;

The tag feature is the percentage of the webpage's total tags accounted for by 50 preset common tags;

The punctuation feature is the frequency with which Chinese and English punctuation marks occur in the text nodes of the DOM tree;

The sentence feature comprises the frequency with which the Chinese and English sentence marks occur in the text nodes of the DOM tree, the frequency with which each kind of sentence mark occurs, the total number of sentences and/or the average number of bytes per sentence.

7. The method according to any one of claims 1 to 6, wherein extracting the predetermined features of the training webpages and forming the feature vectors comprises the following steps:

Selecting a training webpage, and extracting the URL features from the URL string of the training webpage;

Reading the source code of the training webpage, and converting the source code into a DOM tree;

Traversing the nodes of the DOM tree in turn to generate the webpage features of the training webpage, the webpage features comprising the text high-frequency word feature, structural feature, tag feature, link feature and/or grammar feature;

Repeating the above process to extract URL features and webpage features from all training webpages, obtaining the following feature vectors: URL feature vector, high-frequency word text vector, tag vector, link vector, structure vector and/or grammar vector.

8. The method according to claim 7, characterized in that: when the text high-frequency word feature and/or the grammar feature are generated, the traversed text nodes exclude invisible nodes and/or text nodes whose parent node is one of the following tag nodes: "script", "style", "object", "iframe", "textarea", "noscript", "noembed", "marquee", "frame", "frameset", "noframes", "form", "input", "button", "select", "option", "label", "fieldset", "applet", "optgroup", "legend", "isindex" and/or "param".

9. The method according to claim 7, wherein optimizing the feature vectors and building the classifier comprises the following steps:

Standardizing the URL feature vector and the webpage feature vectors separately to obtain the standardized feature vectors;

Performing the first feature optimization on each standardized feature vector with support vector machine recursive feature elimination (SVM_RFE), removing the redundant features and noise features that impair classification accuracy, to obtain the preferred feature vectors;

Combining the preferred feature vectors into a feature set, performing the second feature optimization on the feature set as a whole to generate the simplified feature set, and building the classifier and the feature extractor according to the simplified feature set, the feature extractor being configured with the set features to be extracted from a webpage to be identified and with the standardization parameter values of each feature, the set features being all features of the simplified feature set;

Inputting the simplified feature set into the classifier to obtain the classification model.

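10. The method according to claim 9, characterized in that: the standardization formula is

$$\mathrm{Std}_{ij} = \frac{f_{ij} - f_{i,\mathrm{avg}}}{f_{i,\max} - f_{i,\min}}$$

where $\mathrm{Std}_{ij}$ is the result of standardizing the j-th value of the i-th feature, $f_{ij}$ is the j-th value of the i-th feature before standardization, $f_{i,\mathrm{avg}}$ is the mean of the i-th feature, $f_{i,\max}$ is the maximum of the i-th feature, and $f_{i,\min}$ is the minimum of the i-th feature.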

11. The method according to claim 9, characterized in that: the standardization parameter values comprise the maximum, minimum and mean of each feature in the simplified feature set.

12. The method according to claim 9, characterized in that: classifying a webpage to be identified with the classifier comprises the following steps:

Inputting the URL and source code of the webpage to be identified into the feature extractor;

The feature extractor extracting the set features from the webpage to be identified, standardizing the set features according to the standardization parameters to obtain a standardized feature set, and inputting it into the classifier;

The classifier classifying the webpage to be identified by type according to the standardized feature set and the classification model, and outputting the webpage type of the webpage to be identified.

13. A system for identifying webpage types, comprising:

14. The system according to claim 13, characterized in that: the feature processor further comprises:

Feature storage unit, for storing all features of the simplified feature set and the standardization parameter values of each feature;

15. The system according to claim 14, characterized in that: the feature optimization unit comprises a first-level feature optimization unit and a second-level feature optimization unit; the first-level optimization unit is used for performing the first feature optimization on each standardized feature vector with the SVM_RFE method, removing the redundant features and noise features that impair classification accuracy, to obtain the preferred feature vectors; the second-level optimization unit is used for performing the second feature optimization on the feature set formed by the preferred feature vectors as a whole, generating the simplified feature set.

16. The system according to claim 14, characterized in that: the feature extractor is also used for standardizing the extracted set features of the webpage to be identified according to the standardization parameter values of each feature stored in the feature storage unit.