CN103577547B - Webpage type identification method and device - Google Patents

Webpage type identification method and device

Info

Publication number
CN103577547B
Authority
CN
China
Prior art keywords
feature
webpage
page
type
purpose page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310476416.6A
Other languages
Chinese (zh)
Other versions
CN103577547A (en)
Inventor
梁捷
王磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Ucweb Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ucweb Inc
Priority to CN201310476416.6A
Publication of CN103577547A
Application granted
Publication of CN103577547B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/957: Browsing optimisation, e.g. caching or content distillation

Abstract

The invention discloses a webpage type identification method and device. The method includes: for each of multiple sample web pages of known webpage type, recording whether the page contains each of multiple purpose page features, to obtain a statistical result; analyzing the known webpage types and the statistical result of the sample web pages with a decision tree algorithm, to obtain a priority ranking of the purpose page features and a correspondence between purpose page features and webpage types; and searching the webpage to be identified for the purpose page features in the order given by the priority ranking, and determining the webpage type of the webpage to be identified from the search result and the correspondence. Compared with the prior art, the method uses sample web pages to rank multiple purpose page features by effectiveness. When identifying a webpage, it searches for the more effective purpose page features first and the less effective ones afterwards, which shortens the time spent on identification and improves recognition efficiency.

Description

Webpage type identification method and device
Technical field
The present invention relates to the field of mobile communications, and in particular to a webpage type identification method and device.
Background
A novel reader is software for reading and downloading novels. It can read novels stored locally and typically also supports downloading, reading, and searching online novels. Downloading or reading an online novel relies on the novel webpages on the internet: the novel is extracted from these webpages and reassembled into a suitable format for presentation to the user. Because the catalogue page and the content page of a web novel require different extraction algorithms, the webpage type usually has to be determined first, and the corresponding extraction algorithm is then applied according to that type.
Current methods for identifying webpage type are whitelist-based identification and page-keyword-based identification. The whitelist-based method adds each target website on the internet to a whitelist and uses a different recognition algorithm for the page features of each website in the whitelist. For example, novel websites such as "starting point net" and "I reads to net" each have their own layout, so a recognition algorithm is designed in advance for each website according to its layout and is used to distinguish the webpage types of the novels on that website. The page-keyword-based method identifies the webpage type according to whether the page contains keywords that distinguish catalogue pages from content pages. For example, if a webpage contains "set font", the current webpage is taken to be a content page.
Both methods have shortcomings. The whitelist-based method often cannot accurately identify the type of a webpage that has not been added to the whitelist. Moreover, because the number of webpages on the internet is huge and websites keep being added, the number of entries in the whitelist keeps growing, which makes maintenance very costly. The page-keyword-based method suffers from the large differences between webpages: the keywords used to distinguish webpage types do not apply to all webpages, so this method often cannot identify the webpage type accurately.
Summary of the invention
The embodiments of the invention provide a webpage type identification method and device, to solve the problem in the prior art that webpage types cannot be identified accurately.
To solve the above technical problem, in a first aspect, an embodiment of the invention discloses a webpage type identification method, including: for each of multiple sample web pages of known webpage type, recording whether the page contains each of multiple purpose page features, to obtain a statistical result; analyzing the known webpage types and the statistical result of the sample web pages with a decision tree algorithm, to obtain a priority ranking of the purpose page features and a correspondence between purpose page features and webpage types; and searching the webpage to be identified for the purpose page features in the order given by the priority ranking, and determining the webpage type of the webpage to be identified from the search result and the correspondence.
In a first possible implementation of the first aspect, the step of recording whether the sample web pages contain the purpose page features and obtaining the statistical result includes: judging, one by one, whether the sample web page contains a purpose page feature; when the sample web page contains the purpose page feature, recording a first feature; when the sample web page does not contain the purpose page feature, recording a second feature; and building a table containing the first features and second features recorded for all sample web pages, and using the table as the statistical result.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the step of analyzing the known webpage types and the statistical result with a decision tree algorithm includes: calculating the information gain of each of the purpose page features from the table; sorting the purpose page features in descending order of information gain to obtain the priority ranking of the purpose page features; and generating the correspondence between purpose page features and webpage types from the known webpage types of the sample web pages and the priority ranking of the purpose page features.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the information gain of each purpose page feature is calculated as follows: calculating, from the table, the ratio of the first feature and the ratio of the second feature for the purpose page feature; calculating the information entropy of the first feature and of the second feature respectively; calculating the conditional entropy of the purpose page feature from the information entropies of the first feature and the second feature; calculating the information entropy of the purpose page feature from the table; and subtracting the conditional entropy of the purpose page feature from its information entropy to obtain the information gain of the purpose page feature.
With reference to the first aspect or its first, second, or third possible implementation, in a fourth possible implementation of the first aspect, the step of searching the webpage to be identified and determining its webpage type includes: searching the webpage to be identified for the purpose page feature with the highest priority; judging whether that purpose page feature exists in the webpage to be identified; when it exists, looking up in the correspondence the webpage type that corresponds to the purpose page feature found, and taking that webpage type as the webpage type of the webpage to be identified; and when it does not exist, searching the webpage to be identified for the other purpose page features one by one in descending order of priority, until the webpage type of the webpage to be identified is found or until all purpose page features in the correspondence table have been searched.
In a second aspect, an embodiment of the invention discloses a webpage type identification device, including: a statistic unit, for recording, for each of multiple sample web pages of known webpage type, whether the page contains each of multiple purpose page features, to obtain a statistical result; an analytic unit, for analyzing the known webpage types and the statistical result of the sample web pages with a decision tree algorithm, to obtain a priority ranking of the purpose page features and a correspondence between purpose page features and webpage types; and a webpage type determining unit, for searching the webpage to be identified for the purpose page features in the order given by the priority ranking, and determining the webpage type of the webpage to be identified from the search result and the correspondence.
In a first possible implementation of the second aspect, the statistic unit includes: a first judging unit, for judging, one by one, whether the sample web page contains a purpose page feature; a recording unit, for recording a first feature when the sample web page contains the purpose page feature, and recording a second feature when it does not; and a table construction unit, for building a table containing the first features and second features recorded for all sample web pages, and using the table as the statistical result.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the analytic unit includes: an information gain computing unit, for calculating the information gain of each of the purpose page features from the table; a sequencing unit, for sorting the purpose page features in descending order of information gain to obtain the priority ranking of the purpose page features; and a correspondence generation unit, for generating the correspondence between purpose page features and webpage types from the known webpage types of the sample web pages and the priority ranking of the purpose page features.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the information gain computing unit includes: a ratio calculation unit, for calculating, from the table, the ratio of the first feature and the ratio of the second feature for the purpose page feature; a first information entropy computing unit, for calculating the information entropy of the first feature and of the second feature respectively; a conditional entropy computing unit, for calculating the conditional entropy of the purpose page feature from the information entropies of the first feature and the second feature; a second information entropy computing unit, for calculating the information entropy of the purpose page feature from the table; and an information gain computation subunit, for subtracting the conditional entropy of the purpose page feature from its information entropy to obtain the information gain of the purpose page feature.
With reference to the second aspect or its first, second, or third possible implementation, in a fourth possible implementation of the second aspect, the webpage type determining unit includes: a purpose page feature searching unit, for searching the webpage to be identified for the purpose page feature with the highest priority; a second judging unit, for judging whether that purpose page feature exists in the webpage to be identified; and a webpage type searching unit, for looking up in the correspondence, when that purpose page feature exists, the webpage type that corresponds to the purpose page feature found, and taking that webpage type as the webpage type of the webpage to be identified. When the purpose page feature with the highest priority does not exist in the webpage to be identified, the purpose page feature searching unit further searches the webpage to be identified for the other purpose page features one by one in descending order of priority, until the webpage type of the webpage to be identified is found or until all purpose page features in the correspondence table have been searched.
As can be seen from the above technical solutions, the webpage type identification method provided by the embodiments of the application first records which of multiple purpose page features each sample web page of known webpage type contains, obtaining a statistical result. It then analyzes the result with a decision tree algorithm to obtain the priority ranking of the purpose page features and the correspondence between purpose page features and webpage types. The priority ranking is the ranking of the purpose page features by how effectively they identify the webpage type. Finally, the method searches the webpage to be identified for the purpose page features in priority order, and determines the webpage type of the webpage to be identified from the search result and the correspondence.
Compared with the prior art, the method uses sample web pages to rank multiple purpose page features by effectiveness. When identifying a webpage, it searches for the more effective purpose page features first and the less effective ones afterwards, which improves recognition accuracy, shortens the time spent on identification, and improves recognition efficiency.
Brief description of the drawings
To describe the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings of the embodiments are briefly introduced below. A person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a webpage type identification method provided by an embodiment of the application;
Fig. 2 is a detailed flowchart of S100 provided by an embodiment of the application;
Fig. 3 is a detailed flowchart of S200 provided by an embodiment of the application;
Fig. 4 is a detailed flowchart of S201 provided by an embodiment of the application;
Fig. 5 is a visualization of the final correspondence between page features and webpage types obtained in an embodiment of the application;
Fig. 6 is a detailed flowchart of S300 provided by an embodiment of the application;
Fig. 7 is a structural diagram of a webpage type identification device provided by an embodiment of the application;
Fig. 8 is a structural diagram of the statistic unit provided by an embodiment of the application;
Fig. 9 is a structural diagram of the analytic unit provided by an embodiment of the application;
Fig. 10 is a structural diagram of the information gain computing unit provided by an embodiment of the application;
Fig. 11 is a structural diagram of the webpage type determining unit provided by an embodiment of the application.
Embodiments
To help those skilled in the art better understand the technical solutions in the embodiments of the invention, and to make the above objects, features, and advantages of the embodiments clearer, the technical solutions are described in further detail below with reference to the drawings.
Fig. 1 is a flowchart of a webpage type identification method provided by an embodiment of the application. The method includes the following steps:
S100: For each of multiple sample web pages of known webpage type, record whether the page contains each of multiple purpose page features, to obtain a statistical result.
Sample web pages of known webpage type can be chosen at random from novel websites, and their webpage types can include novel content pages, novel catalogue pages, and so on. A purpose page feature is a feature contained in a sample web page. Multiple purpose page features can be extracted from the sample web pages according to the word count of the page, characteristic keywords, or both; page features entered by a user can also be accepted. In other embodiments of the application, purpose page features can be chosen according to other parameters or obtained in other ways, which are not enumerated here.
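As an illustration of features based on word count and characteristic keywords, the following minimal sketch derives three 0/1 features from an HTML string. The feature names, the thresholds as parameters, the `chapter\s*\d+` pattern, and the function `extract_features` are all assumptions made for this example; the source does not specify how features are detected.

```python
import re
from html.parser import HTMLParser


class _TextAndLinks(HTMLParser):
    """Collects all text nodes and, separately, the text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.link_texts = []
        self._in_link = False
        self._current_link = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._current_link = []

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.link_texts.append("".join(self._current_link).strip())
            self._in_link = False

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._in_link:
            self._current_link.append(data)


# Assumed keyword pattern for chapter links; a real page may use other wording.
CHAPTER_LINK = re.compile(r"chapter\s*\d+", re.IGNORECASE)


def extract_features(html, word_threshold=1000, link_threshold=10):
    """Return hypothetical purpose page features as 1 (present) or 0 (absent)."""
    parser = _TextAndLinks()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.text_parts).split()
    chapter_links = [t for t in parser.link_texts if CHAPTER_LINK.search(t)]
    return {
        "more_than_1000_words": int(len(words) > word_threshold),
        "more_than_10_chapter_links": int(len(chapter_links) > link_threshold),
        "has_return_to_catalogue_link": int(
            any("return to catalogue" in t.lower() for t in parser.link_texts)
        ),
    }
```

Splitting on whitespace is only a rough word count and would need a different rule for Chinese text, which is not whitespace-delimited.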
In this embodiment, as shown in Fig. 2, the step may include the following steps:
S101: Judge, one by one, whether the sample web page contains a purpose page feature.
For each sample web page, judge whether it contains each purpose page feature. When the sample web page contains a given purpose page feature, go to S102; when it does not, go to S103.
S102: Record a first feature.
S103: Record a second feature.
The first feature and the second feature distinguish whether a sample web page contains a given purpose page feature, so the two must differ. In this embodiment the first feature can be 1 and the second feature can be 0. Using numerical values for this distinction is only a preferred embodiment of the application; other embodiments can use other means, for example different letters for the first and second features, or signals of different high and low levels.
S104: Build a table containing the first features and second features recorded for all sample web pages, and use the table as the statistical result.
Table 1 is an example of the statistical result for 24 sample web pages provided by an embodiment of the application. In this embodiment, the last column of the statistical result additionally records the webpage type of each sample web page: 1 when the sample web page is a content page and 0 when it is a catalogue page.
Table 1
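A table of this kind could be built as in the following minimal sketch. The page representation (a dictionary with `word_count` and `chapter_links`), the detector functions, and the two sample rows are hypothetical and are not taken from Table 1.

```python
def build_table(samples, detectors):
    """Build the statistical result: one row per sample web page.

    samples:   list of (page, webpage_type), webpage_type 1 = content page,
               0 = catalogue page.
    detectors: mapping of feature name -> function(page) -> bool.
    """
    table = []
    for page, webpage_type in samples:
        # First feature (1) if the page contains the feature, second feature (0) if not.
        row = {name: 1 if detect(page) else 0 for name, detect in detectors.items()}
        row["type"] = webpage_type  # last column: known webpage type
        table.append(row)
    return table


# Hypothetical detectors and samples, for illustration only.
detectors = {
    "more_than_1000_words": lambda page: page["word_count"] > 1000,
    "more_than_10_chapter_links": lambda page: page["chapter_links"] > 10,
}
samples = [
    ({"word_count": 3200, "chapter_links": 0}, 1),
    ({"word_count": 150, "chapter_links": 48}, 0),
]
table = build_table(samples, detectors)
```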
S200: Analyze the known webpage types and the statistical result of the sample web pages with a decision tree algorithm, to obtain the priority ranking of the purpose page features and the correspondence between purpose page features and webpage types.
In this embodiment, as shown in Fig. 3, the step may include the following steps:
S201: Calculate the information gain of each of the purpose page features from the table.
As shown in Fig. 4, the information gain of each purpose page feature can be calculated as follows:
S2011: Calculate, from the table, the ratio of the first feature and the ratio of the second feature for the purpose page feature.
The ratio of the first feature is the proportion of each webpage type among the sample web pages recorded with the first feature. Taking Table 1 as an example, for the purpose page feature "page has more than 1000 words" there are two possible webpage types: 1 (content page) and 0 (catalogue page). The first feature (value 1) is recorded for 13 sample web pages. Of these, 12 have webpage type 1, a proportion of 12/13, and 1 has webpage type 0, a proportion of 1/13. The first feature therefore has two ratios: 12/13 and 1/13.
S2012: Calculate the information entropy of the first feature and of the second feature respectively.
The amount of information contained in an information source is the average uncertainty over all the messages the source may emit; this quantity is called the information entropy. Suppose a source has n messages and one of them, x, occurs with probability p(x). The information content of x is then:
$$I(x) = -\log p(x) \tag{1}$$
Unit of information: with base 2 the unit is the bit; with base e the unit is the nat; with base 10 the unit is the hart.
The information entropy is calculated as:
$$H = \sum_{i=1}^{n} p(x_i)\, I(x_i) = -\sum_{i=1}^{n} p(x_i) \log p(x_i) \tag{2}$$
where n is the number of messages of the source, the messages are x_1, x_2, ..., x_i, ..., x_n, p(x_i) is the probability that the i-th message x_i occurs, and I(x_i) is the information content of x_i.
In other words, the information entropy of a source is the sum, over its messages, of each message's probability multiplied by its information content. In this embodiment, the first feature of "page has more than 1000 words" is represented by 1 and the second feature by 0. The information entropy of the first feature 1 is:
$$H(1) = -\frac{12}{13}\log\frac{12}{13} - \frac{1}{13}\log\frac{1}{13} \tag{3}$$
Correspondingly, the second feature has two ratios, 11/11 and 0, so the information entropy of the second feature 0 is (taking 0·log 0 = 0 by convention):
$$H(0) = -\frac{11}{11}\log\frac{11}{11} - 0\cdot\log 0 = 0 \tag{4}$$
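The two entropies can be checked numerically with a short sketch. The function name `entropy` and the choice of base 2 are assumptions for this example (the source leaves the base open); terms with probability 0 are skipped, which applies the usual convention 0·log 0 = 0.

```python
import math


def entropy(probabilities, base=2):
    """Information entropy: -sum(p * log(p)); zero-probability terms contribute 0."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


# Ratios from the worked example for "page has more than 1000 words".
h_first = entropy([12 / 13, 1 / 13])   # first feature (value 1)
h_second = entropy([11 / 11, 0 / 11])  # second feature (value 0)

print(round(h_first, 4), h_second)
```

With base 2, the first feature carries roughly 0.39 bits of uncertainty about the webpage type, while the second feature carries none, because every sample web page without the feature has the same type.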
S2013: Calculate the conditional entropy of the purpose page feature from the information entropies of the first feature and the second feature.
For the first feature, the conditional entropy equals the information entropy of the first feature multiplied by the fraction of all sample web pages that are recorded with the first feature.
In this embodiment, the conditional entropy of the first feature is therefore:
$$K(1) = H(1)\cdot\frac{13}{24} = \left[-\frac{12}{13}\log\frac{12}{13} - \frac{1}{13}\log\frac{1}{13}\right]\cdot\frac{13}{24} \tag{5}$$
Similarly, the conditional entropy of the second feature is (again with 0·log 0 = 0):
$$K(0) = H(0)\cdot\frac{11}{24} = \left[-\frac{11}{11}\log\frac{11}{11} - 0\cdot\log 0\right]\cdot\frac{11}{24} = 0 \tag{6}$$
The conditional entropy of a purpose page feature equals the sum of the conditional entropy of its first feature and the conditional entropy of its second feature, so the conditional entropy of the purpose page feature can be calculated as K(1) + K(0).
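The weighted sum K(1) + K(0) can be sketched from per-type counts as follows. The helper names and the count layout (`[content pages, catalogue pages]` per feature value) are assumptions for this example; the counts themselves are the ones given in the worked example, and base 2 is assumed.

```python
import math


def entropy_from_counts(counts):
    """Entropy (base 2) of a distribution given as raw counts; zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def conditional_entropy(groups):
    """groups maps each feature value to its per-type sample counts.

    Each group's entropy is weighted by the share of samples in that group.
    """
    n = sum(sum(counts) for counts in groups.values())
    return sum(
        (sum(counts) / n) * entropy_from_counts(counts) for counts in groups.values()
    )


# Feature value 1: 12 content pages, 1 catalogue page; value 0: 0 and 11.
groups = {1: [12, 1], 0: [0, 11]}
k_total = conditional_entropy(groups)  # K(1) + K(0)
```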
S2014: Calculate the information entropy of the purpose page feature from the table.
Taking Table 1 as an example, 12 of the 24 sample web pages are content pages and 12 are catalogue pages, so the information entropy for the purpose page feature "page has more than 1000 words" is:
$$H(\text{page has more than 1000 words}) = -\frac{12}{24}\log\frac{12}{24} - \frac{12}{24}\log\frac{12}{24} \tag{7}$$
S2015: Subtract the conditional entropy of the purpose page feature from its information entropy to obtain the information gain of the purpose page feature.
In this embodiment, taking the purpose page feature "page has more than 1000 words" as an example, the corresponding information gain is:
$$\left[-\frac{12}{24}\log\frac{12}{24} - \frac{12}{24}\log\frac{12}{24}\right] - \left[-\frac{12}{13}\log\frac{12}{13} - \frac{1}{13}\log\frac{1}{13}\right]\cdot\frac{13}{24} - \left[-\frac{11}{11}\log\frac{11}{11} - 0\cdot\log 0\right]\cdot\frac{11}{24}$$
The above calculation gives the information gain of one purpose page feature. The other purpose page features are calculated in the same way, as shown in Fig. 4, which finally gives the information gain of every page feature.
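The whole of S2011 to S2015 can be sketched as a single function over table rows. The rows below are reconstructed only from the counts stated in the worked example (12 + 1 pages with the feature, 11 without), not from Table 1 itself; the column name `f`, the function names, and the use of base 2 are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (base 2) of a list of webpage-type labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature, target="type"):
    """Entropy of the webpage type minus its conditional entropy given the feature."""
    base = entropy([r[target] for r in rows])
    cond = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        cond += len(subset) / len(rows) * entropy(subset)
    return base - cond


# 24 rows matching the counts in the text: feature f = "page has more than 1000 words".
rows = (
    [{"f": 1, "type": 1}] * 12   # feature present, content page
    + [{"f": 1, "type": 0}]      # feature present, catalogue page
    + [{"f": 0, "type": 0}] * 11 # feature absent, catalogue page
)
gain = information_gain(rows, "f")
print(round(gain, 3))
```

Under these assumptions the gain comes out at about 0.788 bits, that is, 1 bit of entropy in the webpage type minus a conditional entropy of about 0.212 bits.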
S202: Sort the purpose page features in descending order of information gain to obtain the priority ranking of the purpose page features.
S203: Generate the correspondence between purpose page features and webpage types from the known webpage types of the sample web pages and the priority ranking of the purpose page features.
In the correspondence between page features and webpage types, one page feature can correspond to one webpage type, or a combination of several page features can correspond to one webpage type. The final correspondence between page features and webpage types in this embodiment is:
{'more than 10 links whose text contains "Chapter xx"': {'0': {'page has more than 1000 words': {'0': 'catalogue page', '1': 'content page'}}, '1': 'catalogue page'}}
Fig. 5 is a visualization of the final correspondence between page features and webpage types obtained in this embodiment. As Fig. 5 shows, the correspondence table finally obtained in this embodiment does not contain the purpose page feature "a link exists whose text contains 'return to catalogue'". This purpose page feature is not sufficient for judging the page type and is therefore an invalid feature; to avoid interference, invalid features can be deleted from the correspondence table.
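One way to arrive at a nested correspondence of this shape is an ID3-style recursion, sketched below. This is an assumption: the source only says "decision tree algorithm" and does not name ID3. The ten sample rows, the feature names, and the use of integer keys 0/1 are made up so that the example yields a tree with the same structure as the result above.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature):
    cond = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r["type"] for r in rows if r[feature] == value]
        cond += len(subset) / len(rows) * entropy(subset)
    return entropy([r["type"] for r in rows]) - cond


def build_tree(rows, features):
    """Split on the highest-gain feature; recurse until a branch has one type."""
    labels = [r["type"] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]
    if not features:  # nothing left to split on: fall back to the majority type
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, f))
    remaining = [f for f in features if f != best]
    return {best: {v: build_tree([r for r in rows if r[best] == v], remaining)
                   for v in sorted({r[best] for r in rows})}}


LINKS, WORDS, RETURN = "chapter_links_gt_10", "words_gt_1000", "return_link"


def row(links, words, ret, webpage_type):
    return {LINKS: links, WORDS: words, RETURN: ret, "type": webpage_type}


# Hypothetical samples (not Table 1): 6 catalogue pages, 4 content pages.
rows = [
    row(1, 0, 1, "catalogue page"), row(1, 0, 0, "catalogue page"),
    row(1, 0, 1, "catalogue page"), row(1, 1, 0, "catalogue page"),
    row(1, 1, 1, "catalogue page"), row(0, 0, 0, "catalogue page"),
    row(0, 1, 1, "content page"), row(0, 1, 1, "content page"),
    row(0, 1, 0, "content page"), row(0, 1, 0, "content page"),
]
tree = build_tree(rows, [LINKS, WORDS, RETURN])
```

In this made-up data the "return link" feature has zero information gain at the root and never becomes a split, which mirrors how an invalid feature ends up absent from the final correspondence.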
S300: Search the webpage to be identified for the purpose page features in the order given by the priority ranking, and determine the webpage type of the webpage to be identified from the search result and the correspondence.
In this embodiment, as shown in Fig. 6, the step may include the following steps:
S301: Search the webpage to be identified for the purpose page feature with the highest priority.
The greater the information gain of a purpose page feature, the more accurately it determines the page type. In this embodiment, the purpose page features are therefore searched one by one in descending order of priority.
S302: Judge whether the purpose page feature with the highest priority exists in the webpage to be identified.
When the purpose page feature with the highest priority exists in the webpage to be identified, go to S303; when it does not exist, go to S304.
S303: Look up in the correspondence the webpage type that corresponds to the purpose page feature found, take that webpage type as the webpage type of the webpage to be identified, and end.
S304: Judge whether all purpose page features in the correspondence table have been searched for in the webpage to be identified.
If so, end; otherwise perform S305.
S305: Sort the remaining purpose page features, which have not yet been searched for in the webpage to be identified, by information gain and return to S301, searching the webpage to be identified for the re-sorted purpose page features one by one until all purpose page features in the correspondence table have been searched.
In other embodiments of the application, the information gain of the remaining purpose page features can also be recalculated, the purpose page features re-sorted according to the recalculated information gain, and the webpage to be identified searched in the resulting order.
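The lookup can be sketched as a walk over the nested correspondence, evaluating each feature only when it is reached, so that lower-priority features are never computed if a higher-priority one already decides the type. This is one interpretation of S301 to S305, not the patent's exact control flow; the tree literal, the detector functions, and the page representation are hypothetical.

```python
def classify(tree, detectors, page):
    """Walk the correspondence from the highest-priority feature downwards.

    Returns (webpage_type or None, list of features actually checked).
    """
    node, checked = tree, []
    while isinstance(node, dict):
        feature, branches = next(iter(node.items()))
        value = 1 if detectors[feature](page) else 0  # first / second feature
        checked.append(feature)
        if value not in branches:  # no entry in the correspondence: undetermined
            return None, checked
        node = branches[value]
    return node, checked


# Hypothetical correspondence with the same shape as the final result above.
tree = {
    "chapter_links_gt_10": {
        1: "catalogue page",
        0: {"words_gt_1000": {0: "catalogue page", 1: "content page"}},
    }
}
detectors = {
    "chapter_links_gt_10": lambda page: page["chapter_links"] > 10,
    "words_gt_1000": lambda page: page["word_count"] > 1000,
}

catalogue = classify(tree, detectors, {"chapter_links": 40, "word_count": 300})
content = classify(tree, detectors, {"chapter_links": 2, "word_count": 2500})
```

For the catalogue-like page only the first feature is evaluated, which is where the saving in identification time comes from.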
The webpage type identification method provided by this embodiment first records which of multiple purpose page features each sample web page of known webpage type contains, obtaining a statistical result. It then analyzes the result with a decision tree algorithm to obtain the priority ranking of the purpose page features and the correspondence between purpose page features and webpage types. The priority ranking is the ranking of the purpose page features by how effectively they identify the webpage type. Finally, the method searches the webpage to be identified for the purpose page features in priority order, and determines the webpage type of the webpage to be identified from the search result and the correspondence.
Compared with the prior art, the method uses sample web pages to rank multiple purpose page features by effectiveness. When identifying a webpage, it searches for the more effective purpose page features first and the less effective ones afterwards, which improves recognition accuracy, shortens the time spent on identification, and improves recognition efficiency.
In addition, when the method is applied on a mobile phone or another mobile terminal, the backend of the mobile terminal can obtain the priority ranking of the purpose page features and the correspondence between purpose page features and webpage types in advance, and store them. When a foreground application on the mobile terminal, such as a browser, needs to identify the webpage type of a webpage, it can read the stored priority ranking and correspondence directly and use them to identify the webpage type. This reduces the computational complexity of the foreground application and speeds up its webpage type identification.
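A minimal sketch of this split between precomputation and lookup is shown below, assuming the priority ranking and correspondence are stored as JSON text. The storage format and the function names are assumptions; the source does not say how the data is stored. String keys "0" and "1" are used because JSON object keys are always strings.

```python
import json


def serialize_model(priority, tree):
    """Backend side: pack the precomputed ranking and correspondence into text."""
    return json.dumps({"priority": priority, "tree": tree}, ensure_ascii=False)


def deserialize_model(text):
    """Foreground side: read the stored ranking and correspondence back."""
    model = json.loads(text)
    return model["priority"], model["tree"]


# Hypothetical precomputed result with the same shape as the example correspondence.
priority = ["chapter_links_gt_10", "words_gt_1000"]
tree = {
    "chapter_links_gt_10": {
        "1": "catalogue page",
        "0": {"words_gt_1000": {"0": "catalogue page", "1": "content page"}},
    }
}
stored = serialize_model(priority, tree)
loaded_priority, loaded_tree = deserialize_model(stored)
```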
Fig. 7 is a schematic structural diagram of a webpage type identification device provided by an embodiment of the present application.
As shown in Fig. 7, the webpage type identification device includes:
Statistic unit 1, configured to count, for each sample webpage of a plurality of known webpage types, whether it contains each of a plurality of purpose page features, to obtain a statistical result;
Analytic unit 2, configured to analyze the known webpage types and the statistical result of the plurality of sample webpages using a decision tree algorithm, to obtain the priority ranking of the purpose page features and the correspondence between purpose page features and webpage types;
Webpage type determining unit 3, configured to search for the purpose page features one by one in the webpage to be identified according to the priority ranking, and to determine the webpage type of the webpage to be identified according to the search result and the correspondence.
As shown in Fig. 8, in the embodiment of the present application, statistic unit 1 can include:
First judging unit 11, configured to judge, one by one, whether the sample webpage contains each purpose page feature;
Recording unit 12, configured to record a first feature when the sample webpage contains the purpose page feature, and to record a second feature when the sample webpage does not contain the purpose page feature;
Form construction unit 13, configured to build a form containing the first-feature or second-feature record of every sample webpage, and to use the form as the statistical result.
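The three sub-units above can be sketched as a single pass over the sample webpages. This is a minimal sketch under stated assumptions: the detectors in `FEATURES`, the sample data, and the encoding of the first feature as `1` and the second feature as `0` are hypothetical choices, not taken from the patent.

```python
from typing import Callable, Dict, List

# Hypothetical purpose page features: each detector maps page HTML to True/False.
FEATURES: Dict[str, Callable[[str], bool]] = {
    "has_video_tag": lambda html: "<video" in html,
    "has_next_page_link": lambda html: "next page" in html.lower(),
}


def build_table(samples: List[dict]) -> List[dict]:
    """Build the statistical result: one row per sample webpage."""
    rows = []
    for sample in samples:
        row = {"type": sample["type"]}  # known webpage type of the sample
        for name, detect in FEATURES.items():
            # 1 stands for the "first feature" (present), 0 for the "second feature" (absent).
            row[name] = 1 if detect(sample["html"]) else 0
        rows.append(row)
    return rows


# Made-up sample webpages with known types.
SAMPLES = [
    {"type": "video", "html": "<html><video src='a.mp4'></video></html>"},
    {"type": "novel", "html": "<html><a href='/2'>Next page</a></html>"},
]
TABLE = build_table(SAMPLES)
```

Each row of `TABLE` plays the role of one line of the form: the known webpage type plus one present/absent record per purpose page feature.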
As shown in Fig. 9, in the embodiment of the present application, analytic unit 2 can include:
Information gain computing unit 21, configured to calculate the information gain of each of the plurality of purpose page features according to the form;
Sequencing unit 22, configured to sort the plurality of purpose page features in descending order of information gain, to obtain the priority ranking of the purpose page features;
Corresponding relation generation unit 23, configured to generate the correspondence between purpose page features and webpage types according to the known webpage types of the plurality of sample webpages and the priority ranking of the purpose page features.
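The sorting and correspondence steps can be sketched as follows, assuming the information gains have already been computed. The text does not spell out how the correspondence is derived, so mapping each feature to the most common webpage type among the samples that contain it is an assumption of this sketch; the function names, feature names, and gain values are likewise hypothetical.

```python
from collections import Counter
from typing import Dict, List


def rank_features(gains: Dict[str, float]) -> List[str]:
    """Sort feature names by descending information gain (ties broken by name)."""
    return sorted(gains, key=lambda name: (-gains[name], name))


def build_correspondence(table: List[dict], ranking: List[str]) -> Dict[str, str]:
    """Map each feature to the most common type among samples containing it (assumed rule)."""
    mapping: Dict[str, str] = {}
    for feature in ranking:
        types = Counter(row["type"] for row in table if row[feature] == 1)
        if types:  # skip features that no sample webpage contains
            mapping[feature] = types.most_common(1)[0][0]
    return mapping


# Made-up gains and statistical table.
GAINS = {"has_next_page_link": 0.25, "has_video_tag": 0.92}
TABLE = [
    {"type": "video", "has_video_tag": 1, "has_next_page_link": 0},
    {"type": "video", "has_video_tag": 1, "has_next_page_link": 0},
    {"type": "novel", "has_video_tag": 0, "has_next_page_link": 1},
]
RANKING = rank_features(GAINS)
CORRESPONDENCE = build_correspondence(TABLE, RANKING)
```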
As shown in Fig. 10, in the embodiment of the present application, information gain computing unit 21 can include:
Ratio calculation unit 211, configured to calculate, according to the form, the ratio of the first feature and the ratio of the second feature corresponding to a purpose page feature;
First information entropy computing unit 212, configured to calculate the information entropy of the first feature and of the second feature respectively;
Conditional entropy computing unit 213, configured to calculate the conditional entropy of the purpose page feature according to the information entropies of the first feature and the second feature;
Second information entropy computing unit 214, configured to calculate the information entropy of the purpose page feature according to the form;
Information gain computation subunit 215, configured to subtract the conditional entropy of the purpose page feature from the information entropy of the purpose page feature, to obtain the information gain of the purpose page feature.
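The chain of sub-units above (entropies of the two feature groups, conditional entropy, overall entropy, and their difference) can be sketched with the standard Shannon formulation used in ID3-style decision trees. This is an assumption: the patent's exact formulas may differ in detail, and the names `entropy`, `information_gain`, and the sample table are hypothetical.

```python
import math
from collections import Counter
from typing import List


def entropy(labels: List[str]) -> float:
    """Shannon entropy (base 2) of a list of webpage types."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(table: List[dict], feature: str, label_key: str = "type") -> float:
    """Entropy of the webpage types minus the conditional entropy given the feature."""
    total = len(table)
    if total == 0:
        return 0.0
    all_labels = [row[label_key] for row in table]
    with_feature = [row[label_key] for row in table if row[feature] == 1]     # first feature
    without_feature = [row[label_key] for row in table if row[feature] == 0]  # second feature
    # Each group's entropy is weighted by the group's share of all sample webpages.
    conditional = (len(with_feature) / total) * entropy(with_feature) \
        + (len(without_feature) / total) * entropy(without_feature)
    return entropy(all_labels) - conditional


# Made-up table: "f1" separates the two types perfectly, "f2" not at all.
TABLE = [
    {"type": "video", "f1": 1, "f2": 1},
    {"type": "video", "f1": 1, "f2": 0},
    {"type": "novel", "f1": 0, "f2": 1},
    {"type": "novel", "f1": 0, "f2": 0},
]
GAIN_F1 = information_gain(TABLE, "f1")
GAIN_F2 = information_gain(TABLE, "f2")
```

In this toy table a feature that splits the sample webpages cleanly by type receives the larger gain, which is why it would be searched for first.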
As shown in Figure 10, in the embodiment of the present application, webpage type determining unit 3 can include:
Purpose page feature searching unit 31, configured to search the webpage to be identified for the purpose page feature with the highest priority ranking;
Second judging unit 32, configured to judge whether the purpose page feature with the highest priority ranking exists in the webpage to be identified;
Webpage type searching unit 33, configured to, when the purpose page feature with the highest priority ranking exists in the webpage to be identified, look up in the correspondence the webpage type corresponding to the existing purpose page feature, and to use the webpage type found as the webpage type of the webpage to be identified.
When the purpose page feature with the highest priority ranking does not exist in the webpage to be identified, sequencing unit 22 can also sort, by information gain, the remaining purpose page features that have not yet been searched for in the webpage to be identified, and purpose page feature searching unit 31 then searches for the other purpose page features one by one in the webpage to be identified in descending order of priority ranking, until the webpage type of the webpage to be identified is found or until all purpose page features in the mapping table have been searched for.
In addition, in other embodiments of the present application, when the purpose page feature with the highest priority ranking does not exist in the webpage to be identified, analytic unit 2 can also recalculate the information gain of the remaining purpose page features, sort the purpose page features according to the recalculated information gain, and determine the correspondence between purpose page features and webpage types. Purpose page feature searching unit 31 then searches for the re-sorted purpose page features one by one in the webpage to be identified, until all purpose page features in the mapping table have been searched for.
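The lookup in descending priority order can be sketched as a short loop. This is a minimal sketch that uses a fixed, precomputed ranking; it does not implement the alternative embodiment in which the information gain is recalculated after a miss. The function name `identify_page_type`, the detectors, and the sample data are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def identify_page_type(
    html: str,
    ranking: List[str],
    detectors: Dict[str, Callable[[str], bool]],
    correspondence: Dict[str, str],
) -> Optional[str]:
    """Search features in priority order; return the type of the first one found."""
    for feature in ranking:  # highest priority first
        if detectors[feature](html) and feature in correspondence:
            return correspondence[feature]
    return None  # every feature was searched for and none matched


# Made-up detectors, ranking, and correspondence.
DETECTORS = {
    "has_video_tag": lambda html: "<video" in html,
    "has_next_page_link": lambda html: "next page" in html.lower(),
}
RANKING = ["has_video_tag", "has_next_page_link"]
CORRESPONDENCE = {"has_video_tag": "video", "has_next_page_link": "novel"}

RESULT_VIDEO = identify_page_type("<video src='x'>", RANKING, DETECTORS, CORRESPONDENCE)
RESULT_NOVEL = identify_page_type("<a>Next page</a>", RANKING, DETECTORS, CORRESPONDENCE)
RESULT_NONE = identify_page_type("<p>hello</p>", RANKING, DETECTORS, CORRESPONDENCE)
```

Because the loop returns on the first match, a page that contains a high-priority feature is classified without evaluating the lower-priority detectors, which is where the time saving described in the text would come from.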
Compared with the prior art, the device uses the sample webpages to rank the plurality of purpose page features by effectiveness. When a webpage to be identified is processed, the more effective purpose page features are searched for first and the less effective ones afterwards, which improves recognition accuracy, shortens the time spent on identification, and improves recognition efficiency.
It can be understood that the present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and so on.
The present invention can be described in the general context of computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present invention can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
It should be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The above is only a specific embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (4)

  1. A webpage type identification method, characterized by comprising:
    counting, for each sample webpage of a plurality of known webpage types, whether it contains each of a plurality of purpose page features, to obtain a statistical result, including: judging one by one whether the sample webpage contains each purpose page feature; when the sample webpage contains the purpose page feature, recording a first feature; when the sample webpage does not contain the purpose page feature, recording a second feature different from the first feature; and building a form containing the first-feature or second-feature record of every sample webpage, the form being used as the statistical result;
    analyzing the known webpage types and the statistical result of the plurality of sample webpages using a decision tree algorithm, wherein:
    the information gain of each of the plurality of purpose page features is calculated according to the form, the plurality of purpose page features are sorted in descending order of information gain to obtain the priority ranking of the purpose page features, and the correspondence between purpose page features and webpage types is generated according to the known webpage types of the plurality of sample webpages and the priority ranking of the purpose page features, wherein the step of calculating the information gain of each purpose page feature includes:
    calculating, according to the form, the ratio of the first feature and the ratio of the second feature corresponding to the purpose page feature, wherein the ratio of the first feature is the probability that the sample webpages corresponding to the first feature have the same webpage type, and the ratio of the second feature is the probability that the sample webpages corresponding to the second feature have the same webpage type;
    calculating the information entropy of the first feature and of the second feature respectively;
    calculating the conditional entropy of the purpose page feature according to the information entropies of the first feature and the second feature, wherein: the information entropy of the first feature is multiplied by the ratio of the quantity of the first feature to the total quantity of sample webpages to obtain the conditional entropy of the first feature; the information entropy of the second feature is multiplied by the ratio of the quantity of the second feature to the total quantity of sample webpages to obtain the conditional entropy of the second feature; and the sum of the conditional entropy of the first feature and the conditional entropy of the second feature is used as the conditional entropy of the purpose page feature;
    calculating, according to the form, the information entropy of the purpose page feature using a method different from the method used to calculate the information entropies of the first feature and the second feature, wherein: a first ratio of the quantity of the first feature of the webpage type of the sample webpages to the total quantity of sample webpages, and a second ratio of the quantity of the second feature of the webpage type of the sample webpages to the total quantity of sample webpages, are first calculated; the base-2 logarithm of the first ratio and the base-2 logarithm of the second ratio are then taken respectively; and the sum of the two resulting logarithm values is used as the information entropy of the purpose page feature;
    subtracting the conditional entropy of the purpose page feature from the information entropy of the purpose page feature to obtain the information gain of the purpose page feature;
    searching for the purpose page features one by one in the webpage to be identified according to the priority ranking, and determining the webpage type of the webpage to be identified according to the search result and the correspondence, including:
    searching whether the purpose page feature with the highest priority ranking exists in the webpage to be identified; when the purpose page feature with the highest priority ranking exists, looking up in the correspondence the webpage type corresponding to the existing purpose page feature, and using the webpage type found as the webpage type of the webpage to be identified;
    when the purpose page feature with the highest priority ranking does not exist in the webpage to be identified, searching for the other purpose page features one by one in the webpage to be identified in descending order of priority ranking, until the webpage type of the webpage to be identified is found.
  2. The method according to claim 1, characterized in that the correspondence between purpose page features and webpage types includes: one page feature corresponding to one webpage type, and/or a combination of multiple page features corresponding to one webpage type.
  3. A webpage type identification device, characterized by comprising:
    a statistic unit, configured to count, for each sample webpage of a plurality of known webpage types, whether it contains each of a plurality of purpose page features, to obtain a statistical result, the statistic unit including:
    a first judging unit, configured to judge, one by one, whether the sample webpage contains each purpose page feature;
    a recording unit, configured to record a first feature when the sample webpage contains the purpose page feature, and to record a second feature different from the first feature when the sample webpage does not contain the purpose page feature;
    a form construction unit, configured to build a form containing the first-feature or second-feature record of every sample webpage, and to use the form as the statistical result;
    an analytic unit, configured to analyze the known webpage types and the statistical result of the plurality of sample webpages using a decision tree algorithm, to obtain the priority ranking of the purpose page features and the correspondence between purpose page features and webpage types;
    a webpage type determining unit, configured to search for the purpose page features one by one in the webpage to be identified according to the priority ranking, and to determine the webpage type of the webpage to be identified according to the search result and the correspondence, including: searching whether the purpose page feature with the highest priority ranking exists in the webpage to be identified; when the purpose page feature with the highest priority ranking exists, looking up in the correspondence the webpage type corresponding to the existing purpose page feature, and using the webpage type found as the webpage type of the webpage to be identified; and when the purpose page feature with the highest priority ranking does not exist in the webpage to be identified, searching for the other purpose page features one by one in the webpage to be identified in descending order of priority ranking, until the webpage type of the webpage to be identified is found;
    wherein the analytic unit includes:
    an information gain computing unit, configured to calculate the information gain of each of the plurality of purpose page features according to the form;
    a sequencing unit, configured to sort the plurality of purpose page features in descending order of information gain, to obtain the priority ranking of the purpose page features;
    a corresponding relation generation unit, configured to generate the correspondence between purpose page features and webpage types according to the known webpage types of the plurality of sample webpages and the priority ranking of the purpose page features;
    wherein the information gain computing unit includes:
    a ratio calculation unit, configured to calculate, according to the form, the ratio of the first feature and the ratio of the second feature corresponding to the purpose page feature, wherein the ratio of the first feature is the probability that the sample webpages corresponding to the first feature have the same webpage type, and the ratio of the second feature is the probability that the sample webpages corresponding to the second feature have the same webpage type;
    a first information entropy computing unit, configured to calculate the information entropy of the first feature and of the second feature respectively;
    a conditional entropy computing unit, configured to calculate the conditional entropy of the purpose page feature according to the information entropies of the first feature and the second feature, including: multiplying the information entropy of the first feature by the ratio of the quantity of the first feature to the total quantity of sample webpages to obtain the conditional entropy of the first feature; multiplying the information entropy of the second feature by the ratio of the quantity of the second feature to the total quantity of sample webpages to obtain the conditional entropy of the second feature; and using the sum of the conditional entropy of the first feature and the conditional entropy of the second feature as the conditional entropy of the purpose page feature;
    a second information entropy computing unit, configured to calculate, according to the form, the information entropy of the purpose page feature using a method different from the method used to calculate the information entropies of the first feature and the second feature, wherein: a first ratio of the quantity of the first feature of the webpage type of the sample webpages to the total quantity of sample webpages, and a second ratio of the quantity of the second feature of the webpage type of the sample webpages to the total quantity of sample webpages, are first calculated; the base-2 logarithm of the first ratio and the base-2 logarithm of the second ratio are then taken respectively; and the sum of the two resulting logarithm values is used as the information entropy of the purpose page feature;
    an information gain computation subunit, configured to subtract the conditional entropy of the purpose page feature from the information entropy of the purpose page feature, to obtain the information gain of the purpose page feature.
  4. The device according to claim 3, characterized in that the correspondence between purpose page features and webpage types includes: one page feature corresponding to one webpage type, and/or a combination of multiple page features corresponding to one webpage type.
CN201310476416.6A 2013-10-12 2013-10-12 Webpage type identification method and device Active CN103577547B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310476416.6A CN103577547B (en) 2013-10-12 2013-10-12 Webpage type identification method and device

Publications (2)

Publication Number Publication Date
CN103577547A CN103577547A (en) 2014-02-12
CN103577547B true CN103577547B (en) 2017-11-10

Family

ID=50049323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310476416.6A Active CN103577547B (en) 2013-10-12 2013-10-12 Webpage type identification method and device

Country Status (1)

Country Link
CN (1) CN103577547B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080244715A1 (en) * 2007-03-27 2008-10-02 Tim Pedone Method and apparatus for detecting and reporting phishing attempts
CN101251855B (en) * 2008-03-27 2010-12-22 腾讯科技(深圳)有限公司 Equipment, system and method for cleaning internet web page
CN101650715B (en) * 2008-08-12 2011-06-29 厦门市美亚柏科信息股份有限公司 Method and device for screening links on web pages
CN101534306B (en) * 2009-04-14 2012-01-11 深圳市腾讯计算机系统有限公司 Detecting method and a device for fishing website
CN101794311B (en) * 2010-03-05 2012-06-13 南京邮电大学 Fuzzy data mining based automatic classification method of Chinese web pages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《基于决策树的分类算法研究》;王宏威;《软件导论》;20070930(第17期);第134-135页 *

Also Published As

Publication number Publication date
CN103577547A (en) 2014-02-12

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200527

Address after: 310051 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 100080, room 16, building 10-20, Building 29, Haidian District, Suzhou Street, Beijing

Patentee before: UC MOBILE Ltd.