CN103577547A - Webpage type identification method and device - Google Patents

Webpage type identification method and device

Info

Publication number
CN103577547A
Authority
CN
China
Prior art keywords
webpage
object page
page feature
type
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310476416.6A
Other languages
Chinese (zh)
Other versions
CN103577547B (en
Inventor
梁捷
王磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Ucweb Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ucweb Inc
Priority to CN201310476416.6A
Publication of CN103577547A
Application granted
Publication of CN103577547B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation

Abstract

The invention discloses a webpage type identification method and device. The method comprises the steps of respectively carrying out statistics in a plurality of sample webpages with known webpage types to judge whether a plurality of target page characteristics are contained or not and to obtain a statistic result; analyzing the known webpage types of the sample webpages and the statistic result by using a decision tree algorithm to obtain the priority ranking of the target page characteristics and the corresponding relationship between the target page characteristics and the webpage types; sequentially finding the target page characteristics in the webpages to be identified according to the priority ranking and determining the webpage types of the webpages to be identified according to the finding result and the corresponding relationship. Compared with the prior art, the method can be used for ranking the effectiveness of the target page characteristics by using the sample webpages; when the webpages to be identified are identified, the target page characteristics with higher effectiveness are firstly found according to the ranking, then the target page characteristics with lower effectiveness are found, time consumed by identification is shortened, and the identification efficiency is improved.

Description

Webpage type identification method and device
Technical field
The present invention relates to the field of mobile communications, and in particular to a webpage type identification method and device.
Background
A novel reader is software for reading and downloading novels. Besides reading and downloading local novels, it generally also supports downloading, reading, and searching online novels. Downloading or reading an online novel is based on the novel webpages on the Internet: the novel is extracted from these webpages and then reassembled into a suitable format for presentation to the user. Because the catalog pages and content pages of a web novel require different extraction algorithms, the webpage type of the novel page usually has to be determined first, and the corresponding extraction algorithm is then applied.
Current methods for identifying webpage type are white-list-based identification and page-keyword-based identification. The white-list method places each target website on the Internet in a white list and applies a different identification algorithm according to the page features of each site in the list. For example, novel sites such as "Starting Point" and "I Read" each have their own page layout, so an identification algorithm is designed in advance for each site according to its layout in order to distinguish the webpage types of that site's novels. The page-keyword method identifies the webpage type according to whether the page contains a keyword that distinguishes catalog pages from content pages; for example, if a webpage contains "set font", the current webpage is considered a content page.
Both the white-list method and the page-keyword method have shortcomings. The white-list method often cannot accurately identify the webpage type of pages that have not been added to the white list, and because the number of webpages and websites on the Internet is huge and keeps growing, the white list also keeps growing, which makes it very costly to maintain. As for the page-keyword method, webpages differ greatly from one another, so the keywords used to distinguish webpage types may not apply to every webpage, and the method therefore often cannot identify the webpage type accurately.
Summary of the invention
Embodiments of the present invention provide a webpage type identification method and device, to solve the prior-art problem that webpage types cannot be identified accurately.
To solve the above technical problem, in a first aspect, an embodiment of the present invention discloses a webpage type identification method, comprising: counting, in each of a plurality of sample webpages of known webpage type, whether each of a plurality of target page features is contained, to obtain a statistical result; analyzing the known webpage types of the sample webpages and the statistical result with a decision tree algorithm, to obtain a priority ranking of the target page features and a correspondence between the target page features and webpage types; and searching for the target page features in a webpage to be identified in the order of the priority ranking, and determining the webpage type of the webpage to be identified according to the search result and the correspondence.
In a first possible implementation of the first aspect, the step of counting, in each of the sample webpages of known webpage type, whether each of the target page features is contained, to obtain a statistical result, comprises: judging one by one whether the sample webpage contains a target page feature; when the sample webpage contains the target page feature, recording a first feature value; when the sample webpage does not contain the target page feature, recording a second feature value; and building a table containing the first and second feature values of all sample webpages, and taking the table as the statistical result.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the step of analyzing the known webpage types of the sample webpages and the statistical result with a decision tree algorithm, to obtain the priority ranking of the target page features and the correspondence between the target page features and webpage types, comprises: calculating the information gain of each of the target page features according to the table; sorting the target page features in descending order of information gain to obtain the priority ranking of the target page features; and generating the correspondence between target page features and webpage types according to the known webpage types of the sample webpages and the priority ranking of the target page features.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the information gain of each target page feature is calculated as follows: calculating, according to the table, the ratios of the first feature value and the ratios of the second feature value of the target page feature; calculating the information entropy of the first feature value and of the second feature value respectively; calculating the conditional entropy of the target page feature from the information entropies of the first and second feature values; calculating the information entropy of the target page feature according to the table; and subtracting the conditional entropy of the target page feature from its information entropy to obtain the information gain of the target page feature.
With reference to the first aspect, or the first, second, or third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the step of searching for the target page features in the webpage to be identified in the order of the priority ranking, and determining the webpage type of the webpage to be identified according to the search result and the correspondence, comprises: searching the webpage to be identified for the target page feature with the highest priority; judging whether the highest-priority target page feature exists in the webpage to be identified; when it exists, looking up in the correspondence the webpage type corresponding to the existing target page feature, and taking the webpage type found as the webpage type of the webpage to be identified; when it does not exist, searching the webpage to be identified for the other target page features in descending order of priority, until the webpage type of the webpage to be identified is found, or until all target page features in the mapping table have been searched.
In a second aspect, an embodiment of the present invention discloses a webpage type identification device, comprising: a statistics unit, configured to count, in each of a plurality of sample webpages of known webpage type, whether each of a plurality of target page features is contained, to obtain a statistical result; an analysis unit, configured to analyze the known webpage types of the sample webpages and the statistical result with a decision tree algorithm, to obtain a priority ranking of the target page features and a correspondence between the target page features and webpage types; and a webpage type determining unit, configured to search for the target page features in a webpage to be identified in the order of the priority ranking, and to determine the webpage type of the webpage to be identified according to the search result and the correspondence.
In a first possible implementation of the second aspect, the statistics unit comprises: a first judging unit, configured to judge one by one whether the sample webpage contains a target page feature; a recording unit, configured to record a first feature value when the sample webpage contains the target page feature, and a second feature value when the sample webpage does not contain the target page feature; and a table building unit, configured to build a table containing the first and second feature values of all sample webpages, and to take the table as the statistical result.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the analysis unit comprises: an information gain calculating unit, configured to calculate the information gain of each of the target page features according to the table; a sorting unit, configured to sort the target page features in descending order of information gain to obtain the priority ranking of the target page features; and a correspondence generating unit, configured to generate the correspondence between target page features and webpage types according to the known webpage types of the sample webpages and the priority ranking of the target page features.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the information gain calculating unit comprises: a ratio calculating unit, configured to calculate, according to the table, the ratios of the first feature value and the ratios of the second feature value of the target page feature; a first information entropy calculating unit, configured to calculate the information entropy of the first feature value and of the second feature value respectively; a conditional entropy calculating unit, configured to calculate the conditional entropy of the target page feature from the information entropies of the first and second feature values; a second information entropy calculating unit, configured to calculate the information entropy of the target page feature according to the table; and an information gain calculating subunit, configured to subtract the conditional entropy of the target page feature from its information entropy to obtain the information gain of the target page feature.
With reference to the second aspect, or the first, second, or third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the webpage type determining unit comprises: a target page feature searching unit, configured to search the webpage to be identified for the target page feature with the highest priority; a second judging unit, configured to judge whether the highest-priority target page feature exists in the webpage to be identified; and a webpage type lookup unit, configured to, when the highest-priority target page feature exists in the webpage to be identified, look up in the correspondence the webpage type corresponding to the existing target page feature, and take the webpage type found as the webpage type of the webpage to be identified. When the highest-priority target page feature does not exist in the webpage to be identified, the target page feature searching unit further searches the webpage to be identified for the other target page features in descending order of priority, until the webpage type of the webpage to be identified is found, or until all target page features in the mapping table have been searched.
As the above technical solution shows, the webpage type identification method provided by the embodiments of the present application first counts which of a plurality of target page features are contained in each of a plurality of sample webpages of known webpage type, obtaining a statistical result of the sample webpages against the target page features. It then applies a decision tree algorithm to obtain the priority ranking of the target page features and the correspondence between target page features and webpage types; the priority ranking is the ranking of how effectively each target page feature identifies the webpage type. Finally, it searches the webpage to be identified for the target page features in the order of the priority ranking, and determines the webpage type of the webpage to be identified according to the search result and the correspondence between target page features and webpage types.
Compared with the prior art, the method uses sample webpages to rank the effectiveness of the target page features. When a webpage is identified, the more effective target page features are searched first and the less effective ones afterwards, which improves identification accuracy, shortens the time spent on identification, and improves identification efficiency.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings of the embodiments are briefly described below. Obviously, those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a webpage type identification method provided by an embodiment of the present application;
Fig. 2 is a detailed flowchart of S100 provided by an embodiment of the present application;
Fig. 3 is a detailed flowchart of S200 provided by an embodiment of the present application;
Fig. 4 is a detailed flowchart of S201 provided by an embodiment of the present application;
Fig. 5 is a visual schematic diagram of the final correspondence between page features and webpage types obtained in an embodiment of the present application;
Fig. 6 is a detailed flowchart of S300 provided by an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a webpage type identification device provided by an embodiment of the present application;
Fig. 8 is a schematic structural diagram of the statistics unit provided by an embodiment of the present application;
Fig. 9 is a schematic structural diagram of the analysis unit provided by an embodiment of the present application;
Fig. 10 is a schematic structural diagram of the information gain calculating unit provided by an embodiment of the present application;
Fig. 11 is a schematic structural diagram of the webpage type determining unit provided by an embodiment of the present application.
Detailed description
To enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, and to make the above objects, features, and advantages of the embodiments more apparent, the technical solutions in the embodiments are described in further detail below with reference to the drawings.
Referring to Fig. 1, a schematic flowchart of a webpage type identification method provided by an embodiment of the present application, the method comprises the following steps:
S100: Count, in each of a plurality of sample webpages of known webpage type, whether each of a plurality of target page features is contained, to obtain a statistical result.
The sample webpages of known webpage type can be chosen at random from the webpages of novel websites, and their webpage types can include novel content pages, novel catalog pages, and so on. A target page feature is a feature contained in the sample webpages. Target page features can be extracted from the sample webpages according to the word count of the webpage, feature keywords, or both; in addition, page features entered by a user can also be received. In other embodiments of the present application, target page features can be chosen according to other parameters, which are not enumerated here, and can be obtained in other ways.
In this embodiment of the present application, as shown in Fig. 2, this step can comprise the following steps:
S101: Judge one by one whether the sample webpage contains a target page feature.
For each sample webpage, judge whether it contains each target page feature. When the sample webpage contains a given target page feature, perform S102; when it does not, perform S103.
S102: Record a first feature value.
S103: Record a second feature value.
The first and second feature values distinguish whether a sample webpage contains a given target page feature, so they must differ from each other. In this embodiment, the first feature value can be 1 and the second feature value can be 0. Using numbers to mark whether a sample webpage contains a target page feature is only a preferred embodiment of the present application; other embodiments can mark it in other ways, for example with two different letters, or with different high and low level signals.
S104: Build a table containing the first and second feature values of all sample webpages, and take the table as the statistical result.
Table 1 shows an example of the statistical result for 24 sample webpages provided by an embodiment of the present application. In this embodiment, the webpage type of each sample webpage is added to the statistical result in the last column: a content page is represented by 1 and a catalog page by 0.
Table 1 (rendered as an image in the original publication; not reproduced here)
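The table built in S101 to S104 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patent's implementation: the feature names, the detector functions, the use of character length as the word count, and the two toy samples are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical detectors: each returns True when the page text has the feature.
FEATURES: Dict[str, Callable[[str], bool]] = {
    "page_over_1000_words": lambda text: len(text) > 1000,  # assumption: length in characters
    "has_set_font_keyword": lambda text: "set font" in text,
}

def build_table(samples: List[Tuple[str, int]]) -> List[Dict[str, int]]:
    """Build one row per sample: 1/0 per feature plus the known type.

    samples holds (page_text, page_type), with page_type 1 for a content
    page and 0 for a catalog page, as in Table 1.
    """
    table = []
    for text, page_type in samples:
        # S102/S103: record 1 (first feature value) or 0 (second feature value).
        row = {name: int(detect(text)) for name, detect in FEATURES.items()}
        row["type"] = page_type  # last column: known webpage type
        table.append(row)
    return table

# Two toy samples standing in for the 24 sample webpages.
samples = [("x" * 1500 + " set font", 1), ("chapter list", 0)]
table = build_table(samples)
```

Each row is a plain dictionary, so the resulting table can be passed directly to an information gain calculation.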
S200: Analyze the known webpage types of the sample webpages and the statistical result with a decision tree algorithm, to obtain the priority ranking of the target page features and the correspondence between target page features and webpage types.
In this embodiment of the present application, as shown in Fig. 3, this step can comprise the following steps:
S201: Calculate the information gain of each of the target page features according to the table.
The information gain of each target page feature can be calculated as shown in Fig. 4:
S2011: Calculate, according to the table, the ratios of the first feature value and the ratios of the second feature value of the target page feature.
A ratio of the first feature value is the probability that the sample webpages recorded with the first feature value have a given webpage type. Taking Table 1 as an example, for the feature "page exceeds 1000 words" there are two possible webpage types: 1 (content page) and 0 (catalog page). The first feature value 1 occurs 13 times; of these, 12 have webpage type 1, giving a probability of 12/13, and 1 has webpage type 0, giving a probability of 1/13. The first feature value therefore has two ratios: 12/13 and 1/13.
S2012: Calculate the information entropy of the first feature value and of the second feature value respectively.
The amount of information contained in a source is the average uncertainty over all the messages the source may send; this quantity is called the information entropy. Suppose a source has n messages and the probability that one message x occurs is $p_x$; the amount of information contained in x is then:
$$I_x = -\log(p_x) \qquad (1)$$
Units of information: with base 2 the unit is the bit; with base e the unit is the nat; with base 10 the unit is the hart.
The information entropy is calculated as:
$$H(x) = \sum_{i=1}^{n} p_{x_i} I_{x_i} = -\sum_{i=1}^{n} p_{x_i} \log(p_{x_i}) \qquad (2)$$
where n is the number of messages contained in the source, the messages are $x_1, x_2, \ldots, x_i, \ldots, x_n$, $x_i$ is the i-th message, $p_{x_i}$ is the probability that $x_i$ occurs, and $I_{x_i}$ is the amount of information of $x_i$.
Thus the information entropy of a source is the sum, over the messages it contains, of each message's probability multiplied by its amount of information. In this embodiment, for the feature "page exceeds 1000 words" the first feature value is represented by 1 and the second feature value by 0, so the information entropy of first feature value 1 is:
$$H(1) = -\tfrac{12}{13}\log\left(\tfrac{12}{13}\right) - \tfrac{1}{13}\log\left(\tfrac{1}{13}\right) \qquad (3)$$
Correspondingly, the second feature value has two ratios, 11/11 and 0, so the information entropy of second feature value 0 is:
$$H(0) = -\tfrac{11}{11}\log\left(\tfrac{11}{11}\right) - 0 \cdot \log(0) = 0 \qquad (4)$$
where the term $0 \cdot \log(0)$ is taken as 0 by the usual convention.
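Formula (2) and the two entropies above can be checked numerically with a short sketch. The function name `entropy` and the choice of base 2 are assumptions made for illustration; terms with zero probability are skipped, following the usual convention that 0·log(0) is 0.

```python
import math
from typing import Sequence

def entropy(probabilities: Sequence[float], base: float = 2.0) -> float:
    """Information entropy -sum(p * log(p)); zero-probability terms contribute 0."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

# Ratios for "page exceeds 1000 words" taken from the text above.
h_first = entropy([12 / 13, 1 / 13])   # first feature value 1
h_second = entropy([11 / 11, 0.0])     # second feature value 0
```

With base 2 the results are in bits: `h_first` is roughly 0.39 and `h_second` is 0, because every sample webpage without the feature is a catalog page.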
S2013: Calculate the conditional entropy of the target page feature from the information entropies of the first and second feature values.
For the first feature value, the conditional entropy equals the information entropy of the first feature value multiplied by the fraction of sample webpages recorded with the first feature value.
So in this embodiment, the conditional entropy of the first feature value is:
$$K(1) = H(1) \cdot \tfrac{13}{24} = \left[-\tfrac{12}{13}\log\left(\tfrac{12}{13}\right) - \tfrac{1}{13}\log\left(\tfrac{1}{13}\right)\right] \cdot \tfrac{13}{24} \qquad (5)$$
Similarly, the conditional entropy of the second feature value is:
$$K(0) = H(0) \cdot \tfrac{11}{24} = \left[-\tfrac{11}{11}\log\left(\tfrac{11}{11}\right) - 0 \cdot \log(0)\right] \cdot \tfrac{11}{24} \qquad (6)$$
The conditional entropy of a target page feature equals the sum of the conditional entropies of its first and second feature values, so the conditional entropy of the target page feature can be calculated as K(1) + K(0).
S2014: Calculate the information entropy of the target page feature according to the table.
Taking Table 1 as an example, the 24 sample webpages consist of 12 content pages and 12 catalog pages, so the information entropy for the target page feature "page exceeds 1000 words" is:
$$H(\text{page exceeds 1000 words}) = -\tfrac{12}{24}\log\left(\tfrac{12}{24}\right) - \tfrac{12}{24}\log\left(\tfrac{12}{24}\right) \qquad (7)$$
S2015: Subtract the conditional entropy of the target page feature from its information entropy to obtain the information gain of the target page feature.
In this embodiment, taking the target page feature "page exceeds 1000 words" as an example, the corresponding information gain is:
$$-\tfrac{12}{24}\log\left(\tfrac{12}{24}\right) - \tfrac{12}{24}\log\left(\tfrac{12}{24}\right) - \left[-\tfrac{12}{13}\log\left(\tfrac{12}{13}\right) - \tfrac{1}{13}\log\left(\tfrac{1}{13}\right)\right] \cdot \tfrac{13}{24} - \left[-\tfrac{11}{11}\log\left(\tfrac{11}{11}\right) - 0 \cdot \log(0)\right] \cdot \tfrac{11}{24}$$
The above calculation gives the information gain of one target page feature. The other target page features are calculated in the same way, as shown in Fig. 4, finally giving the information gain of every page feature.
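Steps S2011 to S2015 can be sketched end to end as follows. This is an illustrative Python sketch, not the patent's code: the function names, the column names `over_1000` and `type`, and the use of base-2 logarithms are assumptions. The 24 rows reproduce only the counts stated in the text for "page exceeds 1000 words" (12 content and 1 catalog page with the feature, 11 catalog pages without it).

```python
import math
from collections import Counter
from typing import Dict, List

def entropy(labels: List[int]) -> float:
    """Entropy in bits of the label distribution within a list of samples."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows: List[Dict[str, int]], feature: str, label: str = "type") -> float:
    """Entropy of the webpage types minus the conditional entropy given the feature."""
    base = entropy([r[label] for r in rows])
    conditional = 0.0
    for value in (0, 1):  # second feature value, first feature value
        subset = [r[label] for r in rows if r[feature] == value]
        if subset:
            # Weight each branch entropy by the fraction of samples in that branch.
            conditional += len(subset) / len(rows) * entropy(subset)
    return base - conditional

rows = (
    [{"over_1000": 1, "type": 1}] * 12
    + [{"over_1000": 1, "type": 0}] * 1
    + [{"over_1000": 0, "type": 0}] * 11
)
gain = information_gain(rows, "over_1000")
```

In bits, the gain for this feature comes to roughly 0.79: the base entropy is 1 and the conditional entropy is about 0.21.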
S202: Sort the target page features in descending order of information gain to obtain the priority ranking of the target page features.
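The sorting in S202 amounts to ordering the features by their gains. In the sketch below, the feature names and the gain values are placeholders chosen for illustration and are not figures from the patent.

```python
# Placeholder gains (illustrative values only, not taken from the patent).
gains = {
    "page_over_1000_words": 0.788,
    "many_chapter_links": 0.9,
    "has_return_to_catalog_link": 0.05,
}

# Descending information gain: the first entry has the highest priority.
priority = sorted(gains, key=gains.get, reverse=True)
```

A feature whose gain is close to zero ends up last in the ranking, which fits the later remark that a feature unable to distinguish page types can be dropped from the mapping table.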
S203: Generate the correspondence between target page features and webpage types according to the known webpage types of the sample webpages and the priority ranking of the target page features.
In the correspondence between page features and webpage types, one page feature can correspond to one webpage type, or a combination of several page features can correspond to one webpage type. In this embodiment, the final correspondence between page features and webpage types is:
{'more than 10 links whose content contains "Chapter xx"': {'0': {'page exceeds 1000 words': {'0': 'catalog page', '1': 'content page'}}, '1': 'catalog page'}}
Fig. 5 is a visual schematic diagram of this final correspondence between page features and webpage types. As Fig. 5 shows, the final mapping table in this embodiment does not contain the target page feature "a link whose content contains 'return to catalog'". This is because that target page feature is not sufficient for judging the page type; it is an invalid feature, so it can be deleted from the mapping table to avoid interference.
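The nested correspondence above can be read as a small decision tree: each feature name maps to a branch for "0" (feature absent) and "1" (feature present), and a leaf is a webpage type. A minimal sketch of walking such a structure is shown below; the shortened key names and the function `classify` are hypothetical, while the tree shape follows the correspondence given in the text.

```python
from typing import Any, Dict

# Same shape as the correspondence in the text, with shortened hypothetical keys.
MAPPING: Dict[str, Any] = {
    "more_than_10_chapter_links": {
        "0": {"page_over_1000_words": {"0": "catalog page", "1": "content page"}},
        "1": "catalog page",
    }
}

def classify(tree: Any, features: Dict[str, int]) -> str:
    """Follow the branch matching each feature value until a leaf (a type) is reached."""
    while isinstance(tree, dict):
        # Each inner node holds exactly one feature name and its branches.
        feature, branches = next(iter(tree.items()))
        tree = branches[str(features[feature])]
    return tree
```

For example, `classify(MAPPING, {"more_than_10_chapter_links": 0, "page_over_1000_words": 1})` follows the "0" branch of the first feature and the "1" branch of the second, ending at the content page leaf.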
S300: Search for the target page features in the webpage to be identified in the order of the priority ranking, and determine the webpage type of the webpage to be identified according to the search result and the correspondence.
In this embodiment of the present application, as shown in Fig. 6, this step can comprise the following steps:
S301: Search the webpage to be identified for the target page feature with the highest priority.
The larger the information gain of a target page feature, the more accurately it determines the page type. In this embodiment, the target page features are therefore searched one after another in descending order of priority.
S302: Judge whether the highest-priority target page feature exists in the webpage to be identified.
When the highest-priority target page feature exists in the webpage to be identified, perform S303; when it does not exist, perform S304.
S303: Look up in the correspondence the webpage type corresponding to the existing target page feature, take the webpage type found as the webpage type of the webpage to be identified, and end.
S304: Judge whether all target page features in the mapping table have been searched for in the webpage to be identified.
If so, end; otherwise perform S305.
S305: Sort the remaining target page features that have not yet been searched for in the webpage to be identified by information gain, and return to S301, searching the webpage to be identified for the re-sorted target page features in turn until all target page features in the mapping table have been searched.
In other embodiments of the present application, the information gain of the remaining target page features can also be recalculated; the target page features are then sorted by the recalculated information gain, and the webpage to be identified is searched in that new order.
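The loop in S301 to S305 can be sketched as a sequential search that stops at the first feature found. This is a simplified illustration under stated assumptions: the detectors, the feature names, and the flat `type_if_present` table are hypothetical, and the remaining features are simply taken in their existing priority order rather than being re-sorted as in S305.

```python
from typing import Callable, Dict, List, Optional

def identify(
    page: str,
    priority: List[str],
    detectors: Dict[str, Callable[[str], bool]],
    type_if_present: Dict[str, str],
) -> Optional[str]:
    """Return the type mapped to the first feature found, or None if none is found."""
    for feature in priority:                 # S301: highest priority first
        if detectors[feature](page):         # S302: does the page contain it?
            return type_if_present[feature]  # S303: look up the type and end
    return None                              # S304: every feature has been searched

# Hypothetical detectors and correspondence for illustration.
detectors = {
    "many_chapter_links": lambda page: page.count("Chapter") > 10,
    "page_over_1000_words": lambda page: len(page) > 1000,
}
type_if_present = {
    "many_chapter_links": "catalog page",
    "page_over_1000_words": "content page",
}
priority = ["many_chapter_links", "page_over_1000_words"]
```

Because the search returns as soon as one feature is found, lower-priority detectors are not evaluated at all for pages that match a high-priority feature, which is the source of the time saving described in the text.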
This type of webpage recognition methods that the embodiment of the present application provides, first add up the situation that comprises of the sample webpage of a plurality of known web pages types to a plurality of object web page characteristics, obtain the statistics of sample webpage to a plurality of object page features, then utilize decision Tree algorithms analysis, obtain the prioritization of object page feature, and the corresponding relation between object page feature and type of webpage, the prioritization of object page feature is exactly the validity sequence of object page feature identification type of webpage, finally according to prioritization, in webpage to be identified, search successively a plurality of object page features, and according to the corresponding relation between lookup result and object page feature and type of webpage, determine the type of webpage of webpage to be identified.
Compared with prior art, the method can utilize sample webpage to sort to the validity of a plurality of object page features, when identification webpage to be identified, according to sequence, first search the object page feature that validity is higher, then search the object page feature that validity is lower, improve recognition accuracy, and shortened the time that identification expends, improved recognition efficiency.
In addition, when the method is applied on a mobile phone or another mobile terminal, the priority ranking of the object page features and the correspondence between object page features and webpage types can be obtained in advance in the background of the mobile terminal and then stored. When a foreground application on the mobile terminal, for example a browser, needs to identify the webpage type of a webpage to be identified, it can read the stored priority ranking and correspondence directly and identify the webpage type. This reduces the computational complexity of the foreground application and speeds up its identification of the webpage type.
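The split between background precomputation and foreground lookup can be illustrated with a minimal Python sketch. The JSON file layout, the file name, and the function names `save_model` and `load_model` are assumptions made for illustration; the source does not specify a storage format.

```python
import json
import os
import tempfile


def save_model(path, ranking, mapping):
    """Background step: persist the priority ranking and the correspondence."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"ranking": ranking, "mapping": mapping}, fh)


def load_model(path):
    """Foreground step: read the stored ranking and correspondence directly."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return data["ranking"], data["mapping"]


# Hypothetical feature names and webpage types, for illustration only.
ranking = ["has_video_tag", "has_next_link"]
mapping = {"has_video_tag": "video", "has_next_link": "novel"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page_type_model.json")
    save_model(path, ranking, mapping)                   # done once, in the background
    loaded_ranking, loaded_mapping = load_model(path)    # done by the browser
```

Under this arrangement the foreground application never computes entropy or information gain; it only reads two small structures.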
Fig. 7 is a schematic structural diagram of a webpage type identification device provided by an embodiment of the present application.
As shown in Fig. 7, the webpage type identification device comprises:
Statistic unit 1, configured to collect statistics on whether each sample webpage of a plurality of known webpage types contains each of a plurality of object page features, to obtain a statistical result;
Analytic unit 2, configured to analyze the known webpage types and the statistical result of the plurality of sample webpages using a decision tree algorithm, to obtain a priority ranking of the object page features and a correspondence between object page features and webpage types;
Webpage type determining unit 3, configured to search for the object page features in the webpage to be identified in order of the priority ranking, and to determine the webpage type of the webpage to be identified according to the search result and the correspondence.
As shown in Fig. 8, in the embodiment of the present application, statistic unit 1 may comprise:
First judging unit 11, configured to judge, one by one, whether each sample webpage contains each object page feature;
Recording unit 12, configured to record a first characteristic when the sample webpage contains the object page feature, and to record a second characteristic when the sample webpage does not contain the object page feature;
Table construction unit 13, configured to build a table containing the first and second characteristics recorded for all sample webpages, and to use the table as the statistical result.
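The three sub-units above can be sketched in Python as follows. This is only an illustration: the detector functions, the feature names, and the encoding of the first and second characteristics as 1 and 0 are assumptions, not details given in the source.

```python
def build_feature_table(samples, detectors):
    """Build the statistics table.

    samples:   list of (html, known_page_type) pairs
    detectors: dict mapping a feature name to a function html -> bool
    Returns a list of (page_type, {feature_name: 1 or 0}) rows, where 1 stands
    for the first characteristic (feature present) and 0 for the second.
    """
    table = []
    for html, page_type in samples:
        row = {name: 1 if detect(html) else 0 for name, detect in detectors.items()}
        table.append((page_type, row))
    return table


# Hypothetical detectors; real object page features would be more specific.
detectors = {
    "has_next_link": lambda html: "next chapter" in html.lower(),
    "has_video_tag": lambda html: "<video" in html.lower(),
}
samples = [
    ("<a>Next chapter</a>", "novel"),
    ("<video src='a.mp4'></video>", "video"),
]
table = build_feature_table(samples, detectors)
```

Each row keeps the known webpage type next to the recorded characteristics, since the later analysis needs both.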
As shown in Fig. 9, in the embodiment of the present application, analytic unit 2 may comprise:
Information gain computing unit 21, configured to calculate the information gain of each of the plurality of object page features according to the table;
Sequencing unit 22, configured to sort the plurality of object page features in descending order of information gain, to obtain the priority ranking of the object page features;
Correspondence generation unit 23, configured to generate the correspondence between object page features and webpage types according to the known webpage types of the plurality of sample webpages and the priority ranking of the object page features.
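A minimal Python sketch of the sorting and correspondence steps is given below. The information gain values are illustrative inputs, and assigning each feature the most frequent webpage type among the sample webpages that contain it is an assumption; this passage does not state how the correspondence is derived.

```python
from collections import Counter


def rank_features(gains):
    """Sort feature names in descending order of information gain."""
    return sorted(gains, key=gains.get, reverse=True)


def build_correspondence(rows, ranking):
    """Map each ranked feature to a webpage type.

    rows: list of (page_type, {feature_name: 1 or 0}) from the statistics table.
    Assumption: a feature maps to the most frequent type among samples having it.
    """
    mapping = {}
    for feature in ranking:
        types = Counter(t for t, feats in rows if feats.get(feature) == 1)
        if types:
            mapping[feature] = types.most_common(1)[0][0]
    return mapping


rows = [
    ("video", {"has_video_tag": 1, "has_next_link": 0}),
    ("video", {"has_video_tag": 1, "has_next_link": 1}),
    ("novel", {"has_video_tag": 0, "has_next_link": 1}),
    ("novel", {"has_video_tag": 0, "has_next_link": 1}),
]
gains = {"has_next_link": 0.31, "has_video_tag": 1.0}  # illustrative values

ranking = rank_features(gains)
mapping = build_correspondence(rows, ranking)
```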
As shown in Fig. 10, in the embodiment of the present application, information gain computing unit 21 may comprise:
Ratio calculation unit 211, configured to calculate, according to the table, the ratio of the first characteristic and the ratio of the second characteristic for an object page feature;
First information entropy computing unit 212, configured to calculate the information entropy of the first characteristic and of the second characteristic respectively;
Conditional entropy computing unit 213, configured to calculate the conditional entropy of the object page feature according to the information entropy of the first characteristic and of the second characteristic;
Second information entropy computing unit 214, configured to calculate the information entropy of the object page feature according to the table;
Information gain computation subunit 215, configured to subtract the conditional entropy of the object page feature from the information entropy of the object page feature, to obtain the information gain of the object page feature.
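The calculation performed by these sub-units can be sketched in Python, assuming the standard definitions used in ID3-style decision trees: the conditional entropy is the ratio-weighted sum of the entropies of the samples with and without the feature, and the information gain is the overall entropy minus that conditional entropy. The function names and the row layout are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of webpage-type labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, feature):
    """rows: list of (page_type, {feature_name: 1 or 0})."""
    all_types = [t for t, _ in rows]
    present = [t for t, feats in rows if feats[feature] == 1]  # first characteristic
    absent = [t for t, feats in rows if feats[feature] == 0]   # second characteristic
    ratio_present = len(present) / len(rows)
    ratio_absent = len(absent) / len(rows)
    conditional = ratio_present * entropy(present) + ratio_absent * entropy(absent)
    return entropy(all_types) - conditional


rows = [
    ("novel", {"f": 1, "g": 1}),
    ("novel", {"f": 1, "g": 0}),
    ("video", {"f": 0, "g": 1}),
    ("video", {"f": 0, "g": 0}),
]
gain_f = information_gain(rows, "f")  # "f" separates the two types perfectly
gain_g = information_gain(rows, "g")  # "g" carries no information about the type
```

In this toy table, feature `f` would be ranked ahead of `g`, because knowing whether `f` is present removes all uncertainty about the webpage type.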
As shown in Fig. 10, in the embodiment of the present application, webpage type determining unit 3 may comprise:
Object page feature search unit 31, configured to search the webpage to be identified for the object page feature with the highest priority;
Second judging unit 32, configured to judge whether the object page feature with the highest priority is present in the webpage to be identified;
Webpage type lookup unit 33, configured to, when the object page feature with the highest priority is present in the webpage to be identified, look up in the correspondence the webpage type corresponding to that object page feature, and to use the webpage type found as the webpage type of the webpage to be identified.
When the object page feature with the highest priority is not present in the webpage to be identified, sequencing unit 22 may further sort, by information gain, the remaining object page features that have not yet been searched for in the webpage to be identified. Object page feature search unit 31 then searches the webpage to be identified for the other object page features in descending order of priority, until the webpage type of the webpage to be identified is found or until all object page features in the mapping table have been searched.
In addition, in other embodiments of the present application, when the object page feature with the highest priority is not present in the webpage to be identified, analytic unit 2 may recalculate the information gain of the remaining object page features, sort them by the recalculated information gain, and determine the correspondence between object page features and webpage types. Object page feature search unit 31 then searches the webpage to be identified for the re-sorted object page features in turn, until all object page features in the mapping table have been searched.
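The lookup performed by webpage type determining unit 3 can be sketched as a loop over the priority ranking. The sketch below covers only the variant in which the remaining features keep their original order; the detector functions and names are hypothetical, and returning `None` when no feature matches is an assumption.

```python
def identify_page_type(html, ranking, detectors, mapping):
    """Check features in priority order and return the first matching type.

    ranking:   feature names, highest priority first
    detectors: dict mapping a feature name to a function html -> bool
    mapping:   dict mapping a feature name to a webpage type
    Returns None when every feature in the mapping table has been tried.
    """
    for feature in ranking:
        if detectors[feature](html):
            page_type = mapping.get(feature)
            if page_type is not None:
                return page_type
    return None


detectors = {
    "has_video_tag": lambda html: "<video" in html.lower(),
    "has_next_link": lambda html: "next chapter" in html.lower(),
}
ranking = ["has_video_tag", "has_next_link"]
mapping = {"has_video_tag": "video", "has_next_link": "novel"}

result_video = identify_page_type("<video src='a.mp4'></video>", ranking, detectors, mapping)
result_novel = identify_page_type("<a>Next chapter</a>", ranking, detectors, mapping)
result_unknown = identify_page_type("<p>hello</p>", ranking, detectors, mapping)
```

Because the most effective features are checked first, many pages are classified after one or two checks, which is the source of the time saving described in the text.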
Compared with the prior art, the device uses sample webpages to rank a plurality of object page features by effectiveness. When identifying a webpage, it searches for the more effective object page features first and the less effective ones afterwards, which improves recognition accuracy, shortens the time spent on identification, and improves recognition efficiency.
It will be understood that the present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The present invention can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present invention can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device comprising that element.
The above are only specific embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.

Claims (10)

1. A webpage type identification method, characterized by comprising:
Collecting statistics on whether each sample webpage of a plurality of known webpage types contains each of a plurality of object page features, to obtain a statistical result;
Analyzing the known webpage types and the statistical result of the plurality of sample webpages using a decision tree algorithm, to obtain a priority ranking of the object page features and a correspondence between object page features and webpage types;
Searching for the object page features in a webpage to be identified in order of the priority ranking, and determining the webpage type of the webpage to be identified according to the search result and the correspondence.
2. The method according to claim 1, characterized in that the step of collecting statistics on whether each sample webpage of a plurality of known webpage types contains each of a plurality of object page features, to obtain a statistical result, comprises:
Judging, one by one, whether each sample webpage contains each object page feature;
Recording a first characteristic when the sample webpage contains the object page feature, and recording a second characteristic when the sample webpage does not contain the object page feature;
Building a table containing the first and second characteristics recorded for all sample webpages, and using the table as the statistical result.
3. The method according to claim 2, characterized in that the step of analyzing the known webpage types and the statistical result of the plurality of sample webpages using a decision tree algorithm, to obtain a priority ranking of the object page features and a correspondence between object page features and webpage types, comprises:
Calculating the information gain of each of the plurality of object page features according to the table;
Sorting the plurality of object page features in descending order of information gain, to obtain the priority ranking of the object page features;
Generating the correspondence between object page features and webpage types according to the known webpage types of the plurality of sample webpages and the priority ranking of the object page features.
4. The method according to claim 3, characterized in that the information gain of each object page feature is calculated in the following manner:
Calculating, according to the table, the ratio of the first characteristic and the ratio of the second characteristic for the object page feature;
Calculating the information entropy of the first characteristic and of the second characteristic respectively;
Calculating the conditional entropy of the object page feature according to the information entropy of the first characteristic and of the second characteristic;
Calculating the information entropy of the object page feature according to the table;
Subtracting the conditional entropy of the object page feature from the information entropy of the object page feature, to obtain the information gain of the object page feature.
5. The method according to any one of claims 1-4, characterized in that the step of searching for the object page features in the webpage to be identified in order of the priority ranking, and determining the webpage type of the webpage to be identified according to the search result and the correspondence, comprises:
Searching the webpage to be identified for the object page feature with the highest priority;
Judging whether the object page feature with the highest priority is present in the webpage to be identified;
When the object page feature with the highest priority is present in the webpage to be identified, looking up in the correspondence the webpage type corresponding to that object page feature, and using the webpage type found as the webpage type of the webpage to be identified;
When the object page feature with the highest priority is not present in the webpage to be identified, searching the webpage to be identified for the other object page features in descending order of priority, until the webpage type of the webpage to be identified is found or until all object page features in the mapping table have been searched.
6. A webpage type identification device, characterized by comprising:
A statistic unit, configured to collect statistics on whether each sample webpage of a plurality of known webpage types contains each of a plurality of object page features, to obtain a statistical result;
An analytic unit, configured to analyze the known webpage types and the statistical result of the plurality of sample webpages using a decision tree algorithm, to obtain a priority ranking of the object page features and a correspondence between object page features and webpage types;
A webpage type determining unit, configured to search for the object page features in a webpage to be identified in order of the priority ranking, and to determine the webpage type of the webpage to be identified according to the search result and the correspondence.
7. The device according to claim 6, characterized in that the statistic unit comprises:
A first judging unit, configured to judge, one by one, whether each sample webpage contains each object page feature;
A recording unit, configured to record a first characteristic when the sample webpage contains the object page feature, and to record a second characteristic when the sample webpage does not contain the object page feature;
A table construction unit, configured to build a table containing the first and second characteristics recorded for all sample webpages, and to use the table as the statistical result.
8. The device according to claim 7, characterized in that the analytic unit comprises:
An information gain computing unit, configured to calculate the information gain of each of the plurality of object page features according to the table;
A sequencing unit, configured to sort the plurality of object page features in descending order of information gain, to obtain the priority ranking of the object page features;
A correspondence generation unit, configured to generate the correspondence between object page features and webpage types according to the known webpage types of the plurality of sample webpages and the priority ranking of the object page features.
9. The device according to claim 8, characterized in that the information gain computing unit comprises:
A ratio calculation unit, configured to calculate, according to the table, the ratio of the first characteristic and the ratio of the second characteristic for an object page feature;
A first information entropy computing unit, configured to calculate the information entropy of the first characteristic and of the second characteristic respectively;
A conditional entropy computing unit, configured to calculate the conditional entropy of the object page feature according to the information entropy of the first characteristic and of the second characteristic;
A second information entropy computing unit, configured to calculate the information entropy of the object page feature according to the table;
An information gain computation subunit, configured to subtract the conditional entropy of the object page feature from the information entropy of the object page feature, to obtain the information gain of the object page feature.
10. The device according to any one of claims 6-9, characterized in that the webpage type determining unit comprises:
An object page feature search unit, configured to search the webpage to be identified for the object page feature with the highest priority;
A second judging unit, configured to judge whether the object page feature with the highest priority is present in the webpage to be identified;
A webpage type lookup unit, configured to, when the object page feature with the highest priority is present in the webpage to be identified, look up in the correspondence the webpage type corresponding to that object page feature, and to use the webpage type found as the webpage type of the webpage to be identified;
When the object page feature with the highest priority is not present in the webpage to be identified, the object page feature search unit further searches the webpage to be identified for the other object page features in descending order of priority, until the webpage type of the webpage to be identified is found or until all object page features in the mapping table have been searched.
CN201310476416.6A 2013-10-12 2013-10-12 Webpage type identification method and device Active CN103577547B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310476416.6A CN103577547B (en) 2013-10-12 2013-10-12 Webpage type identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310476416.6A CN103577547B (en) 2013-10-12 2013-10-12 Webpage type identification method and device

Publications (2)

Publication Number Publication Date
CN103577547A true CN103577547A (en) 2014-02-12
CN103577547B CN103577547B (en) 2017-11-10

Family

ID=50049323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310476416.6A Active CN103577547B (en) 2013-10-12 2013-10-12 Webpage type identification method and device

Country Status (1)

Country Link
CN (1) CN103577547B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080244715A1 (en) * 2007-03-27 2008-10-02 Tim Pedone Method and apparatus for detecting and reporting phishing attempts
CN101251855A (en) * 2008-03-27 2008-08-27 腾讯科技(深圳)有限公司 Equipment, system and method for cleaning internet web page
CN101650715A (en) * 2008-08-12 2010-02-17 厦门市美亚柏科信息股份有限公司 Method and device for screening links on web pages
CN101534306A (en) * 2009-04-14 2009-09-16 深圳市腾讯计算机系统有限公司 Detecting method and a device for fishing website
CN101794311A (en) * 2010-03-05 2010-08-04 南京邮电大学 Fuzzy data mining based automatic classification method of Chinese web pages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王宏威: "《基于决策树的分类算法研究》", 《软件导论》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294485A (en) * 2015-06-05 2017-01-04 华为技术有限公司 Determine the method and device in notable place
CN106294485B (en) * 2015-06-05 2019-11-01 华为技术有限公司 Determine the method and device in significant place
CN108345599A (en) * 2017-01-23 2018-07-31 阿里巴巴集团控股有限公司 Type of webpage determines method, apparatus and computer-readable medium
CN108345599B (en) * 2017-01-23 2021-12-14 阿里巴巴集团控股有限公司 Webpage type determination method and device and computer readable medium
CN109559141A (en) * 2017-09-27 2019-04-02 北京国双科技有限公司 A kind of automatic classification method, the apparatus and system of intention pattern
CN108228463A (en) * 2018-01-10 2018-06-29 百度在线网络技术(北京)有限公司 For detecting the method and apparatus of initial screen time
CN110750739A (en) * 2018-07-04 2020-02-04 北京国双科技有限公司 Page type determination method and device
CN110336835A (en) * 2019-08-05 2019-10-15 深信服科技股份有限公司 Detection method, user equipment, storage medium and the device of malicious act
CN110336835B (en) * 2019-08-05 2021-10-19 深信服科技股份有限公司 Malicious behavior detection method, user equipment, storage medium and device

Also Published As

Publication number Publication date
CN103577547B (en) 2017-11-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200527

Address after: 310051 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 100080, room 16, building 10-20, Building 29, Haidian District, Suzhou Street, Beijing

Patentee before: UC MOBILE Ltd.