CN103577547B - Webpage type identification method and device - Google Patents
Webpage type identification method and device
- Publication number: CN103577547B
- Application number: CN201310476416.6A
- Authority: CN (China)
- Legal status: Active (the listed status is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
Abstract
The invention discloses a webpage type identification method and device. The method includes: for each of a plurality of sample web pages of known webpage type, recording whether the page contains each of a plurality of purpose page features, to obtain a statistical result; analyzing the known webpage types and the statistical result of the sample web pages with a decision tree algorithm, to obtain a priority ranking of the purpose page features and a correspondence between purpose page features and webpage types; and looking up the purpose page features in a webpage to be identified, one after another in the order of the priority ranking, and determining the webpage type of the webpage to be identified from the lookup result and the correspondence. Compared with the prior art, the method uses the sample web pages to rank the purpose page features by effectiveness, so that when a webpage is identified the more effective features are looked up first and the less effective ones afterwards, which shortens the time spent on identification and improves recognition efficiency.
Description
Technical field
The present invention relates to the field of mobile communications, and more particularly to a webpage type identification method and device.
Background art
A novel reader is software that provides novel reading and download functions. It not only supports reading and downloading local novels, but typically also supports downloading, reading, and searching online novels. Online novels are downloaded or read from the novel web pages on the internet: the novel text on these pages is extracted and then reassembled into a suitable format for presentation to the user. Because the extraction algorithm used for the catalogue page of an online novel differs from the one used for its content page, it is usually necessary first to determine the webpage type of the novel page and then to extract it with the extraction algorithm that corresponds to that type.
At present there are two methods of identifying the webpage type: identification based on a white list and identification based on page keywords. The white-list-based method adds each target website on the internet to a white list and uses a different recognition algorithm for the page features of each website in the list. For example, novel websites such as Starting Point Net (Qidian) and I Read Net each have their own page layout, so a recognition algorithm is designed in advance for each website according to its layout characteristics and is used to distinguish the webpage types of the novels on that website. The page-keyword-based method identifies the webpage type according to whether the page contains a keyword that distinguishes catalogue pages from content pages; for example, if a webpage contains "set font", the current webpage is taken to be a content page.
Both the white-list-based method and the page-keyword-based method have shortcomings. The white-list-based method often cannot accurately identify the webpage type of pages that have not been added to the white list; moreover, because the number of web pages on the internet is huge and websites keep being added, the number of entries in the white list keeps growing, which makes maintenance very costly. As for the page-keyword-based method, web pages differ greatly from one another, so a keyword that distinguishes webpage types may not apply to all web pages; the method therefore often fails to identify the webpage type accurately.
Summary of the invention
The embodiments of the present invention provide a webpage type identification method and device, to solve the problem in the prior art that the webpage type cannot be identified accurately.
To solve the above technical problem, in a first aspect an embodiment of the present invention discloses a webpage type identification method, including: for each of a plurality of sample web pages of known webpage type, recording whether the page contains each of a plurality of purpose page features, to obtain a statistical result; analyzing the known webpage types and the statistical result of the sample web pages with a decision tree algorithm, to obtain a priority ranking of the purpose page features and a correspondence between purpose page features and webpage types; and looking up the purpose page features in a webpage to be identified, one after another in the order of the priority ranking, and determining the webpage type of the webpage to be identified from the lookup result and the correspondence.
In a first possible implementation of the first aspect, the step of recording, for each of the sample web pages of known webpage type, whether the page contains each of the purpose page features, to obtain the statistical result, includes: judging, one by one, whether each sample web page contains each purpose page feature; when the sample web page contains the purpose page feature, recording a first feature; when the sample web page does not contain the purpose page feature, recording a second feature; and building a table containing the first features and second features recorded for all sample web pages, and using the table as the statistical result.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the step of analyzing the known webpage types and the statistical result of the sample web pages with a decision tree algorithm, to obtain the priority ranking of the purpose page features and the correspondence between purpose page features and webpage types, includes: calculating the information gain of each of the purpose page features from the table; sorting the purpose page features in descending order of information gain, to obtain the priority ranking of the purpose page features; and generating the correspondence between purpose page features and webpage types from the known webpage types of the sample web pages and the priority ranking of the purpose page features.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the information gain of each purpose page feature is calculated as follows: calculating, from the table, the ratios corresponding to the first feature and the ratios corresponding to the second feature of the purpose page feature; calculating the information entropy of the first feature and of the second feature respectively; calculating the conditional entropy of the purpose page feature from the information entropies of the first feature and the second feature; calculating the information entropy of the purpose page feature from the table; and subtracting the conditional entropy of the purpose page feature from its information entropy to obtain the information gain of the purpose page feature.
With reference to the first aspect, or the first, second, or third possible implementation of the first aspect, the step of looking up the purpose page features in the webpage to be identified, one after another in the order of the priority ranking, and determining the webpage type of the webpage to be identified from the lookup result and the correspondence, includes: looking up the highest-priority purpose page feature in the webpage to be identified; judging whether the highest-priority purpose page feature is present in the webpage to be identified; when it is present, looking up in the correspondence the webpage type that corresponds to the feature found, and taking that webpage type as the webpage type of the webpage to be identified; and when it is not present, looking up the other purpose page features in the webpage to be identified, one after another in descending order of priority, until the webpage type of the webpage to be identified is found or until all purpose page features in the mapping table have been looked up.
In a second aspect, an embodiment of the present invention discloses a webpage type identification device, including: a statistic unit, configured to record, for each of a plurality of sample web pages of known webpage type, whether the page contains each of a plurality of purpose page features, to obtain a statistical result; an analysis unit, configured to analyze the known webpage types and the statistical result of the sample web pages with a decision tree algorithm, to obtain a priority ranking of the purpose page features and a correspondence between purpose page features and webpage types; and a webpage type determining unit, configured to look up the purpose page features in a webpage to be identified, one after another in the order of the priority ranking, and to determine the webpage type of the webpage to be identified from the lookup result and the correspondence.
In a first possible implementation of the second aspect, the statistic unit includes: a first judging unit, configured to judge, one by one, whether each sample web page contains each purpose page feature; a recording unit, configured to record a first feature when the sample web page contains the purpose page feature, and to record a second feature when the sample web page does not contain the purpose page feature; and a table building unit, configured to build a table containing the first features and second features recorded for all sample web pages, and to use the table as the statistical result.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the analysis unit includes: an information gain computing unit, configured to calculate the information gain of each of the purpose page features from the table; a sorting unit, configured to sort the purpose page features in descending order of information gain, to obtain the priority ranking of the purpose page features; and a correspondence generation unit, configured to generate the correspondence between purpose page features and webpage types from the known webpage types of the sample web pages and the priority ranking of the purpose page features.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the information gain computing unit includes: a ratio calculation unit, configured to calculate, from the table, the ratios corresponding to the first feature and the ratios corresponding to the second feature of the purpose page feature; a first information entropy computing unit, configured to calculate the information entropy of the first feature and of the second feature respectively; a conditional entropy computing unit, configured to calculate the conditional entropy of the purpose page feature from the information entropies of the first feature and the second feature; a second information entropy computing unit, configured to calculate the information entropy of the purpose page feature from the table; and an information gain computation subunit, configured to subtract the conditional entropy of the purpose page feature from its information entropy to obtain the information gain of the purpose page feature.
With reference to the second aspect, or the first, second, or third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the webpage type determining unit includes: a purpose page feature lookup unit, configured to look up the highest-priority purpose page feature in the webpage to be identified; a second judging unit, configured to judge whether the highest-priority purpose page feature is present in the webpage to be identified; and a webpage type lookup unit, configured to, when the highest-priority purpose page feature is present in the webpage to be identified, look up in the correspondence the webpage type that corresponds to the feature found, and take that webpage type as the webpage type of the webpage to be identified. When the highest-priority purpose page feature is not present in the webpage to be identified, the purpose page feature lookup unit further looks up the other purpose page features in the webpage to be identified, one after another in descending order of priority, until the webpage type of the webpage to be identified is found or until all purpose page features in the mapping table have been looked up.
As can be seen from the above technical solutions, the webpage type identification method provided by the embodiments of the present application first records which of a plurality of purpose page features each sample web page of known webpage type contains, to obtain a statistical result of the sample web pages over the purpose page features. It then analyzes the result with a decision tree algorithm, to obtain a priority ranking of the purpose page features and a correspondence between purpose page features and webpage types; the priority ranking is the ranking of the purpose page features by how effectively they identify the webpage type. Finally, it looks up the purpose page features in the webpage to be identified, one after another in the order of the priority ranking, and determines the webpage type of the webpage to be identified from the lookup result and the correspondence between purpose page features and webpage types.
Compared with the prior art, the method uses the sample web pages to rank the purpose page features by effectiveness. When a webpage is identified, the more effective purpose page features are looked up first and the less effective ones afterwards, which improves recognition accuracy, shortens the time spent on identification, and improves recognition efficiency.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings of the embodiments are briefly described below. Obviously, a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a webpage type identification method provided by an embodiment of the present application;
Fig. 2 is a detailed schematic flowchart of S100 provided by an embodiment of the present application;
Fig. 3 is a detailed schematic flowchart of S200 provided by an embodiment of the present application;
Fig. 4 is a detailed schematic flowchart of S201 provided by an embodiment of the present application;
Fig. 5 is a visualization of the final correspondence between page features and webpage types obtained in an embodiment of the present application;
Fig. 6 is a detailed schematic flowchart of S300 provided by an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a webpage type identification device provided by an embodiment of the present application;
Fig. 8 is a schematic structural diagram of the statistic unit provided by an embodiment of the present application;
Fig. 9 is a schematic structural diagram of the analysis unit provided by an embodiment of the present application;
Fig. 10 is a schematic structural diagram of the information gain computing unit provided by an embodiment of the present application;
Fig. 11 is a schematic structural diagram of the webpage type determining unit provided by an embodiment of the present application.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions in the embodiments of the present invention, and to make the above objects, features, and advantages of the embodiments clearer, the technical solutions in the embodiments of the present invention are described in further detail below with reference to the drawings.
Fig. 1 is a schematic flowchart of a webpage type identification method provided by an embodiment of the present application. The method includes the following steps:
S100: For each of a plurality of sample web pages of known webpage type, record whether the page contains each of a plurality of purpose page features, to obtain a statistical result.
The sample web pages of known webpage type may be pages selected at random from novel websites, and the webpage types of the sample web pages may include novel content pages, novel catalogue pages, and so on. A purpose page feature is a feature contained in the sample web pages. Multiple purpose page features may be extracted from the sample web pages according to the number of words in the page, characteristic keywords, or both; in addition, multiple page features input by a user may be received. In other embodiments of the present application, purpose page features may also be chosen according to other parameters or obtained in other ways, which are not enumerated here.
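As an illustration of the paragraph above, the following is a minimal sketch of how boolean purpose page features might be evaluated on one page. The `Page` structure, the function names, and the regular expression are hypothetical; the thresholds (1000 words, 10 chapter links) and the "return to catalogue" keyword are taken from the examples used later in this embodiment, and the word count is approximated by counting non-whitespace characters.

```python
import re
from typing import Callable, Dict, List, NamedTuple


class Page(NamedTuple):
    """Hypothetical page representation: visible text plus the text of each link."""
    text: str
    link_texts: List[str]


# Hypothetical pattern for chapter links such as "Chapter 12".
CHAPTER_LINK = re.compile(r"Chapter\s*\d+")


def over_1000_words(page: Page) -> bool:
    # Assumption: "words" are approximated by non-whitespace characters.
    return len("".join(page.text.split())) > 1000


def many_chapter_links(page: Page) -> bool:
    # True when more than 10 links look like chapter links.
    return sum(1 for t in page.link_texts if CHAPTER_LINK.search(t)) > 10


def has_return_to_catalogue_link(page: Page) -> bool:
    return any("return to catalogue" in t.lower() for t in page.link_texts)


FEATURES: Dict[str, Callable[[Page], bool]] = {
    "over_1000_words": over_1000_words,
    "many_chapter_links": many_chapter_links,
    "return_to_catalogue_link": has_return_to_catalogue_link,
}


def extract_features(page: Page) -> Dict[str, int]:
    """Return 1 (feature present) or 0 (feature absent) for every feature."""
    return {name: int(check(page)) for name, check in FEATURES.items()}
```

Keeping the features in a name-to-predicate dictionary makes it easy to add user-supplied features, which the embodiment also allows.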
In this embodiment, as shown in Fig. 2, this step may include the following steps:
S101: Judge, one by one, whether each sample web page contains each purpose page feature.
For each sample web page, determine whether it contains each purpose page feature. When the sample web page contains a given purpose page feature, go to S102; when it does not, go to S103.
S102: Record a first feature.
S103: Record a second feature.
The first feature and the second feature are used to distinguish whether a sample web page contains a given purpose page feature, so the first feature must differ from the second feature. In this embodiment the first feature may be 1 and the second feature may be 0. Using numerical values for this purpose is only a preferred embodiment of the present application; in other embodiments other means may be used to distinguish whether a sample web page contains a given purpose page feature, for example different letters, or different high and low level signals, for the first feature and the second feature.
S104: Build a table containing the first features and second features recorded for all sample web pages, and use the table as the statistical result.
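Steps S101 to S104 can be sketched as follows. This is only an illustration under assumptions: the function name `build_table`, the column name `page_type`, and the representation of a sample as a set of present feature names are all hypothetical, and the first and second features are encoded as 1 and 0 as in the preferred embodiment.

```python
from typing import Dict, List, Set, Tuple


def build_table(
    samples: List[Tuple[Set[str], int]], feature_names: List[str]
) -> List[Dict[str, int]]:
    """Build the statistical result: one row per sample web page.

    Each sample is (set of feature names present in the page, known page type),
    with page type 1 for a content page and 0 for a catalogue page.
    """
    table = []
    for present, page_type in samples:
        # S102 / S103: record 1 if the feature is present, otherwise 0.
        row = {name: (1 if name in present else 0) for name in feature_names}
        row["page_type"] = page_type  # known type stored as the last column
        table.append(row)
    return table
```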
Table 1 shows an example of the statistical result for 24 sample web pages provided by an embodiment of the present application. In this embodiment, the webpage type of each sample web page is added to the statistical result as the last column; it is represented by 1 when the sample web page is a content page and by 0 when it is a catalogue page.
Table 1
S200: Analyze the known webpage types and the statistical result of the sample web pages with a decision tree algorithm, to obtain a priority ranking of the purpose page features and a correspondence between purpose page features and webpage types.
In this embodiment, as shown in Fig. 3, this step may include the following steps:
S201: Calculate the information gain of each of the purpose page features from the table.
As shown in Fig. 4, the information gain of each purpose page feature may be calculated as follows:
S2011: Calculate, from the table, the ratios corresponding to the first feature and the ratios corresponding to the second feature of the purpose page feature.
A ratio of the first feature is the probability that a sample web page having the first feature is of a given webpage type. In this embodiment, taking Table 1 above as an example, the purpose page feature "the page has more than 1000 words" can correspond to two webpage types: type 1 (content page) and type 0 (catalogue page). The first feature (value 1) occurs 13 times; of these, 12 samples have webpage type 1, a probability of 12/13, and 1 sample has webpage type 0, a probability of 1/13. The first feature therefore has two ratios: 12/13 and 1/13.
S2012: Calculate the information entropy of the first feature and of the second feature respectively.
The amount of information contained in an information source is the average uncertainty of all the messages the source may emit; this amount is called the information entropy. Suppose an information source has n messages and that one of them, x, occurs with probability p_x. The information content of message x is then:
$$I_x = -\log(p_x) \tag{1}$$
The unit of information depends on the base of the logarithm: bit for base 2, nat for base e, and hart for base 10.
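Formula (1) and the choice of unit can be tried out with a few lines of Python. The function name `information_content` is hypothetical; `math.log(x, base)` is the standard-library logarithm with an explicit base.

```python
import math


def information_content(p: float, base: float = 2.0) -> float:
    """Information content I = -log(p) of a message with probability p.

    base=2 gives bits, base=math.e gives nats, base=10 gives harts.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p, base)


bits = information_content(1 / 13)            # a rare message carries more information
nats = information_content(1 / 13, math.e)
harts = information_content(1 / 13, 10)
```

The three results describe the same quantity in different units; they differ only by the constant factor that converts between logarithm bases.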
The information entropy is calculated as:
$$H = \sum_{i=1}^{n} p_{x_i} I_{x_i} = -\sum_{i=1}^{n} p_{x_i} \log(p_{x_i}) \tag{2}$$
where n is the number of messages contained in the information source; the messages are x_1, x_2, ..., x_i, ..., x_n; x_i is the i-th message; p_{x_i} is the probability that the i-th message x_i occurs; and I_{x_i} is the information content of the i-th message x_i.
It can be seen that the information entropy of an information source is the sum, over all its messages, of each message's probability multiplied by its information content. In this embodiment, for "the page has more than 1000 words" the first feature is represented by 1 and the second feature by 0, so the information entropy of the first feature 1 is:
$$H(1) = -\tfrac{12}{13}\log\left(\tfrac{12}{13}\right) - \tfrac{1}{13}\log\left(\tfrac{1}{13}\right) \tag{3}$$
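Correspondingly, the second feature has two ratios, 11/11 and 0, so the information entropy of the second feature 0 is:
$$H(0) = -\tfrac{11}{11}\log\left(\tfrac{11}{11}\right) - 0 \cdot \log(0) = 0 \tag{4}$$
where the term 0·log(0) is taken to be 0 by convention.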
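The two entropies above can be checked numerically. The sketch below assumes base-2 logarithms (the text leaves the base open) and uses a hypothetical helper `entropy` that takes the page-type counts of one branch; a zero count is skipped, which implements the convention 0·log(0) = 0.

```python
import math
from typing import Sequence


def entropy(counts: Sequence[int]) -> float:
    """Information entropy, in bits, of a distribution given as counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:  # a zero count contributes nothing (0 * log 0 is taken as 0)
            p = c / total
            h -= p * math.log2(p)
    return h


# First feature (value 1): 12 content pages and 1 catalogue page.
h1 = entropy([12, 1])
# Second feature (value 0): 0 content pages and 11 catalogue pages.
h0 = entropy([0, 11])
```

With these counts, `h1` is about 0.391 bit and `h0` is 0, because every page without the feature is of the same type.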
S2013: Calculate the conditional entropy of the purpose page feature from the information entropies of the first feature and the second feature.
For the first feature, the conditional entropy equals the information entropy of the first feature multiplied by the proportion of all sample web pages that have the first feature.
In this embodiment, therefore, the conditional entropy of the first feature is:
$$K(1) = H(1) \cdot \tfrac{13}{24} = \left[-\tfrac{12}{13}\log\left(\tfrac{12}{13}\right) - \tfrac{1}{13}\log\left(\tfrac{1}{13}\right)\right] \cdot \tfrac{13}{24} \tag{5}$$
Similarly, the conditional entropy of the second feature is:
$$K(0) = H(0) \cdot \tfrac{11}{24} = \left[-\tfrac{11}{11}\log\left(\tfrac{11}{11}\right) - 0 \cdot \log(0)\right] \cdot \tfrac{11}{24} = 0 \tag{6}$$
The conditional entropy of the purpose page feature equals the sum of the conditional entropy of its first feature and the conditional entropy of its second feature, K(1) + K(0), so the conditional entropy of the purpose page feature can be calculated.
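The weighted sum K(1) + K(0) described above can be sketched as follows, again assuming base-2 logarithms. The function names and the dictionary layout (feature value mapped to page-type counts) are hypothetical.

```python
import math
from typing import Dict, Sequence


def branch_entropy(counts: Sequence[int]) -> float:
    """Entropy, in bits, of the page types within one feature value."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def conditional_entropy(branches: Dict[int, Sequence[int]]) -> float:
    """Sum over feature values of (share of samples) * (entropy of that branch)."""
    total = sum(sum(counts) for counts in branches.values())
    return sum(
        (sum(counts) / total) * branch_entropy(counts)
        for counts in branches.values()
    )


# Counts are [content pages, catalogue pages] for feature value 1 and 0.
k = conditional_entropy({1: [12, 1], 0: [0, 11]})
```

For the example counts, `k` is about 0.212 bit; all of it comes from the first-feature branch, since the second-feature branch has zero entropy.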
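S2014: Calculate the information entropy of the purpose page feature from the table.
Taking Table 1 as an example, 12 of the 24 sample web pages are content pages and 12 are catalogue pages, so the information entropy for the purpose page feature "the page has more than 1000 words" is:
$$H(\text{the page has more than 1000 words}) = -\tfrac{12}{24}\log\left(\tfrac{12}{24}\right) - \tfrac{12}{24}\log\left(\tfrac{12}{24}\right) \tag{7}$$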
S2015: Subtract the conditional entropy of the purpose page feature from its information entropy to obtain the information gain of the purpose page feature.
In this embodiment, taking the purpose page feature "the page has more than 1000 words" as an example, the corresponding information gain is:
$$-\tfrac{12}{24}\log\left(\tfrac{12}{24}\right) - \tfrac{12}{24}\log\left(\tfrac{12}{24}\right) - \left[-\tfrac{12}{13}\log\left(\tfrac{12}{13}\right) - \tfrac{1}{13}\log\left(\tfrac{1}{13}\right)\right] \cdot \tfrac{13}{24} - \left[-\tfrac{11}{11}\log\left(\tfrac{11}{11}\right) - 0 \cdot \log(0)\right] \cdot \tfrac{11}{24}$$
The above calculation gives the information gain of one purpose page feature. The other purpose page features are calculated in the same way, as shown in Fig. 4, which finally gives the information gain of every page feature.
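The whole of S2011 to S2015 can be sketched as one function over the table rows. This is a minimal illustration, not the patented implementation: the function names, the row layout, and the `page_type` column name are hypothetical, base-2 logarithms are assumed, and the rows are reconstructed from the counts quoted in the text (Table 1 itself is not reproduced here).

```python
import math
from collections import Counter
from typing import Dict, List


def entropy(labels: List[int]) -> float:
    """Entropy, in bits, of a list of page-type labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows: List[Dict[str, int]], feature: str,
                     label: str = "page_type") -> float:
    """Entropy of the page types minus the conditional entropy given the feature."""
    total_entropy = entropy([r[label] for r in rows])
    conditional = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[label] for r in rows if r[feature] == value]
        conditional += len(subset) / len(rows) * entropy(subset)
    return total_entropy - conditional


# Rows reconstructed from the counts in the text: 13 pages with the feature
# (12 content, 1 catalogue) and 11 pages without it (all catalogue).
rows = (
    [{"over_1000_words": 1, "page_type": 1}] * 12
    + [{"over_1000_words": 1, "page_type": 0}] * 1
    + [{"over_1000_words": 0, "page_type": 0}] * 11
)
gain = information_gain(rows, "over_1000_words")
```

Under these assumptions the gain for "the page has more than 1000 words" comes out at roughly 0.788 bit (1 bit of total entropy minus about 0.212 bit of conditional entropy).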
S202: Sort the purpose page features in descending order of information gain, to obtain the priority ranking of the purpose page features.
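S202 amounts to a descending sort on the information gain. In the sketch below the function name is hypothetical, and only the value for `over_1000_words` follows from the counts in the text (base 2); the other two gains are made-up illustrative numbers.

```python
from typing import Dict, List


def rank_features(gains: Dict[str, float]) -> List[str]:
    """Return feature names sorted by information gain, largest first."""
    return sorted(gains, key=lambda name: gains[name], reverse=True)


# Illustrative gains: only the first value is derived from the example counts.
example_gains = {
    "over_1000_words": 0.788,
    "many_chapter_links": 0.85,        # hypothetical value
    "return_to_catalogue_link": 0.05,  # hypothetical value
}
priority = rank_features(example_gains)
```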
S203: Generate the correspondence between purpose page features and webpage types from the known webpage types of the sample web pages and the priority ranking of the purpose page features.
The correspondence between page features and webpage types may map a single page feature to a webpage type, or may map a combination of several page features to a webpage type. In this embodiment the final correspondence between page features and webpage types is:
{'more than 10 links whose content includes "Chapter xx"': {'0': {'the page has more than 1000 words': {'0': 'catalogue page', '1': 'content page'}}, '1': 'catalogue page'}};
Fig. 5 is a visualization of the final correspondence between page features and webpage types obtained in this embodiment. As Fig. 5 shows, the final mapping table in this embodiment does not contain the purpose page feature "there is a link whose content includes 'return to catalogue'", because this purpose page feature is not sufficient for judging the page type and is therefore an invalid feature. To avoid interference, invalid features can be deleted from the mapping table.
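The nested correspondence above has the shape of a small decision tree, and it can be walked with a short loop. The sketch below is an assumption about how such a structure might be used: the key names are shortened, hypothetical identifiers for the two features, and `classify` is a hypothetical helper.

```python
from typing import Dict, Union

Tree = Union[str, Dict[str, Dict[str, "Tree"]]]

# Same shape as the correspondence in the text, with shortened feature names.
MAPPING: Tree = {
    "more_than_10_chapter_links": {
        "0": {"over_1000_words": {"0": "catalogue page", "1": "content page"}},
        "1": "catalogue page",
    }
}


def classify(tree: Tree, features: Dict[str, int]) -> str:
    """Follow the branch matching each feature value until a page type is reached."""
    node = tree
    while isinstance(node, dict):
        feature_name, branches = next(iter(node.items()))
        node = branches[str(features[feature_name])]
    return node
```

For example, a page with many chapter links is classified as a catalogue page straight away, while a page without them is classified by its word count.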
S300: Look up the purpose page features in the webpage to be identified, one after another in the order of the priority ranking, and determine the webpage type of the webpage to be identified from the lookup result and the correspondence.
In this embodiment, as shown in Fig. 6, this step may include the following steps:
S301: Look up the highest-priority purpose page feature in the webpage to be identified.
The larger the information gain of a purpose page feature, the more accurately it determines the page type. In this embodiment, therefore, the purpose page features are looked up one after another in descending order of priority.
S302: Judge whether the highest-priority purpose page feature is present in the webpage to be identified.
When the highest-priority purpose page feature is present in the webpage to be identified, go to S303; when it is not present, go to S304.
S303: Look up in the correspondence the webpage type that corresponds to the purpose page feature found, take that webpage type as the webpage type of the webpage to be identified, and end.
S304: Judge whether all purpose page features in the mapping table have been looked up in the webpage to be identified.
If so, end; otherwise perform S305.
S305: Sort the remaining purpose page features that have not yet been looked up in the webpage to be identified by information gain, and return to S301, looking up the re-sorted purpose page features one after another in the webpage to be identified until all purpose page features in the mapping table have been looked up.
In other embodiments of the present application, the information gain of the remaining purpose page features may also be recalculated; the purpose page features are then sorted by the recalculated information gain, and the lookup in the webpage to be identified follows the new order.
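The loop formed by S301 to S305 can be sketched as below. This is a simplification under stated assumptions: the correspondence is flattened to one webpage type per feature, the set `present` stands for the features found in the webpage to be identified, the priority list is already sorted so no re-sorting is needed, and all names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def identify_page_type(
    present: Set[str], priority: List[str], mapping: Dict[str, str]
) -> Optional[str]:
    """Look features up in descending priority and stop at the first hit."""
    for feature in priority:                  # S301: next highest-priority feature
        if feature in present and feature in mapping:   # S302: is it in the page?
            return mapping[feature]           # S303: type found, stop
    return None                               # S304: every feature has been tried


priority = ["many_chapter_links", "over_1000_words"]
flat_mapping = {
    "many_chapter_links": "catalogue page",
    "over_1000_words": "content page",
}
```

Because the loop returns at the first hit, a page matching the top-ranked feature is identified after a single lookup, which is where the time saving claimed for the method would come from.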
The webpage type identification method provided by the embodiments of the present application first records which of a plurality of purpose page features each sample web page of known webpage type contains, to obtain a statistical result of the sample web pages over the purpose page features. It then analyzes the result with a decision tree algorithm, to obtain a priority ranking of the purpose page features and a correspondence between purpose page features and webpage types; the priority ranking is the ranking of the purpose page features by how effectively they identify the webpage type. Finally, it looks up the purpose page features in the webpage to be identified, one after another in the order of the priority ranking, and determines the webpage type of the webpage to be identified from the lookup result and the correspondence between purpose page features and webpage types.
Compared with the prior art, the method uses the sample web pages to rank the purpose page features by effectiveness. When a webpage is identified, the more effective purpose page features are looked up first and the less effective ones afterwards, which improves recognition accuracy, shortens the time spent on identification, and improves recognition efficiency.
In addition, when the method is applied on a mobile phone or other mobile terminal, the background of the mobile terminal can obtain the priority ranking of the purpose page features and the correspondence between purpose page features and webpage types in advance, and then store them. When a foreground application on the mobile terminal, for example a browser, needs to identify the webpage type of a webpage to be identified, it can directly read the stored priority ranking and correspondence and identify the webpage type. This reduces the computational complexity of the foreground application and increases the speed at which it identifies webpage types.
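The split between a background step that prepares the model and a foreground step that only reads it can be sketched as follows. The patent does not specify a storage format; JSON, the file name, and the function names here are assumptions chosen for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def save_model(path: Path, priority: List[str], mapping: Dict[str, str]) -> None:
    """Background step: store the precomputed ranking and correspondence."""
    payload = {"priority": priority, "mapping": mapping}
    path.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")


def load_model(path: Path) -> Tuple[List[str], Dict[str, str]]:
    """Foreground step: read the stored model without recomputing anything."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return data["priority"], data["mapping"]


# Round trip in a temporary directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "page_type_model.json"
    save_model(
        model_path,
        ["many_chapter_links", "over_1000_words"],
        {"many_chapter_links": "catalogue page", "over_1000_words": "content page"},
    )
    loaded_priority, loaded_mapping = load_model(model_path)
```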
Fig. 7 is a schematic structural diagram of a webpage type identification device provided by an embodiment of the present application.
As shown in Fig. 7, the webpage type identification device includes:
A statistic unit 1, configured to record, for each of a plurality of sample web pages of known webpage type, whether the page contains each of a plurality of purpose page features, to obtain a statistical result;
An analysis unit 2, configured to analyze the known webpage types and the statistical result of the sample web pages with a decision tree algorithm, to obtain a priority ranking of the purpose page features and a correspondence between purpose page features and webpage types;
A webpage type determining unit 3, configured to look up the purpose page features in a webpage to be identified, one after another in the order of the priority ranking, and to determine the webpage type of the webpage to be identified from the lookup result and the correspondence.
As shown in Fig. 8, in the embodiment of the present application, statistic unit 1 may include:
First judging unit 11, configured to judge, one by one, whether the sample webpage includes each purpose page feature;
Recording unit 12, configured to record a first feature when the sample webpage includes the purpose page feature, and to record a second feature when the sample webpage does not include the purpose page feature;
Form construction unit 13, configured to build a form containing the first features and second features corresponding to all sample webpages, and to use the form as the statistical result.
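The behaviour of the statistic unit can be sketched as follows. The sketch assumes that the first feature is encoded as 1 and the second feature as 0, and that each purpose page feature is detected by a simple predicate on the page source; the function name `build_feature_table` and the detector names are hypothetical.

```python
def build_feature_table(samples, detectors):
    """Build the statistical form.

    samples:   list of (page_source, known_type) pairs
    detectors: dict mapping a feature name to a predicate on page_source
    Returns one row per sample webpage: 1 marks the first feature
    (present), 0 marks the second feature (absent).
    """
    table = []
    for page_source, known_type in samples:
        row = {name: 1 if detect(page_source) else 0
               for name, detect in detectors.items()}
        row["type"] = known_type  # known webpage type of the sample
        table.append(row)
    return table
```

Each row then carries the presence flags of all purpose page features together with the known webpage type, which is the input the analytic unit needs.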
As shown in Fig. 9, in the embodiment of the present application, analytic unit 2 may include:
Information gain computing unit 21, configured to calculate the information gain of each of the multiple purpose page features according to the form;
Sequencing unit 22, configured to sort the multiple purpose page features in descending order of information gain to obtain the priority ranking of the purpose page features;
Correspondence generation unit 23, configured to generate the correspondence between purpose page features and webpage types according to the known webpage types of the multiple sample webpages and the priority ranking of the purpose page features.
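The sorting and correspondence generation can be sketched as follows, assuming the information gains are already available as a dictionary. How the correspondence is derived is an assumption of this sketch: each feature is mapped to the most common known type among the sample webpages that contain it. The names `rank_features` and `build_mapping` are hypothetical.

```python
from collections import Counter


def rank_features(gains):
    """Sort feature names in descending order of information gain."""
    return sorted(gains, key=gains.get, reverse=True)


def build_mapping(rows, priority, label_key="type"):
    """Map each feature to the most common known type among the sample
    webpages that contain it (an assumed rule, for illustration only)."""
    mapping = {}
    for feature in priority:
        types = [row[label_key] for row in rows if row[feature] == 1]
        if types:
            mapping[feature] = Counter(types).most_common(1)[0][0]
    return mapping
```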
As shown in Fig. 10, in the embodiment of the present application, information gain computing unit 21 may include:
Ratio calculation unit 211, configured to calculate, according to the form, the ratio of the first feature and the ratio of the second feature corresponding to a purpose page feature;
First information entropy computing unit 212, configured to calculate the information entropy of the first feature and of the second feature respectively;
Conditional entropy computing unit 213, configured to calculate the conditional entropy of the purpose page feature according to the information entropy of the first feature and of the second feature;
Second information entropy computing unit 214, configured to calculate the information entropy of the purpose page feature according to the form;
Information gain computation subunit 215, configured to subtract the conditional entropy of the purpose page feature from the information entropy of the purpose page feature to obtain the information gain of the purpose page feature.
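The chain of subunits above corresponds roughly to the usual information gain computation in ID3-style decision trees. The sketch below uses that textbook definition, gain = H(type) - H(type | feature), which is an assumption and may differ in detail from the formulas of the application. Rows are assumed to be dictionaries with 1 for the first feature, 0 for the second feature, and a `type` key for the known webpage type.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of webpage types."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, feature, label_key="type"):
    """Entropy of the types minus the conditional entropy given the feature."""
    labels = [row[label_key] for row in rows]
    present = [row[label_key] for row in rows if row[feature] == 1]
    absent = [row[label_key] for row in rows if row[feature] == 0]
    # Conditional entropy: entropy of each branch weighted by its share
    # of the sample webpages.
    conditional = (len(present) / len(rows)) * entropy(present) \
        + (len(absent) / len(rows)) * entropy(absent)
    return entropy(labels) - conditional
```

A feature that separates the sample types perfectly receives the full entropy of the type distribution as its gain, while a feature that is independent of the type receives a gain of zero.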
As shown in Fig. 10, in the embodiment of the present application, webpage type determining unit 3 may include:
Purpose page feature searching unit 31, configured to search the webpage to be identified for the purpose page feature with the highest priority ranking;
Second judging unit 32, configured to judge whether the purpose page feature with the highest priority ranking exists in the webpage to be identified;
Webpage type searching unit 33, configured to, when the purpose page feature with the highest priority ranking exists in the webpage to be identified, look up in the correspondence the webpage type corresponding to the existing purpose page feature, and to take the webpage type found as the webpage type of the webpage to be identified.
When the purpose page feature with the highest priority ranking does not exist in the webpage to be identified, sequencing unit 22 may further sort, by information gain, the remaining purpose page features that have not yet been searched for in the webpage to be identified, and purpose page feature searching unit 31 further searches for the other purpose page features in turn in the webpage to be identified in descending order of priority ranking, until the webpage type of the webpage to be identified is found or until all purpose page features in the mapping table have been searched for.
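The lookup with fallback described above can be sketched as a single loop over the priority ranking. The function name `identify_page_type` and the representation of the webpage as a set of detected feature names are assumptions made for illustration.

```python
def identify_page_type(page_features, priority, mapping):
    """Return the webpage type of the first feature, in priority order,
    that is present in the page and has an entry in the correspondence.

    page_features: set of feature names detected in the webpage
    priority:      feature names, highest priority first
    mapping:       dict from feature name to webpage type
    Returns None when every feature has been tried without a match.
    """
    for feature in priority:
        if feature in page_features and feature in mapping:
            return mapping[feature]
    return None
```

Because the loop stops at the first match, a webpage that contains a high-priority feature is classified after very few lookups, which is the source of the time saving the description claims.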
In addition, in other embodiments of the present application, when the purpose page feature with the highest priority ranking does not exist in the webpage to be identified, analytic unit 2 may also recalculate the information gain of the remaining purpose page features, re-sort the purpose page features according to the recalculated information gain, and determine the correspondence between purpose page features and webpage types; purpose page feature searching unit 31 then searches for the re-sorted purpose page features in turn in the webpage to be identified, until all purpose page features in the mapping table have been searched for.
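One possible reading of this variant is sketched below. It assumes that, after a feature is found to be absent, the information gain is recomputed only on the sample webpages that also lack that feature, as in one branch of a decision tree, and that the type returned is the most common type among the samples containing the matched feature. Both rules, and all names, are assumptions of the sketch rather than statements of the application.

```python
from collections import Counter
from math import log2


def _entropy(labels):
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def _gain(rows, feature):
    labels = [row["type"] for row in rows]
    conditional = 0.0
    for value in (0, 1):
        subset = [row["type"] for row in rows if row[feature] == value]
        conditional += len(subset) / len(rows) * _entropy(subset)
    return _entropy(labels) - conditional


def identify_with_reranking(page_features, rows, features):
    """Pick the best remaining feature, test it, and re-rank on a miss."""
    remaining = list(features)
    while remaining and rows:
        best = max(remaining, key=lambda f: _gain(rows, f))
        if best in page_features:
            types = [row["type"] for row in rows if row[best] == 1]
            return Counter(types).most_common(1)[0][0] if types else None
        # Feature absent: keep only samples that also lack it, so the
        # next round recomputes the gains on the narrowed sample set.
        rows = [row for row in rows if row[best] == 0]
        remaining.remove(best)
    return None
```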
Compared with the prior art, the device can use the sample webpages to rank the multiple purpose page features by validity. When a webpage to be identified is identified, the purpose page features with higher validity are searched for first according to the ranking, and the purpose page features with lower validity are searched for afterwards. This improves recognition accuracy, shortens the time that identification takes, and improves recognition efficiency.
It can be understood that the present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and so on.
The present invention can be described in the general context of computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The present invention can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices that are connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
It should be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or also includes elements inherent to such a process, method, article or device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
The above is only an embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (4)
- 1. A webpage type identification method, characterized by comprising:
  counting, in each of multiple sample webpages of known webpage types, whether multiple purpose page features are included, and obtaining a statistical result, including: judging one by one whether the sample webpage includes each purpose page feature; when the sample webpage includes the purpose page feature, recording a first feature; when the sample webpage does not include the purpose page feature, recording a second feature different from the first feature; and building a form containing the first features and second features corresponding to all sample webpages, and using the form as the statistical result;
  analyzing the known webpage types and the statistical result of the multiple sample webpages using a decision tree algorithm, wherein: the information gain of each of the multiple purpose page features is calculated according to the form, the multiple purpose page features are sorted in descending order of information gain to obtain the priority ranking of the purpose page features, and the correspondence between purpose page features and webpage types is generated according to the known webpage types of the multiple sample webpages and the priority ranking of the purpose page features, wherein the step of calculating the information gain of each purpose page feature includes:
  calculating, according to the form, the ratio of the first feature and the ratio of the second feature corresponding to the purpose page feature, wherein the ratio of the first feature is the probability that the sample webpage types corresponding to the first feature are consistent, and the ratio of the second feature is the probability that the sample webpage types corresponding to the second feature are consistent;
  calculating the information entropy of the first feature and of the second feature respectively;
  calculating the conditional entropy of the purpose page feature according to the information entropy of the first feature and of the second feature, wherein: the information entropy of the first feature is multiplied by the ratio of the number of first features to the total number of sample webpages to obtain the conditional entropy of the first feature, the information entropy of the second feature is multiplied by the ratio of the number of second features to the total number of sample webpages to obtain the conditional entropy of the second feature, and the sum of the conditional entropy of the first feature and the conditional entropy of the second feature is used as the conditional entropy of the purpose page feature;
  calculating the information entropy of the purpose page feature according to the form, using a method different from the method used to calculate the information entropy of the first feature and of the second feature, wherein: first, a first ratio of the number of first features of the webpage type of the sample webpages to the total number of sample webpages and a second ratio of the number of second features of the webpage type of the sample webpages to the total number of sample webpages are calculated; then the base-2 logarithm of the first ratio and the base-2 logarithm of the second ratio are computed respectively, and the sum of the two resulting logarithm values is used as the information entropy of the purpose page feature;
  subtracting the conditional entropy of the purpose page feature from the information entropy of the purpose page feature to obtain the information gain of the purpose page feature;
  searching for the purpose page features in turn in the webpage to be identified according to the priority ranking, and determining the webpage type of the webpage to be identified according to the lookup result and the correspondence, including:
  searching whether the purpose page feature with the highest priority ranking exists in the webpage to be identified; when the purpose page feature with the highest priority ranking exists, looking up in the correspondence the webpage type corresponding to the existing purpose page feature, and taking the webpage type found as the webpage type of the webpage to be identified;
  when the purpose page feature with the highest priority ranking does not exist in the webpage to be identified, searching for the other purpose page features in turn in the webpage to be identified in descending order of priority ranking, until the webpage type of the webpage to be identified is found.
- 2. The method according to claim 1, characterized in that the correspondence between purpose page features and webpage types includes: one page feature corresponding to a webpage type, and/or a combination of multiple page features corresponding to a webpage type.
- 3. A webpage type identification device, characterized by comprising:
  a statistic unit, configured to count, in each of multiple sample webpages of known webpage types, whether multiple purpose page features are included, and to obtain a statistical result, the statistic unit including: a first judging unit, configured to judge one by one whether the sample webpage includes each purpose page feature; a recording unit, configured to record a first feature when the sample webpage includes the purpose page feature, and to record a second feature different from the first feature when the sample webpage does not include the purpose page feature; and a form construction unit, configured to build a form containing the first features and second features corresponding to all sample webpages, and to use the form as the statistical result;
  an analytic unit, configured to analyze the known webpage types and the statistical result of the multiple sample webpages using a decision tree algorithm, and to obtain the priority ranking of the purpose page features and the correspondence between purpose page features and webpage types;
  a webpage type determining unit, configured to search for the purpose page features in turn in the webpage to be identified according to the priority ranking, and to determine the webpage type of the webpage to be identified according to the lookup result and the correspondence, including: searching whether the purpose page feature with the highest priority ranking exists in the webpage to be identified; when the purpose page feature with the highest priority ranking exists, looking up in the correspondence the webpage type corresponding to the existing purpose page feature, and taking the webpage type found as the webpage type of the webpage to be identified; when the purpose page feature with the highest priority ranking does not exist in the webpage to be identified, searching for the other purpose page features in turn in the webpage to be identified in descending order of priority ranking, until the webpage type of the webpage to be identified is found;
  wherein the analytic unit includes:
  an information gain computing unit, configured to calculate the information gain of each of the multiple purpose page features according to the form;
  a sequencing unit, configured to sort the multiple purpose page features in descending order of information gain to obtain the priority ranking of the purpose page features;
  a correspondence generation unit, configured to generate the correspondence between purpose page features and webpage types according to the known webpage types of the multiple sample webpages and the priority ranking of the purpose page features;
  wherein the information gain computing unit includes:
  a ratio calculation unit, configured to calculate, according to the form, the ratio of the first feature and the ratio of the second feature corresponding to the purpose page feature, wherein the ratio of the first feature is the probability that the sample webpage types corresponding to the first feature are consistent, and the ratio of the second feature is the probability that the sample webpage types corresponding to the second feature are consistent;
  a first information entropy computing unit, configured to calculate the information entropy of the first feature and of the second feature respectively;
  a conditional entropy computing unit, configured to calculate the conditional entropy of the purpose page feature according to the information entropy of the first feature and of the second feature, including: multiplying the information entropy of the first feature by the ratio of the number of first features to the total number of sample webpages to obtain the conditional entropy of the first feature, multiplying the information entropy of the second feature by the ratio of the number of second features to the total number of sample webpages to obtain the conditional entropy of the second feature, and adding the conditional entropy of the first feature and the conditional entropy of the second feature and using the sum as the conditional entropy of the purpose page feature;
  a second information entropy computing unit, configured to calculate the information entropy of the purpose page feature according to the form, using a method different from the method used to calculate the information entropy of the first feature and of the second feature, wherein: first, a first ratio of the number of first features of the webpage type of the sample webpages to the total number of sample webpages and a second ratio of the number of second features of the webpage type of the sample webpages to the total number of sample webpages are calculated; then the base-2 logarithm of the first ratio and the base-2 logarithm of the second ratio are computed respectively, and the sum of the two resulting logarithm values is used as the information entropy of the purpose page feature;
  an information gain computation subunit, configured to subtract the conditional entropy of the purpose page feature from the information entropy of the purpose page feature to obtain the information gain of the purpose page feature.
- 4. The device according to claim 3, characterized in that the correspondence between purpose page features and webpage types includes: one page feature corresponding to a webpage type, and/or a combination of multiple page features corresponding to a webpage type.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310476416.6A CN103577547B (en) | 2013-10-12 | 2013-10-12 | Webpage type identification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310476416.6A CN103577547B (en) | 2013-10-12 | 2013-10-12 | Webpage type identification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103577547A CN103577547A (en) | 2014-02-12 |
CN103577547B true CN103577547B (en) | 2017-11-10 |
Family
ID=50049323
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310476416.6A Active CN103577547B (en) | 2013-10-12 | 2013-10-12 | Webpage type identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103577547B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294485B (en) * | 2015-06-05 | 2019-11-01 | 华为技术有限公司 | Determine the method and device in significant place |
CN108345599B (en) * | 2017-01-23 | 2021-12-14 | 阿里巴巴集团控股有限公司 | Webpage type determination method and device and computer readable medium |
CN109559141A (en) * | 2017-09-27 | 2019-04-02 | 北京国双科技有限公司 | A kind of automatic classification method, the apparatus and system of intention pattern |
CN108228463B (en) * | 2018-01-10 | 2021-09-21 | 百度在线网络技术(北京)有限公司 | Method and device for detecting first screen time |
CN110750739B (en) * | 2018-07-04 | 2022-07-05 | 北京国双科技有限公司 | Page type determination method and device |
CN110336835B (en) * | 2019-08-05 | 2021-10-19 | 深信服科技股份有限公司 | Malicious behavior detection method, user equipment, storage medium and device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080244715A1 (en) * | 2007-03-27 | 2008-10-02 | Tim Pedone | Method and apparatus for detecting and reporting phishing attempts |
CN101251855B (en) * | 2008-03-27 | 2010-12-22 | 腾讯科技(深圳)有限公司 | Equipment, system and method for cleaning internet web page |
CN101650715B (en) * | 2008-08-12 | 2011-06-29 | 厦门市美亚柏科信息股份有限公司 | Method and device for screening links on web pages |
CN101534306B (en) * | 2009-04-14 | 2012-01-11 | 深圳市腾讯计算机系统有限公司 | Detecting method and a device for fishing website |
CN101794311B (en) * | 2010-03-05 | 2012-06-13 | 南京邮电大学 | Fuzzy data mining based automatic classification method of Chinese web pages |
Non-Patent Citations (1)
Title |
---|
Wang Hongwei, "Research on Classification Algorithms Based on Decision Trees" (《基于决策树的分类算法研究》), 《软件导论》, 2007-09-30 (No. 17), pp. 134-135 * |
Also Published As
Publication number | Publication date |
---|---|
CN103577547A (en) | 2014-02-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103577547B (en) | Webpage type identification method and device | |
EP4016432A1 (en) | Method and apparatus for training fusion ordering model, search ordering method and apparatus, electronic device, storage medium, and program product | |
US9767183B2 (en) | Method and system for enhanced query term suggestion | |
CN102612691B (en) | Method and system for scoring texts | |
CN107704503A (en) | User's keyword extracting device, method and computer-readable recording medium | |
CN107807915B (en) | Error correction model establishing method, device, equipment and medium based on error correction platform | |
WO2020233344A1 (en) | Searching method and apparatus, and storage medium | |
CN110874528B (en) | Text similarity obtaining method and device | |
CN103559313B (en) | Searching method and device | |
WO2011134104A1 (en) | Method, system and appartus for selecting acronym expansion | |
CN113127621A (en) | Dialogue module pushing method, device, equipment and storage medium | |
CN113204953A (en) | Text matching method and device based on semantic recognition and device readable storage medium | |
US10043511B2 (en) | Domain terminology expansion by relevancy | |
CN113791837B (en) | Page processing method, device, equipment and storage medium | |
CN113569118A (en) | Self-media pushing method and device, computer equipment and storage medium | |
US20130282628A1 (en) | Method and Apparatus for Performing Dynamic Textual Complexity Analysis Using Machine Learning Artificial Intelligence | |
CN111125543B (en) | Training method of book recommendation sequencing model, computing device and storage medium | |
CN108052520A (en) | Conjunctive word analysis method, electronic device and storage medium based on topic model | |
CN114647739B (en) | Entity chain finger method, device, electronic equipment and storage medium | |
CN113641767B (en) | Entity relation extraction method, device, equipment and storage medium | |
CN116383340A (en) | Information searching method, device, electronic equipment and storage medium | |
CN113792230B (en) | Service linking method, device, electronic equipment and storage medium | |
CN113486169B (en) | Synonymous statement generation method, device, equipment and storage medium based on BERT model | |
JP2024507029A (en) | Web page identification methods, devices, electronic devices, media and computer programs | |
CN109684467A (en) | A kind of classification method and device of text |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 2020-05-27. Address after: 310051 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province. Patentee after: Alibaba (China) Co.,Ltd. Address before: 100080, room 16, building 10-20, Building 29, Haidian District, Suzhou Street, Beijing. Patentee before: UC MOBILE Ltd. |