CN107545020A - Method and device for determining web page classification - Google Patents

Method and device for determining web page classification

Info

Publication number
CN107545020A
Authority
CN
China
Prior art keywords
web page
classification
webpage
sorted
link web
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710326233.4A
Other languages
Chinese (zh)
Inventor
张惊申
卢俞虹
任方英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Security Technologies Co Ltd
Original Assignee
New H3C Security Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Security Technologies Co Ltd
Priority to CN201710326233.4A
Publication of CN107545020A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present application provides a method and device for determining web page classification, relating to the field of network communication technology. The method includes: determining a web page to be classified; obtaining the out-link web pages of the web page to be classified, where an out-link web page is a web page whose content contains the address of the web page to be classified; determining the reference classification of each out-link web page according to a preset classification mode; and determining the classification of the web page to be classified according to the reference classifications of the out-link web pages. The scheme provided by the embodiment can improve the accuracy of the determined web page classification.

Description

Method and device for determining web page classification
Technical field
The present application relates to the field of network communication technology, and in particular to a method and device for determining web page classification.
Background
The number of web pages on the network is very large, and these pages may belong to many types, such as news, education, sports, and shopping. Web page classification is currently used in various scenarios, for example in web page filtering or in building a web page classification library. In web page filtering, the classification of a web page usually has to be determined first, and the page is then filtered according to that classification.
In the prior art, the classification of a web page is determined by first obtaining the title information of the web page to be classified, then matching the title information against a preset classification dictionary, and determining the classification to which the page belongs from the matching result. The classification dictionary typically stores each classification and the keywords of each classification.
In general, this method can accurately determine the classification of ordinary web pages. However, the title information of many current web pages is very broad in scope, so the title does not reflect the type of the page well. In such cases, determining the classification of these pages with the above method may introduce errors, and the accuracy of the determined classification is not high enough.
Summary of the invention
The purpose of the embodiments of the present application is to provide a method and device for determining web page classification, so as to improve the accuracy of the determined web page classification. The specific technical scheme is as follows.
To achieve the above purpose, an embodiment of the present application discloses a method for determining web page classification, the method including:
determining a web page to be classified;
obtaining the out-link web pages of the web page to be classified, where an out-link web page is a web page whose content contains the address of the web page to be classified;
determining the reference classification of each out-link web page according to a preset classification mode;
determining the classification of the web page to be classified according to the reference classifications of the out-link web pages.
To achieve the above purpose, an embodiment of the present application discloses a device for determining web page classification, the device including:
a web page determining module, configured to determine a web page to be classified;
an out-link obtaining module, configured to obtain the out-link web pages of the web page to be classified, where an out-link web page is a web page whose content contains the address of the web page to be classified;
a reference determining module, configured to determine the reference classification of each out-link web page according to a preset classification mode;
a classification determining module, configured to determine the classification of the web page to be classified according to the reference classifications of the out-link web pages.
The method and device provided by the embodiments of the present application obtain the out-link web pages of the web page to be classified, determine the reference classification of each out-link web page according to a preset classification mode, and determine the classification of the web page to be classified from those reference classifications. Because out-link web pages are usually associated with the web page to be classified, determining the classification from the reference classifications of the out-link web pages, rather than directly from the web page to be classified itself, can improve the accuracy of the determined web page classification.
Brief description of the drawings
To illustrate the technical schemes of the embodiments of the present application or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for determining web page classification provided by an embodiment of the present application;
Fig. 2 is a schematic flowchart of step S104 in Fig. 1;
Fig. 3 is a schematic structural diagram of a device for determining web page classification provided by an embodiment of the present application.
Detailed description
The technical schemes in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The embodiments of the present application provide a method and device for determining web page classification that can improve the accuracy of the determined web page classification. The application is described in detail below through specific embodiments.
Fig. 1 is a schematic flowchart of a method for determining web page classification provided by an embodiment of the present application. The method is applied to an electronic device, which may be a gateway device such as a router or switch, or a device such as an ordinary computer, tablet, or smartphone. The method includes the following steps:
Step S101: Determine the web page to be classified.
The web page to be classified may be determined ad hoc or determined from a preset web page library, which stores web pages. Specifically, this embodiment may determine the web page to be classified by determining its address. The address of the web page to be classified includes a uniform resource locator (Uniform Resource Locator, URL). The web page addresses mentioned below may likewise include URLs.
Step S102: Obtain the out-link web pages of the web page to be classified, where an out-link web page is a web page whose content contains the address of the web page to be classified.
Specifically, the out-link web pages of the web page to be classified may be obtained from a preset out-link web page relation library. The relation library stores each web page and its corresponding out-link web pages. It may also store the address or web page information of each out-link web page. As an example, Table 1 lists web pages and their corresponding out-link web pages, together with the address and web page information of each out-link web page.
Table 1
Web page | Out-link web page | Address of out-link web page | Web page information of out-link web page
Web page 1 | Web page 4 | abc.com | A recruitment website
Web page 1 | Web page 5 | Sdc.gov | Provides human resources services
Web page 1 | Web page 6 | Syds.com | More professional talent recruitment
Web page 2 | Web page 3 | 112.com | Provides the fastest and most professional sports news
Web page 2 | Web page 7 | Yyy.com | Sports event coverage
Web page 2 | Web page 8 | A11.com | Please follow the Masters tournament
According to the out-link web page relation library shown in Table 1, when the web page to be classified is web page 1, the out-link web pages obtained from Table 1 for web page 1 are web page 4, web page 5, and web page 6. Obtaining the out-link web pages from a preset relation library improves the efficiency of obtaining them.
Specifically, the relation library may be built in advance as follows: obtain the in-link web pages of each sample web page and generate the correspondence between sample web pages and in-link web pages, where the in-link web pages of a sample web page are the web pages corresponding to the addresses of other web pages that appear in the content of that sample web page. For an in-link web page, the sample web page corresponding to it is an out-link web page of that in-link web page. Accordingly, when the relation library shown in Table 1 is built, the web pages in the library are in-link web pages and the out-link web pages are sample web pages. The sample web pages may be the web pages listed in a web navigation site, for example the hao123 site directory, the Sogou site directory, or the 2345 site directory.
It can be understood that an in-link web page determined in this way can itself be used as a sample web page, and its in-link web pages determined in turn, so as to establish more out-link web page relations.
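The extraction of in-link addresses from a sample web page can be sketched as follows. This is a minimal illustration using only the Python standard library; the class name `LinkCollector`, the function name `extract_in_link_hosts`, and the choice to compare addresses at host level are assumptions made for the sketch, not details given in the application.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_in_link_hosts(page_url, html):
    """Return the hosts linked from the page, excluding the page's own host."""
    collector = LinkCollector()
    collector.feed(html)
    own_host = urlparse(page_url).netloc
    hosts = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL before comparing hosts.
        host = urlparse(urljoin(page_url, href)).netloc
        if host and host != own_host:
            hosts.add(host)
    return hosts


sample_html = (
    '<a href="http://a.com/x">A</a> '
    '<a href="/about">self</a> '
    '<a href="http://d.com">D</a>'
)
print(sorted(extract_in_link_hosts("http://page1.example/index.html", sample_html)))
# ['a.com', 'd.com']
```

Fetching the page content itself (the crawler step) is left out of the sketch; any HTTP client could supply the `html` string.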
As an example, suppose the sample web pages are web page 1, web page 2, ..., web page 1000. Take web page 1 as an example. The content of web page 1 is obtained using crawler technology, and web page addresses including a.com, d.com, c.com, and e.com are extracted from that content; the extracted addresses do not include the address of web page 1 itself. Assuming the web pages corresponding to these addresses are web page 21, web page 30, web page 33, and web page 55 respectively, these four pages are the in-link web pages corresponding to web page 1. The web page information of web page 1 can also be extracted and stored at the same time. Performing these operations on web page 1, web page 2, ..., web page 1000 yields the correspondence between sample web pages and in-link web pages shown in Table 2.
Table 2
Sample web page | In-link web pages
Web page 1 | Web page 21, web page 30, web page 33, web page 55
Web page 2 | Web page 5
Web page 3 | Web page 6, web page 2
Web page 4 | Web page 1, web page 30, web page 33, web page 55
Web page 5 | Web page 1, web page 90
Web page 6 | Web page 1, web page 70
…… | ……
Web page 1000 | Web page 700, web page 20, web page 303, web page 57
Once the correspondence between sample web pages and in-link web pages shown in Table 2 is available, each web page and its corresponding out-link web pages can be obtained. For example, to obtain the out-link web pages of web page 1, look up whether web page 1 appears in the in-link web page column of Table 2. The lookup shows that web page 1 appears among the in-link web pages of web page 4, web page 5, and web page 6, so these three pages are determined to be the out-link web pages of web page 1. The out-link web pages of other web pages can be determined by a similar process.
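The lookup described above amounts to inverting the sample-to-in-link mapping. A minimal sketch, using the rows of Table 2 that are spelled out in the text; the function name `build_out_link_index` is illustrative only:

```python
from collections import defaultdict


def build_out_link_index(in_links):
    """Invert {sample page: [in-link pages]} into {page: {out-link pages}}."""
    out_links = defaultdict(set)
    for sample_page, linked_pages in in_links.items():
        for linked in linked_pages:
            # The sample page links to `linked`, so it is an out-link page of `linked`.
            out_links[linked].add(sample_page)
    return out_links


in_links = {
    "Web page 1": ["Web page 21", "Web page 30", "Web page 33", "Web page 55"],
    "Web page 2": ["Web page 5"],
    "Web page 3": ["Web page 6", "Web page 2"],
    "Web page 4": ["Web page 1", "Web page 30", "Web page 33", "Web page 55"],
    "Web page 5": ["Web page 1", "Web page 90"],
    "Web page 6": ["Web page 1", "Web page 70"],
}

index = build_out_link_index(in_links)
print(sorted(index["Web page 1"]))
# ['Web page 4', 'Web page 5', 'Web page 6']
```

Building the index once avoids scanning the whole in-link column for every web page to be classified.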
As another embodiment, the out-link web pages of the web page to be classified may also be obtained directly as follows: obtain the in-link web pages of each preset sample web page, and determine the sample web pages whose in-link web pages include the web page to be classified as the out-link web pages of the web page to be classified.
For how to determine these sample web pages as the out-link web pages of the web page to be classified, refer to the above process of determining out-link web pages from Table 2; details are not repeated here.
Because a massive number of web pages exist in practice, and to make this embodiment easier to implement, in one embodiment only some web pages and their corresponding out-link web pages are stored in the out-link web page relation library in advance. Step S102 then specifically includes: judging whether the out-link web pages of the web page to be classified exist in the relation library; if so, obtaining them from the relation library; if not, obtaining the out-link web pages of the web page to be classified directly and adding them to the relation library. In this way, when the out-link web pages of that web page are needed again later, they can be obtained directly from the relation library.
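The lookup-then-fallback behaviour of step S102 can be sketched as a small cache wrapper. The dictionary-backed store and the `fetch_out_links` callback are assumptions for illustration; the application does not specify how the relation library is stored or how direct acquisition is implemented.

```python
def get_out_links(page, relation_store, fetch_out_links):
    """Return the out-link pages of `page`, consulting the relation library first."""
    if page in relation_store:
        return relation_store[page]
    # Not stored yet: obtain the out-link pages directly (hypothetical callback).
    found = fetch_out_links(page)
    # Add the result to the library so later requests are served from it.
    relation_store[page] = found
    return found


store = {"Web page 1": ["Web page 4", "Web page 5", "Web page 6"]}
calls = []


def fake_fetch(page):
    """Stand-in for direct acquisition; records which pages were fetched."""
    calls.append(page)
    return ["Web page 3"]


print(get_out_links("Web page 1", store, fake_fetch))  # served from the library
print(get_out_links("Web page 2", store, fake_fetch))  # fetched, then stored
print(calls)  # ['Web page 2']
```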
Step S103: Determine the reference classification of each out-link web page according to a preset classification mode.
Specifically, determining the reference classification of each out-link web page according to a preset classification mode may include the following embodiments:
Mode one: Obtain the address of each out-link web page, extract the corresponding address feature from each address, and determine the reference classification of each out-link web page according to the extracted address feature and a preset correspondence between address features and classifications. For example, two kinds of correspondence between address features and classifications may be as shown in Table 3.
Table 3
In Table 3, the reference classification of an out-link web page can be determined from the correspondence between address feature 1 and classification 1 on the left, or from the correspondence between address feature 2 and classification 2 on the right. For example, when the obtained address feature is .edu, the reference classification of the out-link web page can be determined from Table 3 to be "Education".
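A minimal sketch of mode one follows, assuming the address feature is a host suffix. Only the ".edu" to "Education" pair comes from the text; the ".gov" entry, the mapping name, and the function name are hypothetical placeholders, since the contents of Table 3 are not reproduced here.

```python
from urllib.parse import urlparse

# Only ".edu" -> "Education" is taken from the text. The ".gov" entry is a
# hypothetical placeholder, because the contents of Table 3 are not available.
ADDRESS_FEATURE_TO_CLASS = {
    ".edu": "Education",
    ".gov": "Government",  # hypothetical entry
}


def classify_by_address(address):
    """Return the reference classification for an address, or None if no feature matches."""
    # Accept bare hosts such as "abc.com" as well as full URLs.
    parsed = urlparse(address if "://" in address else "//" + address)
    host = parsed.hostname or ""
    for feature, label in ADDRESS_FEATURE_TO_CLASS.items():
        if host.endswith(feature):
            return label
    return None


print(classify_by_address("http://www.example.edu/courses"))  # Education
print(classify_by_address("abc.com"))  # None
```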
Mode two: Obtain the web page information corresponding to each out-link web page, and determine the reference classification of each out-link web page according to the obtained web page information and a preset classification dictionary. The web page information may include at least one of the web page title, web page keywords, and web page description. These three kinds of information are commonly called the three elements of a web page and describe information such as the purpose and field of the page. The web page information may of course also include the page content. Compared with the page content, the three elements summarize the page better and involve less data. Web crawler technology may be used to obtain the web page information corresponding to each out-link web page.
The classification dictionary stores each classification, the words (keywords) each classification contains, and the weight of each word. The weight of a word indicates how closely the word is associated with the classification: the larger the weight, the closer the word is to the classification.
Specifically, in mode two, when determining the reference classification of each out-link web page according to the obtained web page information and the preset classification dictionary, the following steps 1 and 2 may be performed for any one of the out-link web pages (called the target out-link web page for convenience of description):
Step 1: According to the preset classification dictionary, determine the score T_i of the target out-link web page on the i-th classification in the classification dictionary as follows:
T_i = \sum_j (W_j \cdot K_j)
where W_j is the weight of the j-th word contained in the i-th classification in the classification dictionary, and K_j is the number of occurrences of the j-th word in the web page information corresponding to the target out-link web page.
Step 2: Determine the classification with the largest score as the reference classification of the target out-link web page.
As an example, suppose the classification dictionary contains three classifications 1, 2, and 3. The words W1 and W2 of classification 1 have weights 0.6 and 0.4; the words W3, W4, and W5 of classification 2 have weights 0.3, 0.4, and 0.3; and the words W6 and W7 of classification 3 have weights 0.7 and 0.3. Suppose the web page information {information} of out-link web page WebA has been obtained. The reference classification of WebA can then be determined as follows:
First, count the occurrences of each word in the web page information {information} of WebA. The result is that W1, W2, W3, W4, W5, W6, and W7 occur K1=0, K2=1, K3=1, K4=3, K5=2, K6=0, and K7=1 times respectively.
Second, calculate the score of WebA on each classification:
Classification 1: T1 = (weight of W1)*K1 + (weight of W2)*K2 = 0.6*0 + 0.4*1 = 0.4;
Classification 2: T2 = (weight of W3)*K3 + (weight of W4)*K4 + (weight of W5)*K5 = 0.3*1 + 0.4*3 + 0.3*2 = 2.1;
Classification 3: T3 = (weight of W6)*K6 + (weight of W7)*K7 = 0.7*0 + 0.3*1 = 0.3.
Finally, classification 2, which has the highest score, is determined as the reference classification of WebA.
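The worked example can be reproduced with a short sketch. The nested-dictionary layout of the classification dictionary and the function name `score_classes` are assumptions for illustration; the weights and occurrence counts are those given in the example.

```python
def score_classes(dictionary, counts):
    """Compute T_i = sum_j(W_j * K_j) for every classification in the dictionary."""
    return {
        cls: sum(weight * counts.get(word, 0) for word, weight in words.items())
        for cls, words in dictionary.items()
    }


# Classification -> {word: weight}, as in the example.
dictionary = {
    1: {"W1": 0.6, "W2": 0.4},
    2: {"W3": 0.3, "W4": 0.4, "W5": 0.3},
    3: {"W6": 0.7, "W7": 0.3},
}
# Occurrence count K_j of each word in the web page information of WebA.
counts = {"W1": 0, "W2": 1, "W3": 1, "W4": 3, "W5": 2, "W6": 0, "W7": 1}

scores = score_classes(dictionary, counts)
best = max(scores, key=scores.get)
print({cls: round(score, 2) for cls, score in scores.items()})
print(best)  # 2
```

Scores are rounded only for display, since floating-point sums such as 0.3 + 1.2 + 0.6 may differ from 2.1 in the last digits.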
The classification dictionary may be an existing one; an existing classification dictionary typically stores each classification, the representative words of each classification, and the weight of each word. The classification dictionary may also be created in advance. Specifically, creating the classification dictionary may include:
First, determine the classifications, such as sports, shopping, travel, and finance. Second, determine the sample web pages corresponding to each classification; for example, "Sina Sports", "Sohu Sports", and "Tencent Sports" may be determined as sample web pages of the sports classification. Then, obtain the web page information of each sample web page and segment it into words to obtain the candidate words corresponding to each sample web page. Finally, select the words for each classification from the candidate words, and determine the weight of each word with a machine learning method according to the number of times the word occurs among all the segmented words. As an example, Table 4 lists classification names, words, and weights contained in a classification dictionary.
Table 4
No. | Classification name | Word | Weight
1 | Education | Course | 5.602
2 | Education | Reading | 5.678
3 | Education | English | 6.272
4 | Sports | Table tennis | 6.505
5 | Sports | Masters tournament | 6.683
As Table 4 shows, the classification dictionary consists of classifications, the words of each classification, and the weights of the words. For example, the education classification contains the three words "course", "reading", and "English"; of these, "English" has the largest weight in the classification and "course" the smallest.
Step S104: Determine the classification of the web page to be classified according to the reference classifications of the out-link web pages.
Specifically, determining the classification of the web page to be classified according to the reference classifications of the out-link web pages may include the following embodiments:
Mode one: Directly determine the reference classifications of the out-link web pages as the classification of the web page to be classified.
It can be understood that when there is only one out-link web page, its reference classification can be directly determined as the classification of the web page to be classified. When there are many out-link web pages and their reference classifications are all the same, that reference classification can be directly determined as the classification of the web page to be classified. Alternatively, when there are many out-link web pages and the number of distinct reference classifications is less than a preset threshold, the reference classifications can also be directly determined as the classifications of the web page to be classified, in which case the web page to be classified has more than one classification. For example, if the web page to be classified has 10 out-link web pages whose reference classifications fall into 2 classifications, and the preset threshold is 3, these 2 reference classifications can both be taken as classifications of the web page to be classified.
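Mode one can be sketched as a check on the number of distinct reference classifications. The function name, the default threshold, and the choice to return `None` when the rule does not apply are assumptions for illustration.

```python
def classify_direct(reference_classes, threshold=3):
    """Adopt the reference classifications directly when fewer than `threshold` kinds occur.

    Returns the sorted list of adopted classifications, or None when there are
    no reference classifications or too many distinct kinds.
    """
    kinds = sorted(set(reference_classes))
    if kinds and len(kinds) < threshold:
        return kinds
    return None


# Ten out-link pages falling into two classifications, threshold 3.
refs = ["news"] * 6 + ["sports"] * 4
print(classify_direct(refs, threshold=3))  # ['news', 'sports']
```

When `None` is returned, one of the other modes (counting or weighting) would have to decide the classification.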
Mode two: Determine the first occurrence count of each classification in the classification dictionary within a first reference classification group, where the first reference classification group contains the reference classifications of the out-link web pages; determine the classification with the largest first occurrence count as the classification of the web page to be classified.
It can be understood that when there are many out-link web pages and also many distinct reference classifications, the occurrence count of each classification can be calculated and the classification with the largest count determined as the classification of the web page to be classified.
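Mode two is a simple majority count over the reference classifications. In the sketch below, returning every classification tied for the largest count is an assumption; the text does not say how ties are resolved in this mode.

```python
from collections import Counter


def classify_by_count(reference_classes):
    """Return the classification(s) that occur most often among the reference classifications."""
    counts = Counter(reference_classes)
    if not counts:
        return []
    top = max(counts.values())
    # Assumption: all classifications tied for the largest count are returned.
    return sorted(cls for cls, n in counts.items() if n == top)


print(classify_by_count(["sports", "sports", "news", "education"]))  # ['sports']
```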
Mode three: Obtain the website weight corresponding to each out-link web page, take the obtained website weight as the weight of that out-link web page, and determine the classification of the web page to be classified according to the reference classifications and the weights of the out-link web pages.
The website weight corresponding to each out-link web page can be obtained directly with dedicated webmaster tools, for example the tools provided by sites such as Aizhan (爱站网) and Webmaster Tools (站长工具). Note that a website can contain multiple web pages, and multiple web pages belonging to the same website share the same website weight. The website weight is an authority value that search engines assign to a website; it is related to factors such as site architecture, domain name type, inbound links, page content, number of indexed pages, keyword ranking, and update frequency.
It can be understood that the higher the website weight, the larger the weight of the out-link web page, and the lower the website weight, the smaller the weight of the out-link web page. Note that the website weight indirectly reflects how well the website hosting the page is maintained: the higher the website weight, the better the maintenance, the more accurate the web page information of the corresponding page, and the more accurate the reference classification determined from that information. In other words, when the weight of an out-link web page is larger, the confidence in its reference classification is also higher.
Specifically, when determining the classification of the web page to be classified according to the reference classifications and the weights of the out-link web pages, the following steps 1 and 2 may be performed:
Step 1: Determine the total score O_i of the web page to be classified on the i-th classification according to O_i = \sum_n (y_n \cdot M_n), where M_n is the weight of the n-th out-link web page, and y_n is 1 when the reference classification of the n-th out-link web page is the i-th classification and 0 otherwise.
Step 2: Determine the classification with the largest total score as the classification of the web page to be classified.
As an example, suppose the out-link web pages of the web page to be classified are WebA, WebB, and WebC, with weights 0.4, 0.3, and 0.3 respectively, and the classification dictionary contains three classifications 1, 2, and 3. The reference classifications of WebA are classification 1 and classification 2, those of WebB are classification 1 and classification 3, and that of WebC is classification 2. To determine the classification of the web page to be classified, first calculate the total score of each classification:
Classification 1: O1 = (weight of WebA)*1 + (weight of WebB)*1 + (weight of WebC)*0 = 0.4*1 + 0.3*1 + 0.3*0 = 0.7;
Classification 2: O2 = (weight of WebA)*1 + (weight of WebB)*0 + (weight of WebC)*1 = 0.4*1 + 0.3*0 + 0.3*1 = 0.7;
Classification 3: O3 = (weight of WebA)*0 + (weight of WebB)*1 + (weight of WebC)*0 = 0.4*0 + 0.3*1 + 0.3*0 = 0.3.
From these results, classification 1 and classification 2 have the largest total scores, so classification 1 and classification 2 are determined as the classifications of the web page to be classified.
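The weighted scoring of mode three can be sketched with the numbers from this example. The input layout (a list of weight and classification-set pairs) and the function name are assumptions for illustration; tied top scores are all returned, as in the example.

```python
import math
from collections import defaultdict


def classify_by_weight(out_links):
    """Compute O_i = sum_n(y_n * M_n) and return the classification(s) with the largest total."""
    totals = defaultdict(float)
    for weight, classes in out_links:
        for cls in classes:
            # y_n is 1 for each reference classification of this page, 0 otherwise.
            totals[cls] += weight
    if not totals:
        return []
    top = max(totals.values())
    # Compare with a tolerance so that floating-point noise does not break ties.
    return sorted(cls for cls, score in totals.items() if math.isclose(score, top))


out_links = [
    (0.4, {1, 2}),  # WebA
    (0.3, {1, 3}),  # WebB
    (0.3, {2}),     # WebC
]
print(classify_by_weight(out_links))  # [1, 2]
```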
As described above, the method for determining web page classification provided by this embodiment obtains the out-link web pages of the web page to be classified, determines the reference classification of each out-link web page according to a preset classification mode, and determines the classification of the web page to be classified from those reference classifications. Because out-link web pages are usually associated with the web page to be classified, determining the classification from the reference classifications of the out-link web pages, rather than directly from the web page to be classified itself, can improve the accuracy of the determined web page classification.
It can be understood that when the web page to be classified has many out-link web pages, determining its classification from their reference classifications improves the accuracy of the determined classification from a big-data perspective.
In one embodiment based on the embodiment shown in Fig. 1, step S104, determining the classification of the web page to be classified according to the reference classifications of the out-link web pages, may be carried out according to the flowchart shown in Fig. 2 and specifically includes steps S104A and S104B:
Step S104A: Determine the reference classification of the web page to be classified according to the above classification mode.
To improve the accuracy of the determined classification, this embodiment first determines the reference classification of the web page to be classified using the same classification mode as in step S103, and then determines the final classification from the reference classifications of the out-link web pages together with the reference classification of the web page to be classified.
Step S104B: Determine the classification of the web page to be classified according to the reference classifications of the out-link web pages and the reference classification of the web page to be classified.
Specifically, when the reference classifications of the out-link web pages are the same as the reference classification of the web page to be classified, that reference classification is directly determined as the classification of the web page to be classified.
When the reference classifications of the out-link web pages differ from the reference classification of the web page to be classified, the classification may be determined as follows:
Determine the second occurrence count of each classification in the classification dictionary within a second reference classification group, where the second reference classification group contains the reference classifications of the out-link web pages and the reference classification of the web page to be classified, and determine the classification with the largest second occurrence count as the classification of the web page to be classified. The classification dictionary is the one mentioned in step S103.
It can be understood that when the second reference classification group contains many distinct reference classifications, the occurrence count of each classification can be calculated and the classification with the largest count determined as the classification of the web page to be classified.
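Step S104B can be sketched as the same occurrence count applied to a group that also contains the page's own reference classification. The function name and the handling of ties (all tied classifications are returned) are assumptions for illustration.

```python
from collections import Counter


def classify_with_own_reference(out_link_classes, own_class):
    """Count occurrences over the out-link reference classifications plus the page's own."""
    # Second reference classification group: out-link references plus the page's own.
    group = list(out_link_classes) + [own_class]
    counts = Counter(group)
    top = max(counts.values())
    # Assumption: classifications tied for the largest count are all returned.
    return sorted(cls for cls, n in counts.items() if n == top)


print(classify_with_own_reference(["sports", "news"], "sports"))  # ['sports']
```

When every out-link reference classification equals the page's own, the count trivially returns that single classification, which matches the direct case described above.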
As can be seen, the scheme provided by this embodiment can determine the classification of the webpage to be sorted from the reference classifications of the out-link web pages and the reference classification of the webpage to be sorted. That is, building on the embodiment shown in Fig. 1, this embodiment refers not only to the reference classifications of the out-link web pages but also to the reference classification of the webpage to be sorted when determining its classification, and can therefore further improve the accuracy of the determined web page classification.
Fig. 3 is a schematic structural diagram of a device for determining a web page classification provided by an embodiment of the present application. The device is applied to an electronic device and corresponds to the method embodiment shown in Fig. 1. The device includes:
a webpage determining module 301, configured to determine a webpage to be sorted;
an out-link obtaining module 302, configured to obtain the out-link web pages of the webpage to be sorted, where an out-link web page is a web page whose content contains the address of the webpage to be sorted;
a reference determining module 303, configured to determine the reference classification of each out-link web page according to a preset classification mode;
a classification determining module 304, configured to determine the classification of the webpage to be sorted according to the determined reference classifications of the out-link web pages.
In an implementation based on the embodiment shown in Fig. 3, the out-link obtaining module 302 may be specifically configured to:
obtain the out-link web pages of the webpage to be sorted from a preset out-link web page relation store, where the out-link web page relation store is used to store each web page and its corresponding out-link web pages.
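One way to picture the out-link web page relation store is as a mapping from a page address to the addresses of the pages whose content contains it. The following is a minimal sketch under that assumption; the in-memory dictionary, the function names `build_relation_store` and `get_outlink_pages`, and the input format are all hypothetical, since the text does not specify how the store is built or persisted.

```python
from collections import defaultdict


def build_relation_store(pages):
    """Build a store mapping each page address to its out-link web pages.

    `pages` maps a page address to the set of addresses found in that
    page's content (hypothetical input, e.g. produced by a crawler).
    """
    store = defaultdict(list)
    for source_url, linked_urls in pages.items():
        for target_url in linked_urls:
            # source_url contains target_url in its content, so
            # source_url is an out-link web page of target_url.
            store[target_url].append(source_url)
    return store


def get_outlink_pages(url, store):
    """Look up the out-link web pages of the webpage to be sorted."""
    return list(store.get(url, []))
```

A page that no stored page refers to simply yields an empty list, which the caller would need to handle before classification.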
In an implementation based on the embodiment shown in Fig. 3, the reference determining module 303 is a first determining module or a second determining module (not shown in the figure);
the first determining module is configured to obtain the address of each out-link web page, extract the corresponding address feature from the address of each out-link web page, and determine the reference classification of each out-link web page according to the obtained address features and a preset correspondence between address features and classifications;
the second determining module is configured to obtain the web page information corresponding to each out-link web page, and determine the reference classification of each out-link web page according to the obtained web page information and a preset classifying dictionary.
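The two alternatives for the reference determining module can be illustrated roughly as follows. This is a hedged sketch only: the section does not define what an address feature is or how the classifying dictionary is matched, so the sketch assumes that an address feature is a label of the host name and that the dictionary maps each classification to keywords counted in the page text. The tables `ADDRESS_FEATURE_TO_CLASS` and `CLASSIFYING_DICTIONARY` and both function names are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical correspondence between address features and classifications.
ADDRESS_FEATURE_TO_CLASS = {"news": "news", "sports": "sports", "shop": "shopping"}

# Hypothetical classifying dictionary: classification -> keywords.
CLASSIFYING_DICTIONARY = {
    "news": ["report", "headline"],
    "sports": ["match", "league"],
}


def classify_by_address(url):
    """First alternative: derive the reference classification from the address."""
    host = urlparse(url).hostname or ""
    # Assumption: each dot-separated host label is a candidate address feature.
    for label in host.split("."):
        if label in ADDRESS_FEATURE_TO_CLASS:
            return ADDRESS_FEATURE_TO_CLASS[label]
    return None  # no known address feature found


def classify_by_dictionary(page_text):
    """Second alternative: derive the reference classification from page text."""
    text = page_text.lower()
    # Score each classification by how often its keywords occur in the text.
    scores = {
        classification: sum(text.count(keyword) for keyword in keywords)
        for classification, keywords in CLASSIFYING_DICTIONARY.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Both functions return `None` when nothing matches, leaving it to the caller to decide whether such an out-link web page should be skipped.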
In an implementation based on the embodiment shown in Fig. 3, the classification determining module 304 is one of a third determining module, a fourth determining module, and a fifth determining module (not shown in the figure);
the third determining module is configured to determine the determined reference classification of the out-link web pages as the classification of the webpage to be sorted; or
the fourth determining module is configured to determine the first occurrence number of each classification in the classifying dictionary within a first reference classification group, where the first reference classification group includes the determined reference classifications of the out-link web pages, and to determine the classification with the largest first occurrence number as the classification of the webpage to be sorted; or
the fifth determining module is configured to obtain the website weight corresponding to each out-link web page, determine the obtained website weight as the weight of that out-link web page, and determine the classification of the webpage to be sorted according to the determined reference classifications of the out-link web pages and the weights of the out-link web pages.
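The weighted alternative handled by the fifth determining module might look like the sketch below. The text here only states that the classification is determined from the reference classifications and the weights; summing the weights per classification and taking the largest total is an assumption made for illustration, and the function name `classify_by_weight` is hypothetical.

```python
from collections import defaultdict


def classify_by_weight(outlinks):
    """Determine a classification from (reference_classification, weight) pairs.

    Assumption: the weights of out-link web pages sharing a reference
    classification are summed, and the largest total wins.
    """
    totals = defaultdict(float)
    for reference_classification, website_weight in outlinks:
        totals[reference_classification] += website_weight
    if not totals:
        return None  # no out-link web pages available
    return max(totals, key=totals.get)
```

Under this assumption a single out-link web page from a heavily weighted website can outvote several pages from lightly weighted websites, which is the practical difference from the plain occurrence count used by the fourth determining module.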
In an implementation based on the embodiment shown in Fig. 3, the classification determining module 304 may include:
a determination submodule (not shown in the figure), configured to determine the reference classification of the webpage to be sorted according to the classification mode;
a classification submodule (not shown in the figure), configured to determine the classification of the webpage to be sorted according to the determined reference classifications of the out-link web pages and the reference classification of the webpage to be sorted.
In an implementation based on the embodiment shown in Fig. 3, the above classification submodule may be specifically configured to:
determine the second occurrence number of each classification in the classifying dictionary within a second reference classification group, where the second reference classification group includes the reference classifications of the out-link web pages and the reference classification of the webpage to be sorted, and determine the classification with the largest second occurrence number as the classification of the webpage to be sorted.
Because the above device embodiment is derived from the method embodiment, it has the same technical effect as the method, so the technical effect of the device embodiment is not repeated here.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a related manner. For the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the device embodiment is described relatively briefly because it is substantially similar to the method embodiment; for relevant details, refer to the description of the method embodiment.
The foregoing describes only preferred embodiments of the present application and is not intended to limit its protection scope. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (12)

1. A method for determining a web page classification, characterized in that the method comprises:
determining a webpage to be sorted;
obtaining out-link web pages of the webpage to be sorted, wherein an out-link web page is a web page whose content contains the address of the webpage to be sorted;
determining a reference classification of each out-link web page according to a preset classification mode;
determining the classification of the webpage to be sorted according to the determined reference classifications of the out-link web pages.
2. The method according to claim 1, characterized in that the step of obtaining the out-link web pages of the webpage to be sorted comprises:
obtaining the out-link web pages of the webpage to be sorted from a preset out-link web page relation store, wherein the out-link web page relation store is used to store each web page and its corresponding out-link web pages.
3. The method according to claim 2, characterized in that the step of determining the reference classification of each out-link web page according to the preset classification mode comprises:
obtaining the address of each out-link web page, extracting the corresponding address feature from the address of each out-link web page, and determining the reference classification of each out-link web page according to the obtained address features and a preset correspondence between address features and classifications; or
obtaining the web page information corresponding to each out-link web page, and determining the reference classification of each out-link web page according to the obtained web page information and a preset classifying dictionary.
4. The method according to claim 3, characterized in that the step of determining the classification of the webpage to be sorted according to the determined reference classifications of the out-link web pages comprises:
determining the determined reference classification of the out-link web pages as the classification of the webpage to be sorted; or
determining the first occurrence number of each classification in the classifying dictionary within a first reference classification group, wherein the first reference classification group includes the determined reference classifications of the out-link web pages, and determining the classification with the largest first occurrence number as the classification of the webpage to be sorted; or
obtaining the website weight corresponding to each out-link web page, determining the obtained website weight as the weight of that out-link web page, and determining the classification of the webpage to be sorted according to the determined reference classifications of the out-link web pages and the weights of the out-link web pages.
5. The method according to claim 1, characterized in that the step of determining the classification of the webpage to be sorted according to the determined reference classifications of the out-link web pages comprises:
determining the reference classification of the webpage to be sorted according to the classification mode;
determining the classification of the webpage to be sorted according to the determined reference classifications of the out-link web pages and the reference classification of the webpage to be sorted.
6. The method according to claim 5, characterized in that the step of determining the classification of the webpage to be sorted according to the determined reference classifications of the out-link web pages and the reference classification of the webpage to be sorted comprises:
determining the second occurrence number of each classification in the classifying dictionary within a second reference classification group, wherein the second reference classification group includes the reference classifications of the out-link web pages and the reference classification of the webpage to be sorted, and determining the classification with the largest second occurrence number as the classification of the webpage to be sorted.
7. A device for determining a web page classification, characterized in that the device comprises:
a webpage determining module, configured to determine a webpage to be sorted;
an out-link obtaining module, configured to obtain out-link web pages of the webpage to be sorted, wherein an out-link web page is a web page whose content contains the address of the webpage to be sorted;
a reference determining module, configured to determine a reference classification of each out-link web page according to a preset classification mode;
a classification determining module, configured to determine the classification of the webpage to be sorted according to the determined reference classifications of the out-link web pages.
8. The device according to claim 7, characterized in that the out-link obtaining module is specifically configured to:
obtain the out-link web pages of the webpage to be sorted from a preset out-link web page relation store, wherein the out-link web page relation store is used to store each web page and its corresponding out-link web pages.
9. The device according to claim 8, characterized in that the reference determining module is a first determining module or a second determining module;
the first determining module is configured to obtain the address of each out-link web page, extract the corresponding address feature from the address of each out-link web page, and determine the reference classification of each out-link web page according to the obtained address features and a preset correspondence between address features and classifications;
the second determining module is configured to obtain the web page information corresponding to each out-link web page, and determine the reference classification of each out-link web page according to the obtained web page information and a preset classifying dictionary.
10. The device according to claim 9, characterized in that the classification determining module is one of a third determining module, a fourth determining module, and a fifth determining module;
the third determining module is configured to determine the determined reference classification of the out-link web pages as the classification of the webpage to be sorted; or
the fourth determining module is configured to determine the first occurrence number of each classification in the classifying dictionary within a first reference classification group, wherein the first reference classification group includes the determined reference classifications of the out-link web pages, and to determine the classification with the largest first occurrence number as the classification of the webpage to be sorted; or
the fifth determining module is configured to obtain the website weight corresponding to each out-link web page, determine the obtained website weight as the weight of that out-link web page, and determine the classification of the webpage to be sorted according to the determined reference classifications of the out-link web pages and the weights of the out-link web pages.
11. The device according to claim 7, characterized in that the classification determining module comprises:
a determination submodule, configured to determine the reference classification of the webpage to be sorted according to the classification mode;
a classification submodule, configured to determine the classification of the webpage to be sorted according to the determined reference classifications of the out-link web pages and the reference classification of the webpage to be sorted.
12. The device according to claim 11, characterized in that the classification submodule is specifically configured to:
determine the second occurrence number of each classification in the classifying dictionary within a second reference classification group, wherein the second reference classification group includes the reference classifications of the out-link web pages and the reference classification of the webpage to be sorted, and determine the classification with the largest second occurrence number as the classification of the webpage to be sorted.
CN201710326233.4A 2017-05-10 2017-05-10 A kind of determination method and device of Web page classifying Pending CN107545020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710326233.4A CN107545020A (en) 2017-05-10 2017-05-10 A kind of determination method and device of Web page classifying

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710326233.4A CN107545020A (en) 2017-05-10 2017-05-10 A kind of determination method and device of Web page classifying

Publications (1)

Publication Number Publication Date
CN107545020A true CN107545020A (en) 2018-01-05

Family

ID=60966852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710326233.4A Pending CN107545020A (en) 2017-05-10 2017-05-10 A kind of determination method and device of Web page classifying

Country Status (1)

Country Link
CN (1) CN107545020A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102473190A (en) * 2009-07-30 2012-05-23 阿尔卡特朗讯 Keyword assignment to a web page
CN102819597A (en) * 2012-08-13 2012-12-12 北京星网锐捷网络技术有限公司 Web page classification method and equipment
CN102955810A (en) * 2011-08-26 2013-03-06 中国移动通信集团公司 Webpage classification method and device
CN106250402A (en) * 2016-07-19 2016-12-21 杭州华三通信技术有限公司 A kind of Website classification method and device
CN106339459A (en) * 2016-08-26 2017-01-18 中国科学院信息工程研究所 Method for pre-classifying Chinese webpages based on keyword matching

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256104A (en) * 2018-02-05 2018-07-06 恒安嘉新(北京)科技股份公司 Internet site compressive classification method based on multidimensional characteristic
CN108256104B (en) * 2018-02-05 2020-05-26 恒安嘉新(北京)科技股份公司 Comprehensive classification method of internet websites based on multidimensional characteristics
CN109977327A (en) * 2019-03-20 2019-07-05 新华三信息安全技术有限公司 A kind of Web page classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180105