CN107545020A - Method and device for determining the classification of a web page
- Publication number
- CN107545020A (application CN201710326233.4A)
- Authority
- CN
- China
- Prior art keywords
- web page
- classification
- webpage
- sorted
- link web
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present application provide a method and device for determining the classification of a web page, and relate to the field of network communication technology. The method includes: determining a web page to be classified; obtaining the out-link web pages of the web page to be classified, where an out-link web page is a web page whose content contains the address of the web page to be classified; determining the reference classification of each out-link web page according to a preset classification mode; and determining the classification of the web page to be classified according to the determined reference classifications of the out-link web pages. The scheme provided by the embodiments of the present application can improve the accuracy of the determined web page classification.
Description
Technical field
The present application relates to the field of network communication technology, and in particular to a method and device for determining the classification of a web page.
Background
The number of web pages on the network is very large, and these pages may belong to various types, such as news, education, sports, and shopping. Web page classification is currently applied in many scenarios, for example web page filtering or building a web page classification library. When applied to web page filtering, the classification of a web page usually has to be determined first, and the page is then filtered according to that classification.
In the prior art, when determining the classification of a web page, the title information of the web page to be classified is first obtained and then matched against a preset classification dictionary, and the classification to which the page belongs is determined from the matching result. The classification dictionary generally stores each classification and the keywords of each classification.
In general, the above method can accurately determine the classification of ordinary web pages. However, the title information of many web pages is now worded very broadly, so that it does not reflect the type of the page well. In such cases, classifying these pages with the above method may introduce errors, and the accuracy of the determined classification is not high enough.
Summary of the invention
The purpose of the embodiments of the present application is to provide a method and device for determining the classification of a web page, so as to improve the accuracy of the determined classification. The specific technical solutions are as follows.
To achieve the above purpose, an embodiment of the present application discloses a method for determining the classification of a web page, the method comprising:
determining a web page to be classified;
obtaining the out-link web pages of the web page to be classified, wherein an out-link web page is a web page whose content contains the address of the web page to be classified;
determining the reference classification of each out-link web page according to a preset classification mode;
determining the classification of the web page to be classified according to the determined reference classifications of the out-link web pages.
To achieve the above purpose, an embodiment of the present application discloses a device for determining the classification of a web page, the device comprising:
a web page determining module, configured to determine a web page to be classified;
an out-link obtaining module, configured to obtain the out-link web pages of the web page to be classified, wherein an out-link web page is a web page whose content contains the address of the web page to be classified;
a reference determining module, configured to determine the reference classification of each out-link web page according to a preset classification mode;
a classification determining module, configured to determine the classification of the web page to be classified according to the determined reference classifications of the out-link web pages.
The method and device provided by the embodiments of the present application obtain the out-link web pages of the web page to be classified, determine the reference classification of each out-link web page according to a preset classification mode, and determine the classification of the web page to be classified according to those reference classifications. Because the out-link web pages are usually pages associated with the web page to be classified, determining its classification from the reference classifications of the out-link web pages can improve the accuracy of the determined classification compared with classifying directly from the web page to be classified itself.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow diagram of the method for determining the classification of a web page provided by an embodiment of the present application;
Fig. 2 is a schematic flow diagram of step S104 in Fig. 1;
Fig. 3 is a schematic structural diagram of the device for determining the classification of a web page provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The embodiments of the present application provide a method and device for determining the classification of a web page, which can improve the accuracy of the determined classification. The present application is described in detail below through specific embodiments.
Fig. 1 is a schematic flow diagram of the method for determining the classification of a web page provided by an embodiment of the present application. The method is applied to an electronic device, which may be a gateway device such as a router or switch, or an ordinary computer, tablet computer, smartphone, or similar device. The method comprises the following steps:
Step S101: Determine the web page to be classified.
The web page to be classified may be determined ad hoc or selected from a preset web page library, which stores web pages. Specifically, this embodiment may determine the web page to be classified by determining its address. The address of the web page to be classified includes a uniform resource locator (Uniform Resource Locator, URL) address. All web page addresses mentioned below may include URL addresses.
Step S102: Obtain the out-link web pages of the web page to be classified, where an out-link web page is a web page whose content contains the address of the web page to be classified.
Specifically, the out-link web pages of the web page to be classified may be obtained from a preset out-link web page relation library. The out-link web page relation library stores each web page and its corresponding out-link web pages. It may also store the address or web page information of each out-link web page. As an example, Table 1 lists web pages together with their out-link web pages and the address and web page information of each out-link web page.
Table 1
Web page | Out-link web page | Address of the out-link web page | Web page information of the out-link web page |
Webpage 1 | Webpage 4 | abc.com | A recruitment website |
Webpage 1 | Webpage 5 | Sdc.gov | Provides human resources services |
Webpage 1 | Webpage 6 | Syds.com | More professional personnel recruitment |
Webpage 2 | Webpage 3 | 112.com | Provides the fastest, most professional sports news |
Webpage 2 | Webpage 7 | Yyy.com | Sports event coverage |
Webpage 2 | Webpage 8 | A11.com | Follow the Masters tournament |
With the out-link web page relation library shown in Table 1, when the web page to be classified is webpage 1, Table 1 shows that its out-link web pages are webpage 4, webpage 5, and webpage 6. It can be understood that obtaining the out-link web pages from a preset relation library improves the efficiency of obtaining them.
Specifically, the out-link web page relation library may be built in advance as follows: obtain the in-link web pages of each sample web page and generate the correspondence between sample web pages and in-link web pages, where an in-link web page of a sample web page is the page corresponding to the address of another web page that appears in the content of the sample web page. From the point of view of an in-link web page, its corresponding sample web page is one of its out-link web pages. Accordingly, in an out-link web page relation library such as the one in Table 1, the entries in the "Web page" column are in-link web pages and the entries in the "Out-link web page" column are sample web pages. The sample web pages may be the pages listed on a web navigation site, for example hao123 site navigation, Sogou site navigation, or 2345 site navigation.
It can be understood that each in-link web page found can in turn be used as a sample web page, and its own in-link web pages can then be determined, so as to build up more out-link relations.
As an example, suppose the sample web pages are webpage 1, webpage 2, ..., webpage 1000, and take webpage 1 as an illustration. The content of webpage 1 is fetched with crawler technology, and the web page addresses a.com, d.com, c.com, e.com, and so on are extracted from that content; the extracted addresses do not include the address of webpage 1 itself. Assuming the pages corresponding to these addresses are webpage 21, webpage 30, webpage 33, and webpage 55 respectively, "webpage 21, webpage 30, webpage 33, webpage 55" are the in-link web pages of webpage 1. The web page information of webpage 1 can also be extracted and stored at the same time. Performing these operations on webpage 1, webpage 2, ..., webpage 1000 yields a correspondence between sample web pages and in-link web pages such as the one shown in Table 2.
Table 2
Sample web page | In-link web pages |
Webpage 1 | Webpage 21, webpage 30, webpage 33, webpage 55 |
Webpage 2 | Webpage 5 |
Webpage 3 | Webpage 6, webpage 2 |
Webpage 4 | Webpage 1, webpage 30, webpage 33, webpage 55 |
Webpage 5 | Webpage 1, webpage 90 |
Webpage 6 | Webpage 1, webpage 70 |
…… | …… |
Webpage 1000 | Webpage 700, webpage 20, webpage 303, webpage 57 |
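The extraction step described before Table 2 (collect the addresses that appear in a sample page's content, excluding the page's own address) might look roughly like the sketch below. It is only an illustration: the names `LinkCollector` and `extract_in_link_addresses`, the sample HTML, and the choice to reduce each address to its host name are assumptions, not details given in the application.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_in_link_addresses(own_address, html):
    """Return the distinct hosts linked from `html`, excluding the page's own host."""
    collector = LinkCollector()
    collector.feed(html)
    hosts = []
    for href in collector.hrefs:
        host = urlparse(href).netloc
        if host and host != own_address and host not in hosts:
            hosts.append(host)
    return hosts


# Hypothetical page content standing in for a crawled sample web page.
sample_html = (
    '<a href="http://a.com/">A</a>'
    '<a href="http://d.com/x">D</a>'
    '<a href="http://page1.example/about">self</a>'
    '<a href="http://a.com/other">A again</a>'
)
in_links = extract_in_link_addresses("page1.example", sample_html)
```

In this sketch the crawl itself is left out; only the parsing of already-fetched content is shown.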
Once the correspondence between sample web pages and in-link web pages shown in Table 2 has been obtained, each web page and its out-link web pages can be derived from it. For example, to obtain the out-link web pages of webpage 1, the in-link column of Table 2 is searched for webpage 1. The search shows that webpage 1 appears among the in-link web pages of webpage 4, webpage 5, and webpage 6, so webpage 4, webpage 5, and webpage 6 are the out-link web pages of webpage 1. The out-link web pages of other pages can be determined by the same process.
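The lookup just described amounts to inverting the sample-page-to-in-link mapping. A minimal sketch, assuming the first rows of Table 2 are held in an in-memory dictionary (the names `IN_LINKS` and `build_out_link_index` are illustrative only):

```python
from collections import defaultdict

# Rows of Table 2: sample web page -> its in-link web pages.
IN_LINKS = {
    "webpage 1": ["webpage 21", "webpage 30", "webpage 33", "webpage 55"],
    "webpage 2": ["webpage 5"],
    "webpage 3": ["webpage 6", "webpage 2"],
    "webpage 4": ["webpage 1", "webpage 30", "webpage 33", "webpage 55"],
    "webpage 5": ["webpage 1", "webpage 90"],
    "webpage 6": ["webpage 1", "webpage 70"],
}


def build_out_link_index(in_links):
    """Invert the mapping: each in-link page gets the sample pages that link to it."""
    out_links = defaultdict(list)
    for sample_page, linked_pages in in_links.items():
        for linked in linked_pages:
            out_links[linked].append(sample_page)
    return dict(out_links)


out_link_index = build_out_link_index(IN_LINKS)
```

With these rows, `out_link_index["webpage 1"]` contains webpage 4, webpage 5, and webpage 6, matching the example in the text.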
As another embodiment, the out-link web pages of the web page to be classified may also be obtained directly as follows: obtain the in-link web pages of each preset sample web page, and take every sample web page whose in-link web pages include the web page to be classified as an out-link web page of the web page to be classified.
For how to determine which sample web pages have the web page to be classified among their in-link web pages, refer to the process of determining out-link web pages from Table 2 described above; it is not repeated here.
Because a massive number of web pages exist in practice, and to make this embodiment easier to implement, in one embodiment only some web pages and their corresponding out-link web pages are stored in the out-link web page relation library in advance. Step S102 then specifically includes: judging whether the out-link web pages of the web page to be classified exist in the out-link web page relation library; if so, obtaining them from the relation library; if not, obtaining the out-link web pages of the web page to be classified directly and adding them to the relation library. In this way, when the out-link web pages of the same page are needed again later, they can be obtained directly from the relation library.
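This look-up-then-fall-back behaviour of step S102 can be sketched as a small cache. The function name `get_out_link_pages`, the dictionary used as the relation library, and the `fetch_directly` callback are assumptions made for illustration; the application does not prescribe a data structure.

```python
def get_out_link_pages(page, relation_library, fetch_directly):
    """Return the out-link pages of `page`, consulting the relation library first.

    relation_library: dict mapping a page to its list of out-link pages.
    fetch_directly: callable used only when the library has no entry for `page`
                    (a stand-in for the direct method described above).
    """
    if page in relation_library:
        return relation_library[page]
    found = fetch_directly(page)
    relation_library[page] = found  # store the result for later requests
    return found


# Usage with a fake direct lookup that records how often it is called.
calls = []


def fake_fetch(page):
    calls.append(page)
    return ["webpage 3"]


library = {"webpage 1": ["webpage 4", "webpage 5", "webpage 6"]}
first = get_out_link_pages("webpage 1", library, fake_fetch)
second = get_out_link_pages("webpage 2", library, fake_fetch)
third = get_out_link_pages("webpage 2", library, fake_fetch)
```

After these calls the direct lookup has run once, for webpage 2 only; the second request for webpage 2 is served from the library.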
Step S103: Determine the reference classification of each out-link web page according to a preset classification mode.
Specifically, determining the reference classification of each out-link web page according to a preset classification mode may be implemented in the following ways:
Mode one: Obtain the address of each out-link web page, extract the corresponding address feature from each address, and determine the reference classification of each out-link web page according to the extracted address feature and a preset correspondence between address features and classifications. For example, two such correspondences between address features and classifications are shown in Table 3.
Table 3
In Table 3, the reference classification of an out-link web page can be determined from the correspondence between address feature 1 and classification 1 on the left, or from the correspondence between address feature 2 and classification 2 on the right. For example, when the extracted address feature is .edu, the reference classification of the out-link web page is determined from Table 3 to be "education".
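Mode one can be illustrated with a suffix lookup on the address. Only the `.edu` to "education" pair comes from the text; the `.gov` entry, the mapping name, and the function `classify_by_address` are hypothetical and serve only to make the sketch runnable.

```python
# Address feature -> classification. Only ".edu" is taken from the text;
# ".gov" is a made-up entry for illustration.
ADDRESS_FEATURE_TO_CLASS = {
    ".edu": "education",
    ".gov": "government",
}


def classify_by_address(address, feature_map=ADDRESS_FEATURE_TO_CLASS):
    """Return the classification whose address feature ends the host name, or None."""
    # Drop an optional scheme ("http://") and any path after the host.
    host = address.lower().split("://")[-1].split("/")[0]
    for feature, classification in feature_map.items():
        if host.endswith(feature):
            return classification
    return None


edu_result = classify_by_address("http://school.edu/courses")
unknown_result = classify_by_address("abc.com")
```

An address with no matching feature yields `None` here, in which case another mode (such as the dictionary-based mode below) would be needed.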
Mode two: Obtain the web page information of each out-link web page, and determine the reference classification of each out-link web page according to the obtained web page information and a preset classification dictionary. The web page information may include at least one of the web page title, web page keywords, and web page description. These three kinds of information are commonly called the three elements of a web page and describe information such as the purpose and field of the page. The web page information may of course also include the web page content. Compared with the web page content, the three elements are better summaries and involve less data. The web page information of each out-link web page can be obtained with web crawler technology.
The classification dictionary stores each classification, the words (keywords) each classification contains, and the weight of each word. The weight of a word indicates how close the word is to the classification: the larger the weight, the closer the word is to the classification.
Specifically, in mode two, determining the reference classification of each out-link web page according to the obtained web page information and the preset classification dictionary may, for any one of the out-link web pages (called the target out-link web page for convenience), include the following steps 1 and 2:
Step 1: According to the preset classification dictionary, determine the score $T_i$ of the target out-link web page on the i-th classification in the classification dictionary as follows:
$T_i = \sum_j (W_j \cdot K_j)$
where $W_j$ is the weight of the j-th word contained in the i-th classification of the classification dictionary, and $K_j$ is the number of times the j-th word occurs in the web page information of the target out-link web page.
Step 2: Take the classification with the highest score as the reference classification of the target out-link web page.
As an example, suppose the classification dictionary contains three classifications 1, 2, and 3. The words W1 and W2 of classification 1 have weights 0.6 and 0.4 respectively; the words W3, W4, and W5 of classification 2 have weights 0.3, 0.4, and 0.3 respectively; and the words W6 and W7 of classification 3 have weights 0.7 and 0.3 respectively. Suppose the web page information {information} of out-link web page WebA has been obtained. The reference classification of WebA can then be determined as follows:
First, count the occurrences of each word in the web page information {information} of WebA. The result is that W1, W2, W3, W4, W5, W6, and W7 occur K1=0, K2=1, K3=1, K4=3, K5=2, K6=0, and K7=1 times respectively.
Second, compute the score of WebA on each classification:
Classification 1: T1 = (weight of W1)*K1 + (weight of W2)*K2 = 0.6*0 + 0.4*1 = 0.4;
Classification 2: T2 = (weight of W3)*K3 + (weight of W4)*K4 + (weight of W5)*K5 = 0.3*1 + 0.4*3 + 0.3*2 = 2.1;
Classification 3: T3 = (weight of W6)*K6 + (weight of W7)*K7 = 0.7*0 + 0.3*1 = 0.3.
Finally, classification 2, which has the highest score, is taken as the reference classification of WebA.
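The WebA calculation above can be reproduced with a short sketch. The dictionary layout, the function names, and the token list standing in for {information} are assumptions for illustration; the tokens are chosen so that the word counts equal K1 to K7 from the example.

```python
from collections import Counter

# Classification dictionary from the example: classification -> {word: weight}.
CLASSIFICATION_DICTIONARY = {
    "classification 1": {"W1": 0.6, "W2": 0.4},
    "classification 2": {"W3": 0.3, "W4": 0.4, "W5": 0.3},
    "classification 3": {"W6": 0.7, "W7": 0.3},
}


def score_page(tokens, dictionary):
    """Compute T_i = sum_j(W_j * K_j) for every classification i."""
    counts = Counter(tokens)  # K_j: occurrences of each word in the page information
    return {
        classification: sum(weight * counts[word] for word, weight in words.items())
        for classification, words in dictionary.items()
    }


def reference_classification(tokens, dictionary):
    """Return the classification with the highest score."""
    scores = score_page(tokens, dictionary)
    return max(scores, key=scores.get)


# Hypothetical tokenised page information giving K2=1, K3=1, K4=3, K5=2, K7=1.
web_a_tokens = ["W2", "W3", "W4", "W4", "W4", "W5", "W5", "W7"]
web_a_scores = score_page(web_a_tokens, CLASSIFICATION_DICTIONARY)
web_a_class = reference_classification(web_a_tokens, CLASSIFICATION_DICTIONARY)
```

Up to floating-point rounding, the scores are 0.4, 2.1, and 0.3, so classification 2 is selected, as in the worked example.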
The classification dictionary may be an existing one; an existing classification dictionary generally stores each classification, the representative words of each classification, and the weight of each word. The classification dictionary may also be created in advance. Specifically, creating the classification dictionary may include:
First, determine the classifications, for example sports, shopping, tourism, and finance. Second, determine the sample web pages for each classification; for example, "Sina Sports", "Sohu Sports", and "Tencent Sports" may be taken as sample web pages for the sports classification. Then obtain the web page information of each sample web page and segment it into words to obtain the candidate words for each sample web page. Finally, select the words for each classification from the candidate words, and determine the weight of each word with a machine learning method according to the number of times the word occurs among all the segmented words. As an example, Table 4 lists the classification names, words, and weights contained in a classification dictionary.
Table 4
No. | Classification name | Word | Weight |
1 | Education | Course | 5.602 |
2 | Education | Reading | 5.678 |
3 | Education | English | 6.272 |
4 | Sports | Table tennis | 6.505 |
5 | Sports | Masters tournament | 6.683 |
As Table 4 shows, the classification dictionary consists of classifications, the words of each classification, and the weights of the words. For example, the education classification contains the three words "course", "reading", and "English"; within that classification "English" has the largest weight and "course" the smallest.
Step S104: Determine the classification of the web page to be classified according to the determined reference classifications of the out-link web pages.
Specifically, determining the classification of the web page to be classified according to the determined reference classifications of the out-link web pages may be implemented in the following ways:
Mode one: The determined reference classifications of the out-link web pages are directly taken as the classification of the web page to be classified.
It can be understood that when there is only one out-link web page, its reference classification can be directly taken as the classification of the web page to be classified. When there are many out-link web pages and their reference classifications are all the same, that reference classification can be directly taken as the classification of the web page to be classified. Alternatively, when there are many out-link web pages and the number of distinct reference classifications among them is less than a preset threshold, the reference classifications can also be directly taken as the classification of the web page to be classified, in which case the page has more than one classification. For example, suppose the web page to be classified has 10 out-link web pages whose reference classifications fall into 2 classifications, and the preset threshold is 3; these 2 reference classifications can then both be taken as classifications of the web page to be classified.
Mode two: Determine the first occurrence count of each classification in the classification dictionary within a first reference classification group, where the first reference classification group consists of the determined reference classifications of the out-link web pages; take the classification with the largest first occurrence count as the classification of the web page to be classified.
It can be understood that when there are many out-link web pages and many distinct reference classifications among them, the number of occurrences of each classification can be counted, and the classification that occurs most often is taken as the classification of the web page to be classified.
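Mode two is essentially a majority vote over the reference classifications. A minimal sketch follows, assuming the reference classifications are available as a list of strings; the function name and sample data are illustrative, and the application does not say how ties should be handled, so this sketch simply returns one of the most frequent classifications.

```python
from collections import Counter


def majority_classification(reference_classifications):
    """Return a classification that occurs most often in the group."""
    counts = Counter(reference_classifications)
    classification, _count = counts.most_common(1)[0]
    return classification


# Hypothetical reference classifications of five out-link web pages.
group = ["sports", "education", "sports", "shopping", "sports"]
voted = majority_classification(group)
```

For this group, "sports" occurs three times and is chosen.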
Mode three: Obtain the website weight corresponding to each out-link web page, take the obtained website weight as the weight of that out-link web page, and determine the classification of the web page to be classified according to the determined reference classifications of the out-link web pages and the weights of the out-link web pages.
The website weight corresponding to each out-link web page can be obtained directly with dedicated website tools, for example the tools provided by sites such as Aizhan (爱站网) and Webmaster Tools (站长工具). Note that one website may contain multiple web pages, and all web pages belonging to the same website share the same website weight. The website weight is an authority value that search engines assign to a website; it is related to factors such as site structure, domain name type, inbound links, page content, number of indexed pages, keyword ranking, and update frequency.
It can be understood that the higher the website weight, the larger the weight of the out-link web page, and the lower the website weight, the smaller the weight of the out-link web page. Note that the website weight indirectly reflects how well the website hosting the page is maintained: a higher website weight suggests a better-maintained website, more accurate web page information for its pages, and therefore a more accurate reference classification when it is determined from that information. In other words, the larger the weight of an out-link web page, the more credible its reference classification.
Specifically, determining the classification of the web page to be classified according to the determined reference classifications and the weights of the out-link web pages may include the following steps 1 and 2:
Step 1: Determine the total score $O_i$ of the web page to be classified on the i-th classification according to $O_i = \sum_n (y_n \cdot M_n)$, where $M_n$ is the weight of the n-th out-link web page, and $y_n$ is 1 when the reference classification of the n-th out-link web page is the i-th classification and 0 otherwise.
Step 2: Take the classification with the largest total score as the classification of the web page to be classified.
As an example, suppose the out-link web pages of the web page to be classified are WebA, WebB, and WebC, with weights 0.4, 0.3, and 0.3 respectively, and the classification dictionary contains three classifications 1, 2, and 3. The reference classifications of WebA are classification 1 and classification 2, those of WebB are classification 1 and classification 3, and that of WebC is classification 2. To determine the classification of the web page to be classified, the total score of each classification is computed first:
Classification 1: O1 = (weight of WebA)*1 + (weight of WebB)*1 + (weight of WebC)*0 = 0.4*1 + 0.3*1 + 0.3*0 = 0.7;
Classification 2: O2 = (weight of WebA)*1 + (weight of WebB)*0 + (weight of WebC)*1 = 0.4*1 + 0.3*0 + 0.3*1 = 0.7;
Classification 3: O3 = (weight of WebA)*0 + (weight of WebB)*1 + (weight of WebC)*0 = 0.4*0 + 0.3*1 + 0.3*0 = 0.3.
From these results, classification 1 and classification 2 have the largest total score, so classification 1 and classification 2 are taken as the classifications of the web page to be classified.
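The weighted vote in the WebA/WebB/WebC example can be sketched as follows. The function name, the input layout (a list of weight and reference-classification pairs), and the use of a floating-point tolerance to detect ties are assumptions made for illustration.

```python
import math


def weighted_classification(out_link_pages):
    """Compute O_i = sum_n(y_n * M_n) and return (totals, tied top classifications).

    out_link_pages: list of (weight, reference_classifications) pairs, one per
    out-link web page.
    """
    totals = {}
    for weight, classifications in out_link_pages:
        for classification in classifications:
            # y_n is 1 only for classifications this page was assigned to.
            totals[classification] = totals.get(classification, 0.0) + weight
    best = max(totals.values())
    winners = sorted(c for c, total in totals.items() if math.isclose(total, best))
    return totals, winners


pages = [
    (0.4, ["classification 1", "classification 2"]),  # WebA
    (0.3, ["classification 1", "classification 3"]),  # WebB
    (0.3, ["classification 2"]),                      # WebC
]
totals, winners = weighted_classification(pages)
```

Here classification 1 and classification 2 both total about 0.7 and classification 3 totals 0.3, so both tied classifications are returned, as in the example.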
As can be seen from the above, the method and device provided by this embodiment obtain the out-link web pages of the web page to be classified, determine the reference classification of each out-link web page according to a preset classification mode, and determine the classification of the web page to be classified according to those reference classifications. Because the out-link web pages are usually pages associated with the web page to be classified, determining its classification from the reference classifications of the out-link web pages can improve the accuracy of the determined classification compared with classifying directly from the web page to be classified itself.
It can be understood that when the web page to be classified has many out-link web pages, determining its classification from the reference classifications of those pages improves the accuracy of the determined classification from a big-data perspective.
In an embodiment based on the embodiment shown in Fig. 1, step S104, that is, the step of determining the classification of the web page to be classified according to the determined reference classifications of the out-link web pages, may be carried out according to the flow diagram shown in Fig. 2 and specifically includes steps S104A and S104B:
Step S104A: Determine the reference classification of the web page to be classified according to the above classification mode.
To improve the accuracy of the determined classification, this embodiment first determines the reference classification of the web page to be classified with the same classification mode as in step S103, and then makes the final determination from the reference classifications of the out-link web pages together with the reference classification of the web page to be classified.
Step S104B: Determine the classification of the web page to be classified according to the determined reference classifications of the out-link web pages and the reference classification of the web page to be classified.
Specifically, when the reference classifications of the out-link web pages are the same as the reference classification of the web page to be classified, that reference classification is directly taken as the classification of the web page to be classified.
When the reference classifications of the out-link web pages differ from the reference classification of the web page to be classified, the classification of the web page to be classified can be determined as follows:
Determine the second occurrence count of each classification in the classification dictionary within a second reference classification group, where the second reference classification group consists of the reference classifications of the out-link web pages and the reference classification of the web page to be classified, and take the classification with the largest second occurrence count as the classification of the web page to be classified. The classification dictionary here is the one mentioned in step S103.
It can be understood that when the second reference classification group contains many distinct reference classifications, the number of occurrences of each classification can be counted, and the classification that occurs most often is taken as the classification of the web page to be classified.
It can be seen that the scheme of this embodiment determines the classification of the web page to be classified from both the reference classifications of the out-link web pages and the reference classification of the web page to be classified. That is, on the basis of the embodiment shown in Fig. 1, this embodiment considers not only the reference classifications of the out-link web pages but also the reference classification of the web page to be classified itself, and can therefore further improve the accuracy of the determined classification.
Fig. 3 is a schematic structural diagram of a device for determining web page classification provided by an embodiment of the present application. The device is applied to an electronic device and corresponds to the method embodiment shown in Fig. 1. The device includes:
a webpage determining module 301, configured to determine a webpage to be sorted;
an out-link obtaining module 302, configured to obtain the out-link web pages of the webpage to be sorted, where an out-link web page is a web page whose content contains the address of the webpage to be sorted;
a reference determining module 303, configured to determine the reference classification of each out-link web page according to a preset classification mode;
a classification determining module 304, configured to determine the classification of the webpage to be sorted according to the determined reference classification of each out-link web page.
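As a rough illustration of how the four modules fit together, the following Python sketch wires them into a single pipeline. The class name `ClassificationDevice`, the method `run`, and the choice to model each module as a plain callable are assumptions made for this example; the application does not prescribe any particular software structure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClassificationDevice:
    # Each field stands in for one module of Fig. 3 (hypothetical interfaces).
    determine_webpage: Callable[[], str]                   # module 301
    obtain_outlinks: Callable[[str], List[str]]            # module 302
    determine_reference: Callable[[str], str]              # module 303
    determine_classification: Callable[[List[str]], str]   # module 304

    def run(self) -> str:
        # 301: determine the webpage to be sorted.
        url = self.determine_webpage()
        # 302: obtain the web pages whose content contains that address.
        outlinks = self.obtain_outlinks(url)
        # 303: determine a reference classification for each out-link web page.
        refs = [self.determine_reference(page) for page in outlinks]
        # 304: derive the final classification from the reference classifications.
        return self.determine_classification(refs)
```

Any of the concrete strategies described in the embodiments (address features, classifying dictionary, occurrence counting, weights) could be plugged in as the corresponding callable.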
In an embodiment based on the embodiment shown in Fig. 3, the out-link obtaining module 302 can specifically be configured to:
obtain the out-link web pages of the webpage to be sorted from a preset out-link web page relation library, where the out-link web page relation library is used to store each web page and its corresponding out-link web pages.
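A minimal Python sketch of such a relation library is given here, assuming it is an in-memory mapping from each web page address to the addresses of the pages whose content contains it. The function names, the example URLs, and the plain substring test are all assumptions for illustration; a real system would more likely parse links properly, since a substring test can also match longer addresses that merely share a prefix.

```python
def build_outlink_relation_store(page_contents):
    """Map each page address to the addresses of pages whose content contains it.

    page_contents: dict of {url: page content as text} (hypothetical crawl result).
    """
    store = {url: [] for url in page_contents}
    for source_url, content in page_contents.items():
        for target_url in page_contents:
            # Simplification: a page counts as an out-link web page of
            # target_url if target_url appears anywhere in its content.
            if target_url != source_url and target_url in content:
                store[target_url].append(source_url)
    return store


def get_outlink_pages(store, url):
    """Look up the out-link web pages of the webpage to be sorted."""
    return store.get(url, [])


pages = {
    "http://shop.example/item": "Buy now",
    "http://news.example/a": 'Review: <a href="http://shop.example/item">item</a>',
    "http://blog.example/b": "I liked http://shop.example/item a lot",
}
store = build_outlink_relation_store(pages)
print(get_outlink_pages(store, "http://shop.example/item"))
# prints ['http://news.example/a', 'http://blog.example/b']
```

Building the library ahead of time means the lookup at classification time is a single dictionary access rather than a scan over all crawled pages.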
In an embodiment based on the embodiment shown in Fig. 3, the reference determining module 303 is a first determining module or a second determining module (not shown in the figure);
the first determining module is configured to obtain the address of each out-link web page, extract the corresponding address feature from the address of each out-link web page, and determine the reference classification of each out-link web page according to the obtained address features and a preset correspondence between address features and classifications;
the second determining module is configured to obtain the web page information corresponding to each out-link web page, and determine the reference classification of each out-link web page according to the obtained web page information and a preset classifying dictionary.
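The two alternatives can be sketched in Python as follows. Everything concrete here is an assumption made for illustration: the application does not define what an address feature is or how the classifying dictionary is structured, so this sketch treats address features as tokens of the host name and path, and the classifying dictionary as a mapping from each classification to a few indicative keywords.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical correspondence between address features and classifications.
ADDRESS_FEATURE_TO_CLASS = {
    "news": "news",
    "edu": "education",
    "sports": "sports",
    "shop": "shopping",
}

# Hypothetical classifying dictionary: classification -> indicative keywords.
CLASSIFYING_DICTIONARY = {
    "news": ["headline", "report"],
    "education": ["course", "school"],
    "sports": ["match", "league"],
    "shopping": ["cart", "discount"],
}


def classify_by_address(url, feature_map=ADDRESS_FEATURE_TO_CLASS):
    """First alternative: derive the reference classification from the address."""
    parsed = urlparse(url)
    # Split host name and path into tokens and use them as address features.
    tokens = re.split(r"[./_-]+", (parsed.hostname or "") + parsed.path)
    for token in tokens:
        if token in feature_map:
            return feature_map[token]
    return None  # no known address feature found


def classify_by_dictionary(page_text, dictionary=CLASSIFYING_DICTIONARY):
    """Second alternative: derive the reference classification from page text."""
    if not dictionary:
        return None
    words = Counter(re.findall(r"[a-z]+", page_text.lower()))
    # Score each classification by how often its keywords occur in the text.
    scores = {cls: sum(words[k] for k in kws) for cls, kws in dictionary.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(classify_by_address("http://sports.example.com/league/1"))    # prints "sports"
print(classify_by_dictionary("Late goal decides the league match"))  # prints "sports"
```

The address-based variant needs only the URL and is therefore cheap, while the dictionary-based variant requires the page information to have been fetched first.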
In an embodiment based on the embodiment shown in Fig. 3, the classification determining module 304 is one of a third determining module, a fourth determining module, and a fifth determining module (not shown in the figure);
the third determining module is configured to determine the determined reference classification of each out-link web page as the classification of the webpage to be sorted; or
the fourth determining module is configured to determine the first occurrence number of each classification in the classifying dictionary within a first reference classification group, where the first reference classification group includes the determined reference classification of each out-link web page, and determine the classification with the largest first occurrence number as the classification of the webpage to be sorted; or
the fifth determining module is configured to obtain the website weight corresponding to each out-link web page, determine the obtained website weight as the weight of that out-link web page, and determine the classification of the webpage to be sorted according to the determined reference classification of each out-link web page and the weight of each out-link web page.
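The weighted alternative handled by the fifth determining module can be sketched in Python as follows. The passage above states only that the classification is determined from the reference classifications and the weights; summing the weights per classification and choosing the largest total is an assumption made for this example, as are the function name and the sample weights.

```python
from collections import defaultdict


def classify_by_weight(outlink_refs, outlink_weights):
    """Choose the classification whose out-link web pages carry the most weight.

    outlink_refs:    reference classification of each out-link web page
    outlink_weights: website weight of the site hosting each out-link web page
    """
    if len(outlink_refs) != len(outlink_weights):
        raise ValueError("each out-link web page needs exactly one weight")

    # Assumed aggregation rule: add up the weights per classification.
    totals = defaultdict(float)
    for ref, weight in zip(outlink_refs, outlink_weights):
        totals[ref] += weight

    return max(totals, key=totals.get) if totals else None


refs = ["news", "sports", "sports"]
weights = [0.9, 0.3, 0.4]
print(classify_by_weight(refs, weights))  # prints "news"
```

With these sample values a plain occurrence count would select "sports" (two pages against one), whereas the weighted variant selects "news" because the single news page comes from a site with a higher weight.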
In an embodiment based on the embodiment shown in Fig. 3, the classification determining module 304 can include:
a determining submodule (not shown in the figure), configured to determine the reference classification of the webpage to be sorted according to the classification mode;
a classification submodule (not shown in the figure), configured to determine the classification of the webpage to be sorted according to the determined reference classification of each out-link web page and the reference classification of the webpage to be sorted.
In an embodiment based on the embodiment shown in Fig. 3, the above classification submodule can specifically be configured to:
determine the second occurrence number of each classification in the classifying dictionary within a second reference classification group, where the second reference classification group includes the reference classification of each out-link web page and the reference classification of the webpage to be sorted, and determine the classification with the largest second occurrence number as the classification of the webpage to be sorted.
Because the above device embodiment is derived from the method embodiment, it has the same technical effect as the method, and the technical effect of the device embodiment is therefore not repeated here.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a related manner. For identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for the relevant parts, refer to the description of the method embodiment.
The foregoing describes only preferred embodiments of the present application and is not intended to limit the scope of protection of the present application. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present application falls within the scope of protection of the present application.
Claims (12)
1. A method for determining web page classification, characterized in that the method includes:
determining a webpage to be sorted;
obtaining the out-link web pages of the webpage to be sorted, where an out-link web page is a web page whose content contains the address of the webpage to be sorted;
determining the reference classification of each out-link web page according to a preset classification mode;
determining the classification of the webpage to be sorted according to the determined reference classification of each out-link web page.
2. The method according to claim 1, characterized in that the step of obtaining the out-link web pages of the webpage to be sorted includes:
obtaining the out-link web pages of the webpage to be sorted from a preset out-link web page relation library, where the out-link web page relation library is used to store each web page and its corresponding out-link web pages.
3. The method according to claim 2, characterized in that the step of determining the reference classification of each out-link web page according to a preset classification mode includes:
obtaining the address of each out-link web page, extracting the corresponding address feature from the address of each out-link web page, and determining the reference classification of each out-link web page according to the obtained address features and a preset correspondence between address features and classifications; or,
obtaining the web page information corresponding to each out-link web page, and determining the reference classification of each out-link web page according to the obtained web page information and a preset classifying dictionary.
4. The method according to claim 3, characterized in that the step of determining the classification of the webpage to be sorted according to the determined reference classification of each out-link web page includes:
determining the determined reference classification of each out-link web page as the classification of the webpage to be sorted; or
determining the first occurrence number of each classification in the classifying dictionary within a first reference classification group, where the first reference classification group includes the determined reference classification of each out-link web page, and determining the classification with the largest first occurrence number as the classification of the webpage to be sorted; or
obtaining the website weight corresponding to each out-link web page, determining the obtained website weight as the weight of that out-link web page, and determining the classification of the webpage to be sorted according to the determined reference classification of each out-link web page and the weight of each out-link web page.
5. The method according to claim 1, characterized in that the step of determining the classification of the webpage to be sorted according to the determined reference classification of each out-link web page includes:
determining the reference classification of the webpage to be sorted according to the classification mode;
determining the classification of the webpage to be sorted according to the determined reference classification of each out-link web page and the reference classification of the webpage to be sorted.
6. The method according to claim 5, characterized in that the step of determining the classification of the webpage to be sorted according to the determined reference classification of each out-link web page and the reference classification of the webpage to be sorted includes:
determining the second occurrence number of each classification in the classifying dictionary within a second reference classification group, where the second reference classification group includes the reference classification of each out-link web page and the reference classification of the webpage to be sorted, and determining the classification with the largest second occurrence number as the classification of the webpage to be sorted.
7. A device for determining web page classification, characterized in that the device includes:
a webpage determining module, configured to determine a webpage to be sorted;
an out-link obtaining module, configured to obtain the out-link web pages of the webpage to be sorted, where an out-link web page is a web page whose content contains the address of the webpage to be sorted;
a reference determining module, configured to determine the reference classification of each out-link web page according to a preset classification mode;
a classification determining module, configured to determine the classification of the webpage to be sorted according to the determined reference classification of each out-link web page.
8. The device according to claim 7, characterized in that the out-link obtaining module is specifically configured to:
obtain the out-link web pages of the webpage to be sorted from a preset out-link web page relation library, where the out-link web page relation library is used to store each web page and its corresponding out-link web pages.
9. The device according to claim 8, characterized in that the reference determining module is a first determining module or a second determining module;
the first determining module is configured to obtain the address of each out-link web page, extract the corresponding address feature from the address of each out-link web page, and determine the reference classification of each out-link web page according to the obtained address features and a preset correspondence between address features and classifications;
the second determining module is configured to obtain the web page information corresponding to each out-link web page, and determine the reference classification of each out-link web page according to the obtained web page information and a preset classifying dictionary.
10. The device according to claim 9, characterized in that the classification determining module is one of a third determining module, a fourth determining module, and a fifth determining module;
the third determining module is configured to determine the determined reference classification of each out-link web page as the classification of the webpage to be sorted; or
the fourth determining module is configured to determine the first occurrence number of each classification in the classifying dictionary within a first reference classification group, where the first reference classification group includes the determined reference classification of each out-link web page, and determine the classification with the largest first occurrence number as the classification of the webpage to be sorted; or
the fifth determining module is configured to obtain the website weight corresponding to each out-link web page, determine the obtained website weight as the weight of that out-link web page, and determine the classification of the webpage to be sorted according to the determined reference classification of each out-link web page and the weight of each out-link web page.
11. The device according to claim 7, characterized in that the classification determining module includes:
a determining submodule, configured to determine the reference classification of the webpage to be sorted according to the classification mode;
a classification submodule, configured to determine the classification of the webpage to be sorted according to the determined reference classification of each out-link web page and the reference classification of the webpage to be sorted.
12. The device according to claim 11, characterized in that the classification submodule is specifically configured to:
determine the second occurrence number of each classification in the classifying dictionary within a second reference classification group, where the second reference classification group includes the reference classification of each out-link web page and the reference classification of the webpage to be sorted, and determine the classification with the largest second occurrence number as the classification of the webpage to be sorted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710326233.4A CN107545020A (en) | 2017-05-10 | 2017-05-10 | A kind of determination method and device of Web page classifying |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107545020A true CN107545020A (en) | 2018-01-05 |
Family
ID=60966852
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710326233.4A Pending CN107545020A (en) | 2017-05-10 | 2017-05-10 | A kind of determination method and device of Web page classifying |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107545020A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102473190A (en) * | 2009-07-30 | 2012-05-23 | 阿尔卡特朗讯 | Keyword assignment to a web page |
CN102819597A (en) * | 2012-08-13 | 2012-12-12 | 北京星网锐捷网络技术有限公司 | Web page classification method and equipment |
CN102955810A (en) * | 2011-08-26 | 2013-03-06 | 中国移动通信集团公司 | Webpage classification method and device |
CN106250402A (en) * | 2016-07-19 | 2016-12-21 | 杭州华三通信技术有限公司 | A kind of Website classification method and device |
CN106339459A (en) * | 2016-08-26 | 2017-01-18 | 中国科学院信息工程研究所 | Method for pre-classifying Chinese webpages based on keyword matching |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108256104A (en) * | 2018-02-05 | 2018-07-06 | 恒安嘉新(北京)科技股份公司 | Internet site compressive classification method based on multidimensional characteristic |
CN108256104B (en) * | 2018-02-05 | 2020-05-26 | 恒安嘉新(北京)科技股份公司 | Comprehensive classification method of internet websites based on multidimensional characteristics |
CN109977327A (en) * | 2019-03-20 | 2019-07-05 | 新华三信息安全技术有限公司 | A kind of Web page classification method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180105 |