CN102955810A - Webpage classification method and device - Google Patents

Publication number
CN102955810A
Authority
CN
China
Prior art keywords
url
classification
class library
prediction
last layer
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011102492702A
Other languages
Chinese (zh)
Other versions
CN102955810B (en)
Inventor
徐萌
何洪凌
胡珉
罗治国
孙少陵
陶涛
陈婷
张新访
李成华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chellona Mobile Communications Corp Cmcc
China Mobile Communications Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Application filed by China Mobile Communications Group Co Ltd
Priority to CN201110249270.2A (patent CN102955810B)
Publication of CN102955810A
Application granted
Publication of CN102955810B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a webpage classification method and device. The method includes: building virtual hierarchical URLs (uniform resource locators) from the records in an existing URL class library and predicting the class of each hierarchical URL; when a webpage needs to be classified, querying the URL class library with the URL of that webpage; if no matching URL is found, querying the URL class library with the upper-level URLs of that URL; and, when a matching URL is found, determining the class of the webpage from the predicted class of the URL found. The method and device improve the efficiency and success rate of webpage classification.

Description

Webpage classification method and device
Technical field
The present invention relates to the field of Internet technology, and in particular to a webpage classification method and device.
Background
With the rapid development of mobile Internet technology, the number of mobile Internet users keeps growing, and the analysis of mobile Internet user behavior has gradually become a research focus.
In the prior art, user behavior is usually analyzed from the access logs of mobile Internet users. Specifically, the access log of a mobile Internet user is kept in the WAP (Wireless Application Protocol) gateway and records the URL (Uniform Resource Locator) of each webpage the user visits. Querying a URL class library yields the class of the webpages the user visits, and from that the user's behavioral preferences.
An existing webpage classification method may include the following steps:
1. A crawler fetches the webpage content;
2. The webpage content is parsed to obtain the corresponding text;
3. The text is analyzed to obtain keywords;
4. The page is classified with an algorithm model, for example a text classification model such as naive Bayes or SVM; the model is usually trained in advance on a training set.
The webpages visited by users (or the URLs of those webpages) can be classified by the above method, and a URL class library can then be built. A prior-art URL class library may be as shown in Table 1.
Table 1
Figure BSA00000563462200021
In the course of implementing the present invention, the inventors found at least the following problems in the prior art:
In the prior art, the URL class library is a simple flat data table whose entries are unrelated to one another. To look up the class of a webpage accurately, a large amount of data must be stored and the class library must be updated in real time. Because the Internet grows rapidly and new webpages appear extremely fast, even a URL class library that is updated, or rebuilt, every day cannot hold the classes of all webpages. The available fallback is to crawl and predict in real time, but predicting the class of a single webpage may take on the order of tens of minutes; batch prediction can be parallelized, yet it still takes a long time, at least on the order of hours.
Summary of the invention
The embodiments of the present invention provide a webpage classification method and device to improve the efficiency and success rate of determining the class of a webpage.
To achieve the above object, an embodiment of the present invention provides a webpage classification method applied to a webpage classification flow based on a URL class library. The URL class library records URLs of each level and the predicted class of each URL, where, for URLs of adjacent levels, the upper-level URL is obtained by truncating the lower-level URL. The method comprises:
querying the URL class library with the URL of the webpage to be classified;
if no matching URL is found, querying the URL class library with an upper-level URL of that URL, and, when a matching URL is found, determining the class of the webpage to be classified from the predicted class of the URL found.
An embodiment of the present invention also provides a webpage classification device applied to a webpage classification flow based on a uniform resource locator (URL) class library. The URL class library records URLs of each level and the predicted class of each URL, where, for URLs of adjacent levels, the upper-level URL is obtained by truncating the lower-level URL. The device comprises:
an upper-level URL generation module, configured to generate the upper-level URL of the URL of the webpage to be classified;
a query module, configured to query the URL class library with the URL of the webpage to be classified and, if no matching URL is found, to query the URL class library with the upper-level URL of that URL;
a determination module, configured to determine, when the query module finds a matching URL, the class of the webpage to be classified from the predicted class of the URL found.
Compared with the prior art, the embodiments of the present invention divide URLs into levels, record the URLs of each level in the URL class library, and record the predicted class of each URL alongside it. When the class of a webpage needs to be determined, the URL of that webpage is obtained and the URL class library is checked for it. When the URL class library holds no identical URL, the class of the webpage is determined from the predicted class of an upper-level URL of that URL, which improves the efficiency and success rate of determining the class of a webpage.
Description of drawings
Fig. 1 is a schematic flowchart of URL class library generation according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the webpage classification method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the webpage classification device according to an embodiment of the present invention.
Detailed description of the embodiments
To address the defects of the prior art, the embodiments of the present invention propose a technical solution for webpage classification. In this solution, URLs are divided into levels by truncation: for URLs of adjacent levels, the upper-level URL is obtained by truncating the lower-level URL. Records for upper-level URLs, together with their predicted classes, are added to the existing URL class library (that is, in the embodiments of the present invention the URL class library records each URL, the predicted class of that URL, and the upper-level URL of the adjacent level). When a webpage needs to be classified, the URL class library is queried with the URL of the webpage to be classified; if no matching URL is found, the URL class library is queried with an upper-level URL of that URL, and when a matching URL is found, the class of the webpage is determined from the predicted class of the URL found. In other words, when the URL class library does not record the URL of the webpage to be classified, the record of its upper-level URL is looked up and the predicted class of that upper-level URL is used as the predicted class of the webpage, which improves the efficiency and success rate of determining the class of a webpage.
Dividing URLs into levels by truncation may specifically be implemented as follows:
The URL is divided into levels at the separator "/". Separators are located by searching backwards from the end of the URL, and the part of the URL up to the predetermined number (for example, 1) of "/" separators counted back from the end is taken as the upper-level URL of the adjacent level (that is, the immediate upper-level URL) of that URL.
For example, for the URL http://3g.sina.com.cn:80/3g/static/sina.gif?t1=1252192802, the URL itself is the first level, http://3g.sina.com.cn:80/3g/static/ is the second level, and http://3g.sina.com.cn:80/3g/ is the third level. http://3g.sina.com.cn:80/3g/static/ is therefore the immediate upper-level URL of the original URL, and http://3g.sina.com.cn:80/3g/ is the immediate upper-level URL of http://3g.sina.com.cn:80/3g/static/.
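The truncation rule described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names `parent_url` and `url_levels` are hypothetical, the predetermined number of separators is fixed at 1, and the "//" that follows the scheme is assumed not to count as a level separator.

```python
from typing import List, Optional


def parent_url(url: str) -> Optional[str]:
    """Return the immediate upper-level URL, or None at the top level.

    Assumption: the "//" that follows the scheme is not a level separator.
    """
    scheme_end = url.find("://")
    start = scheme_end + 3 if scheme_end != -1 else 0
    rest = url[start:]
    # Ignore a trailing "/" so that a directory-style URL moves up one level.
    trimmed = rest[:-1] if rest.endswith("/") else rest
    cut = trimmed.rfind("/")  # search backwards from the end for "/"
    if cut == -1:
        return None  # only the host part is left: no upper level exists
    return url[: start + cut + 1]


def url_levels(url: str) -> List[str]:
    """List the URL itself followed by each of its upper-level URLs."""
    levels = []
    current: Optional[str] = url
    while current is not None:
        levels.append(current)
        current = parent_url(current)
    return levels


for level in url_levels("http://3g.sina.com.cn:80/3g/static/sina.gif?t1=1252192802"):
    print(level)
```

Under these assumptions the loop also yields the host root http://3g.sina.com.cn:80/ as a fourth level, one step above the three levels listed in the example.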
It should be understood that, in the technical solution of the embodiments of the present invention, the way of determining the immediate upper-level URL is not limited to the above and may be some other way.
The technical solution of the present application is described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the scope of protection of the present application.
Fig. 1 is a schematic diagram of the URL class library building flow proposed by an embodiment of the present invention. For ease of description, the URL class library is assumed to store URL information as a data table in which each URL corresponds to one entry. The flow may include the following steps:
Step 101: record the entries for the lowest-level URLs in the URL class library. The entry for a URL records the URL, the predicted class of the URL, and the immediate upper-level URL of the URL.
Specifically, the URLs of the webpages that users visited during a past period (for example, one month) may be used as the lowest-level URLs in the URL class library, with the predicted class of each URL obtained by an existing webpage classification method. Alternatively, the URLs of some well-known websites may be used as seeds, a certain number of URLs obtained by crawling, and the obtained URLs used as the lowest-level URLs, again with the predicted class of each URL obtained by an existing webpage classification method. After the lowest-level URLs and their predicted classes have been obtained, the immediate upper-level URL of each lowest-level URL is obtained, and the information for each URL (the URL, its predicted class, and its immediate upper-level URL) is recorded in the URL class library.
Step 102: select an entry from the URL class library and obtain the immediate upper-level URL of the URL recorded in that entry.
Specifically, the entries in the URL class library are traversed and selected in order, and the immediate upper-level URL in the selected entry is obtained.
Step 103: determine whether the URL class library stores an entry for this immediate upper-level URL. If so, go to step 102; otherwise, go to step 104.
Specifically, when the URL class library stores an entry for this immediate upper-level URL, another entry is selected; when it does not, an entry for this immediate upper-level URL needs to be created.
Step 104: determine the predicted class of this immediate upper-level URL and its own immediate upper-level URL, and record them in the URL class library.
Specifically, the entries in the URL class library are traversed to obtain the entries that share this immediate upper-level URL, and the predicted class of the immediate upper-level URL is determined from the predicted classes of the URLs in those entries.
The predicted class of the immediate upper-level URL may specifically be determined as follows:
Obtain from the URL class library all URLs whose immediate upper-level URL is the URL whose class is to be predicted; count, among the URLs obtained, the number of URLs in each predicted class; and take the predicted class with the most URLs as the predicted class of the URL whose class is to be predicted.
For example, take the following 4 URLs:
http://www.chinaweekly.cn/bencandy.php?fid=48&id=5464, predicted class: history
http://www.chinaweekly.cn/bencandy.php?fid=48&id=5463, predicted class: history
http://www.chinaweekly.cn/bencandy.php?fid=48&id=5344, predicted class: history
http://www.chinaweekly.cn/bencandy.php?fid=49&id=5449, predicted class: news commentary
These four URLs share the same immediate upper-level URL, http://www.chinaweekly.cn/. Among the adjacent lower-level URLs of this upper-level URL, 3 have the predicted class history and 1 has the predicted class news commentary, so the predicted class of this upper-level URL is history.
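The counting rule in this example amounts to a majority vote over the lower-level URLs. The following Python sketch illustrates it; the function name `majority_class` is hypothetical, and the tie-breaking behavior (first class encountered wins) is an assumption, since the text does not say how ties are handled.

```python
from collections import Counter
from typing import Iterable


def majority_class(child_classes: Iterable[str]) -> str:
    """Return the predicted class held by the most lower-level URLs."""
    counts = Counter(child_classes)
    if not counts:
        raise ValueError("at least one lower-level URL is required")
    # Assumption: ties go to the class encountered first; the source is silent on ties.
    label, _ = counts.most_common(1)[0]
    return label


# The four lower-level URLs of http://www.chinaweekly.cn/ from the example.
children = {
    "http://www.chinaweekly.cn/bencandy.php?fid=48&id=5464": "history",
    "http://www.chinaweekly.cn/bencandy.php?fid=48&id=5463": "history",
    "http://www.chinaweekly.cn/bencandy.php?fid=48&id=5344": "history",
    "http://www.chinaweekly.cn/bencandy.php?fid=49&id=5449": "news commentary",
}
print(majority_class(children.values()))
```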
It should be noted that, in the technical solution provided by the embodiments of the present invention, the URL class library may also record a prediction probability for each URL. In that case, the entry for a URL contains the URL, its predicted class, its prediction probability, and its immediate upper-level URL. For a lowest-level URL, the predicted class and prediction probability are determined by an existing webpage classification method; for URLs of the other levels, they are determined from the predicted classes and prediction probabilities of the next-lower-level URLs of that URL.
Specifically, determining the predicted class and prediction probability of an immediate upper-level URL from the predicted classes and prediction probabilities of its next-lower-level URLs may be implemented as follows:
Obtain from the URL class library all URLs whose immediate upper-level URL is the URL whose class and probability are to be predicted; for each predicted class, calculate the weighted mean of the prediction probabilities of the URLs in that class; take the predicted class with the highest weighted mean as the predicted class of the URL to be predicted, and take the mean computed for that predicted class as the prediction probability of the URL to be predicted.
Taking the above 4 URLs again, suppose their prediction probabilities are 80%, 79%, 81% and 80% in turn. Among these 4 URLs, the weighted mean of the prediction probabilities of the URLs whose predicted class is history is 60% ((80%+79%+81%)/(3+1)), and that of the URLs whose predicted class is news commentary is 20% (80%/(3+1)). Each class sum is thus divided by the total number of lower-level URLs (3+1), not by the number of URLs in the class. The predicted class of the adjacent upper-level URL of these 4 URLs is therefore history, and its prediction probability is 60%.
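The weighted-mean calculation in this example can be sketched in Python as follows. The sketch follows the arithmetic of the worked example, in which the sum of the probabilities in each class is divided by the total number of lower-level URLs; the function name `aggregate` and the tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate(children: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Derive (predicted class, prediction probability) for an upper-level URL.

    children holds one (predicted class, prediction probability) pair per
    lower-level URL.
    """
    if not children:
        raise ValueError("at least one lower-level URL is required")
    totals: Dict[str, float] = defaultdict(float)
    for label, probability in children:
        totals[label] += probability
    # As in the worked example, divide each class sum by the number of ALL
    # lower-level URLs, not by the number of URLs in that class.
    n = len(children)
    scores = {label: total / n for label, total in totals.items()}
    best = max(scores, key=lambda label: scores[label])
    return best, scores[best]


label, probability = aggregate(
    [("history", 0.80), ("history", 0.79), ("history", 0.81), ("news commentary", 0.80)]
)
print(label, round(probability, 4))
```

Dividing by the total count means a class supported by many lower-level URLs outranks a class supported by a single high-probability URL, which appears to be the intent of the example.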
The above flow can be implemented by a computer program; the URL class library can also be configured manually according to the above principles.
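As an illustration of such a program, the following Python sketch walks through steps 101 to 104 using the majority-vote variant without prediction probabilities. It is a sketch under assumptions, not the patented implementation: `Entry`, `parent_url` and `build_library` are hypothetical names, the library is held in an in-memory dictionary, and the "//" after the scheme is assumed not to be a level separator.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    label: str             # predicted class
    parent: Optional[str]  # immediate upper-level URL, None at the top level


def parent_url(url: str) -> Optional[str]:
    """Cut the URL at its last "/" (ignoring a trailing one); None at host level."""
    start = url.find("://") + 3 if "://" in url else 0
    rest = url[start:]
    cut = rest.rfind("/", 0, len(rest) - 1)  # the end bound skips a trailing "/"
    return url[: start + cut + 1] if cut != -1 else None


def build_library(leaves: Dict[str, str]) -> Dict[str, Entry]:
    # Step 101: record each lowest-level URL, its class and its upper-level URL.
    library = {url: Entry(label, parent_url(url)) for url, label in leaves.items()}
    worklist = list(library)
    for url in worklist:  # Step 102; the list grows as upper levels are added
        parent = library[url].parent
        if parent is None or parent in library:  # Step 103: already recorded
            continue
        # Step 104: majority vote over the entries sharing this upper-level URL.
        # (A linear scan per new entry; fine for a sketch, slow for a large table.)
        votes = Counter(e.label for e in library.values() if e.parent == parent)
        library[parent] = Entry(votes.most_common(1)[0][0], parent_url(parent))
        worklist.append(parent)
    return library


library = build_library({
    "http://www.chinaweekly.cn/bencandy.php?fid=48&id=5464": "history",
    "http://www.chinaweekly.cn/bencandy.php?fid=48&id=5463": "history",
    "http://www.chinaweekly.cn/bencandy.php?fid=48&id=5344": "history",
    "http://www.chinaweekly.cn/bencandy.php?fid=49&id=5449": "news commentary",
})
print(library["http://www.chinaweekly.cn/"])
```

Because newly created upper-level entries are appended to the worklist, their own upper-level URLs are processed in the same pass, so the hierarchy is built up to the host root.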
It should be understood that, in the technical solution of the embodiments of the present invention, when the URL class library does not record the URL of the webpage to be classified, the class of the webpage need not be determined by querying the upper-level URLs one level at a time; it may also be determined by directly querying the predicted class of the upper-level URL two levels up, or of any other upper-level URL of that URL. In addition, the method of determining the predicted class of an immediate upper-level URL is not limited to the one described in the above flow and may be some other method.
Through the above flow, the upper-level URLs of the URLs recorded in the existing URL class library can be determined and the entries for these upper-level URLs stored in the URL class library, so that the entries stored in the URL class library form a hierarchy. The data structure of the URL information in the updated URL class library may be as shown in Table 2:
Table 2
Name          Description
url           URL
url_label     Predicted class
prediction    Prediction probability
faurlevel     Immediate upper-level URL
The fields have the following meanings:
url: URL of the webpage; String (UTF-8)
url_label: predicted class of the URL; String (UTF-8)
prediction: prediction probability of the URL; Double
faurlevel: immediate upper-level URL; String (UTF-8)
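For illustration, one entry of this table could be modeled in Python as shown below. The field names are taken from Table 2; the class name `UrlEntry` and the use of `None` for a top-level URL that has no upper level are assumptions, since the table does not say how that case is stored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UrlEntry:
    url: str                  # URL of the webpage (UTF-8 string)
    url_label: str            # predicted class of the URL (UTF-8 string)
    prediction: float         # prediction probability of the URL (double)
    faurlevel: Optional[str]  # upper-level URL; None for the top level is an assumption


# The upper-level entry derived in the worked example above.
entry = UrlEntry(
    url="http://www.chinaweekly.cn/",
    url_label="history",
    prediction=0.60,
    faurlevel=None,
)
print(entry)
```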
Based on the above URL class library, an embodiment of the present invention provides a webpage classification method. Fig. 2 is a schematic diagram of the flow of this method, which may include the following steps:
Step 201: obtain the URL of the webpage to be classified and query whether the URL class library records this URL.
Step 202: if an identical URL is found in the URL class library, go to step 204; otherwise, go to step 203.
Step 203: generate the immediate upper-level URL of this URL, query whether the URL class library records this immediate upper-level URL, and go to step 202.
Step 204: take the predicted class of the URL found as the class of the webpage to be classified.
Specifically, in the prior-art solution, an exact-match query is run directly against the URL class library with the URL: when a corresponding entry is found, the predicted class of the URL is returned; when none is found, a null value is returned.
In the technical solution provided by the embodiments of the present invention, by contrast, URLs are divided into levels and the entries for upper-level URLs are stored in the URL class library. When a webpage needs to be classified, an exact match is first attempted in the URL class library with the URL of the webpage. When the library stores no entry for that URL, the immediate upper-level URL of the URL is generated, the corresponding entry is queried in the class library with this immediate upper-level URL, and the predicted class of the immediate upper-level URL found is used as the predicted class of the URL of the webpage to be classified.
For example, suppose the URL of the webpage to be classified is http://sports.sina.com.cn/k/2011-05-18/09415581512.shtml and the current URL class library does not record this URL. The immediate upper-level URL, http://sports.sina.com.cn/k/2011-05-18/, is then generated and its entry is queried in the URL class library. If the library stores an entry for this immediate upper-level URL, its predicted class (for example, sports) is obtained by querying the library and used as the predicted class of the URL of the webpage to be classified.
It should be noted that when the search has reached the highest-level URL corresponding to the URL of the webpage to be classified and the URL class library still records no identical URL, a query failure response is returned.
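Steps 201 to 204, together with the failure case, can be sketched in Python as a lookup loop with fallback. The names `classify` and `parent_url` are hypothetical, the class library is reduced to a dictionary from URL to predicted class, and returning `None` stands in for the query failure response.

```python
from typing import Dict, Optional


def parent_url(url: str) -> Optional[str]:
    """Cut the URL at its last "/" (ignoring a trailing one); None at host level."""
    start = url.find("://") + 3 if "://" in url else 0
    rest = url[start:]
    cut = rest.rfind("/", 0, len(rest) - 1)  # the end bound skips a trailing "/"
    return url[: start + cut + 1] if cut != -1 else None


def classify(url: str, library: Dict[str, str]) -> Optional[str]:
    """Return the predicted class for url, falling back to upper-level URLs."""
    current: Optional[str] = url
    while current is not None:
        label = library.get(current)   # steps 201/202: exact-match query
        if label is not None:
            return label               # step 204: use the class of the URL found
        current = parent_url(current)  # step 203: move up one level
    return None  # highest level reached without a match: query failure


library = {"http://sports.sina.com.cn/k/2011-05-18/": "sports"}
print(classify("http://sports.sina.com.cn/k/2011-05-18/09415581512.shtml", library))
```

Each miss costs one additional dictionary lookup, so the worst case is bounded by the number of "/" separators in the URL rather than by a crawl and a model prediction.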
In the embodiments of the present invention, when new lowest-level URLs are added to the URL class library, a hierarchical update of the URL class library may be triggered by an event, started manually, or the like. Specifically, the lowest-level URLs stored in the URL class library may be traversed again and divided into levels to obtain the upper-level URLs and their predicted classes afresh. Alternatively, only the predicted classes of the upper-level URLs related to the newly added lowest-level URLs may be updated. The specific implementation is not repeated here.
Based on the same technical concept as the above webpage classification method, an embodiment of the present invention also provides a webpage classification device that can be applied to the above webpage classification method based on a URL class library. The URL class library records URLs of each level, where, for URLs of adjacent levels, the upper-level URL is obtained by truncating the lower-level URL, and a predicted class is recorded for each URL.
Fig. 3 is a schematic structural diagram of the webpage classification device provided by an embodiment of the present invention, which may comprise:
an upper-level URL generation module 31, configured to generate the upper-level URL of the URL of the webpage to be classified;
a query module 32, configured to query the URL class library with the URL of the webpage to be classified and, if no matching URL is found, to query the URL class library with the upper-level URL of that URL;
a determination module 33, configured to determine, when the query module 32 finds a matching URL, the class of the webpage to be classified from the predicted class of the URL found.
The upper-level URL generation module 31 is specifically configured to generate the immediate upper-level URL of the URL when the query module 32 finds no matching URL;
The query module 32 specifically queries the predicted class of the upper-level URL of the URL of the webpage to be classified through the following flow:
Step A: obtain the immediate upper-level URL of this URL and query whether the URL class library records this immediate upper-level URL;
Step B: if an identical URL is found in the URL class library, go to step C; otherwise, go to step A;
Step C: obtain the predicted class of the URL found;
The determination module 33 is specifically configured to take the predicted class of the URL found by the query module 32 as the class of the webpage to be classified.
The determination module 33 is further configured to return a query failure response when the query module 32 has reached the highest-level URL corresponding to the URL of the webpage to be classified and the URL class library still records no identical URL.
The webpage classification device further comprises a URL class library maintenance module 34;
The upper-level URL generation module 31 is specifically configured to traverse the URLs in the URL class library and, on reaching a URL, to select that URL from the URL class library and generate its immediate upper-level URL from the selected URL;
The query module 32 is specifically configured to query the URL class library with the immediate upper-level URL generated by the upper-level URL generation module 31;
The URL class library maintenance module 34 is configured to determine, when the query module 32 finds no matching URL, the predicted class of this immediate upper-level URL and to record this immediate upper-level URL and its predicted class in the URL class library.
The URL class library maintenance module 34 is specifically configured to determine the predicted class of each URL of any level other than the lowest from the predicted classes of the next-lower-level URLs of that URL.
The URL class library maintenance module 34 is specifically configured to obtain from the URL class library all URLs whose immediate upper-level URL is the URL whose class is to be predicted; to count, among the URLs obtained, the number of URLs in each predicted class; and to take the predicted class with the most URLs as the predicted class of the URL whose class is to be predicted.
Each URL in the URL class library may also have a corresponding prediction probability;
The URL class library maintenance module 34 is specifically configured to determine the predicted class and prediction probability of each URL of any level other than the lowest from the predicted classes and prediction probabilities of the next-lower-level URLs of that URL.
The URL class library maintenance module 34 is specifically configured to obtain from the URL class library all URLs whose immediate upper-level URL is the URL whose class and probability are to be predicted; for each predicted class, to calculate the weighted mean of the prediction probabilities of the URLs in that class; and to take the predicted class with the highest weighted mean as the predicted class of the URL to be predicted, and the mean computed for that predicted class as the prediction probability of the URL to be predicted.
When a new URL has been added to the URL class library,
the upper-level URL generation module 31 is further configured to generate the upper-level URL of this URL;
the query module 32 is specifically configured to query the URL class library with the upper-level URL of this URL;
the URL class library maintenance module 34 is specifically configured to update the predicted class of the upper-level URL if the query module 32 finds a matching URL, and to record this upper-level URL and its predicted class in the URL class library if the query module 32 finds no matching URL.
The upper-level URL generation module 31 is specifically configured to divide the URL into levels at the separators in the URL and to take the part of the URL up to the predetermined number of separators counted back from the end as the immediate upper-level URL of this URL.
From the description of the above embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the essence of the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to carry out the methods described in the embodiments of the present invention.
Those skilled in the art will understand that the accompanying drawings are schematic diagrams of a preferred embodiment, and that the modules or flows in the drawings are not necessarily required for implementing the present invention.
Those skilled in the art will understand that the modules of the device in an embodiment may be distributed in the device as described in the embodiment, or may be changed accordingly and located in one or more devices different from that of the embodiment. The modules of the above embodiments may be merged into one module or further split into several submodules.
The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.
The above discloses only a few specific embodiments of the present invention; the present invention is not limited to them, however, and any variation that a person skilled in the art can think of shall fall within the scope of protection of the present invention.

Claims (16)

1. A webpage classification method, characterized in that it is applied to a webpage classification flow implemented on the basis of a uniform resource locator (URL) classification library, the URL classification library recording URLs of each level and the predicted category of each URL, wherein, of the URLs of adjacent levels, the upper-level URL is obtained by truncating the lower-level URL, the method comprising:
querying the URL classification library according to the URL of a webpage to be classified;
if no matching URL is found, querying the URL classification library according to an upper-level URL of that URL and, when a matching URL is found, determining the category of the webpage to be classified according to the predicted category of the URL found.
2. the method for claim 1 is characterized in that, described upper strata URL inquiry URL class library according to this URL comprises:
Steps A, generate the last layer level URL of this URL, whether record this last layer level URL in the inquiry URL class library;
Record identical URL in the URL class library if step B inquires, then go to step C; Otherwise go to steps A;
Step C, obtain the prediction classification of the URL that inquires.
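The fallback lookup of claims 1 and 2 can be illustrated with a minimal Python sketch. It assumes the classification library is an in-memory dict mapping URL strings to predicted categories, that URLs are stored without a scheme, and that "/" is the only separator. The names `parent_url` and `classify` are hypothetical and do not appear in the patent.

```python
from typing import Dict, Optional


def parent_url(url: str) -> Optional[str]:
    """Truncate a URL at its last "/" (assumed separator); None at the top level."""
    head, sep, _ = url.rstrip("/").rpartition("/")
    return head if sep else None


def classify(url: str, library: Dict[str, str]) -> Optional[str]:
    """Look up the URL itself, then each successive upper-level URL (steps A to C)."""
    current: Optional[str] = url.rstrip("/")
    while current is not None:
        if current in library:            # step B: an identical URL is recorded
            return library[current]       # step C: return its predicted category
        current = parent_url(current)     # step A: generate the next-upper-level URL
    return None                           # no level of the URL is in the library


library = {"example.com/sports": "sports", "example.com": "portal"}
print(classify("example.com/sports/nba/game1.html", library))
print(classify("example.com/finance/x.html", library))
```

In this sketch an unknown page inherits the category of its nearest recorded ancestor; returning `None` when nothing matches is a choice made here, not something the claims specify.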
3. the method for claim 1 is characterized in that, the generative process of described URL class library comprises:
Travel through the URL in the described URL class library, and when traversing a URL, from described URL class library, select this URL, and generate the last layer level URL of this URL according to the URL that selects;
Already in whether the last layer level URL that judge to generate in the described URL class library, and when not having this last layer level URL in the described URL class library, determine the prediction classification of this last layer level URL, and this last layer level URL and prediction classification thereof are recorded in the described URL class library.
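One possible reading of this generation process is sketched below in Python. The sketch makes several assumptions that the claim does not state: the library is a dict of scheme-less URLs without trailing slashes, "/" is the separator, a missing upper-level URL gets the most common category among its recorded direct children, and deeper URLs are processed first so that every parent sees all of its children. `build_library` and `parent_url` are hypothetical names.

```python
import heapq
from collections import Counter
from typing import Dict, Optional


def parent_url(url: str) -> Optional[str]:
    """Truncate a URL at its last "/" (assumed separator); None at the top level."""
    head, sep, _ = url.rstrip("/").rpartition("/")
    return head if sep else None


def build_library(seed: Dict[str, str]) -> Dict[str, str]:
    """Add every missing upper-level URL, with a category voted by its children."""
    library = dict(seed)
    # Max-heap on depth (number of separators): deepest URLs are handled first.
    heap = [(-url.count("/"), url) for url in library]
    heapq.heapify(heap)
    while heap:
        _, url = heapq.heappop(heap)
        parent = parent_url(url)
        if parent is None or parent in library:
            continue  # top level reached, or the upper-level URL is already recorded
        votes = Counter(cat for u, cat in library.items() if parent_url(u) == parent)
        library[parent] = votes.most_common(1)[0][0]
        heapq.heappush(heap, (-parent.count("/"), parent))  # traverse the new URL too
    return library


seed = {
    "example.com/sports/nba/a.html": "sports",
    "example.com/sports/nba/b.html": "sports",
    "example.com/sports/nba/c.html": "news",
}
for url, category in sorted(build_library(seed).items()):
    print(url, category)
```

Scanning the whole library for the children of each parent is quadratic; a real implementation would more likely keep an index from parent to children.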
4. The method of any one of claims 1-3, characterized in that, except for URLs of the lowest level, the predicted category of a URL of any other level is determined according to the predicted categories of the next-lower-level URLs of that URL.
5. The method of claim 4, characterized in that determining the predicted category of a next-upper-level URL according to the predicted categories of its next-lower-level URLs specifically comprises:
obtaining from the URL classification library all URLs whose next-upper-level URL is the URL whose category is to be predicted;
determining, among the URLs obtained, the number of URLs in each predicted category;
determining the predicted category with the largest number of URLs as the predicted category of the URL whose category is to be predicted.
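The majority vote of claim 5 can be sketched in a few lines of Python. The dict-based library, the "/" separator, and the names `parent_url` and `predict_by_majority` are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, Optional


def parent_url(url: str) -> Optional[str]:
    """Truncate a URL at its last "/" (assumed separator); None at the top level."""
    head, sep, _ = url.rstrip("/").rpartition("/")
    return head if sep else None


def predict_by_majority(target: str, library: Dict[str, str]) -> Optional[str]:
    """Return the category held by the largest number of direct children of target."""
    # Collect the categories of all URLs whose next-upper-level URL is the target.
    child_categories = [cat for url, cat in library.items() if parent_url(url) == target]
    if not child_categories:
        return None  # the target has no recorded lower-level URLs
    # Count URLs per category and keep the most frequent one.
    return Counter(child_categories).most_common(1)[0][0]


library = {
    "example.com/news/a.html": "news",
    "example.com/news/b.html": "news",
    "example.com/news/c.html": "sports",
}
print(predict_by_majority("example.com/news", library))
```

The claim does not say how ties are broken; `Counter.most_common` keeps the category that was encountered first among equally frequent ones.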
6. The method of claim 4, characterized in that each URL in the URL classification library also has a corresponding prediction probability;
determining the predicted category and prediction probability of a next-upper-level URL according to the predicted categories and prediction probabilities of its next-lower-level URLs specifically comprises:
obtaining from the URL classification library all URLs whose next-upper-level URL is the URL whose category and probability are to be predicted;
for the URLs of each predicted category, calculating the weighted mean of the prediction probabilities of the URLs in that predicted category;
determining the predicted category with the highest weighted mean as the predicted category of the URL to be predicted, and determining the mean of the prediction probabilities of the URLs of that predicted category as the prediction probability of the URL to be predicted.
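The probability-based variant of claim 6 might look like the following Python sketch. The claim does not say what the weights are, so the sketch takes them as an optional, hypothetical per-URL mapping of positive numbers that defaults to 1.0 (in which case the weighted mean is the plain mean). The library layout, the "/" separator, and the function names are likewise assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def parent_url(url: str) -> Optional[str]:
    """Truncate a URL at its last "/" (assumed separator); None at the top level."""
    head, sep, _ = url.rstrip("/").rpartition("/")
    return head if sep else None


def weighted_mean(pairs: List[Tuple[float, float]]) -> float:
    """Weighted mean of (probability, weight) pairs; weights are assumed positive."""
    return sum(p * w for p, w in pairs) / sum(w for _, w in pairs)


def predict_with_probability(
    target: str,
    library: Dict[str, Tuple[str, float]],
    weights: Optional[Dict[str, float]] = None,
) -> Optional[Tuple[str, float]]:
    """Pick the category with the highest weighted mean probability among children."""
    weights = weights or {}
    by_category: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for url, (category, prob) in library.items():
        if parent_url(url) == target:
            by_category[category].append((prob, weights.get(url, 1.0)))
    if not by_category:
        return None  # the target has no recorded lower-level URLs
    best = max(by_category, key=lambda c: weighted_mean(by_category[c]))
    probs = [p for p, _ in by_category[best]]
    # The parent's probability is the plain mean over the winning category.
    return best, sum(probs) / len(probs)


library = {
    "example.com/news/a": ("sports", 0.875),
    "example.com/news/b": ("sports", 0.625),
    "example.com/news/c": ("news", 0.5),
}
print(predict_with_probability("example.com/news", library))
```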
7. the method for claim 1 is characterized in that, when having increased new URL in the described URL class library, generate the upper strata URL of this URL, and according to the upper strata URL of described URL inquiry URL class library, if inquire the URL of coupling, then upgrade the prediction classification of this upper strata URL; If do not inquire the URL of coupling, record this upper strata URL and corresponding prediction classification in the URL class library.
8. the method for claim 1 is characterized in that, determines the last layer level URL of URL, is specially:
According to the separator among the URL URL is carried out level and divide, and with the last layer level URL of the field of this URL before the predetermined number separator forward of position, end as this URL.
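The truncation rule of claim 8 can be sketched in Python as follows. The sketch assumes that URLs are handled without their scheme (so that "//" does not count as a level boundary) and that the separator defaults to "/". The function name `upper_level_url` and its `count` parameter, which stands for the predetermined number, are hypothetical.

```python
from typing import Optional


def upper_level_url(url: str, separator: str = "/", count: int = 1) -> Optional[str]:
    """Return the part of url before the count-th separator counted from the end.

    Returns None when the URL has too few levels to be truncated that far.
    """
    parts = url.rstrip(separator).split(separator)  # divide the URL into levels
    if len(parts) <= count:
        return None
    return separator.join(parts[:-count])           # drop the last `count` levels


print(upper_level_url("example.com/news/sports/a.html"))
print(upper_level_url("example.com/news/sports/a.html", count=2))
```

Other separators, such as "?" or "&" in a query string, could be handled the same way by passing a different `separator`; whether the patent intends that is not stated in this claim.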
9. A webpage classification device, characterized in that it is applied to a webpage classification flow implemented on the basis of a uniform resource locator (URL) classification library, the URL classification library recording URLs of each level and the predicted category of each URL, wherein, of the URLs of adjacent levels, the upper-level URL is obtained by truncating the lower-level URL, the device comprising:
an upper-level URL generation module, configured to generate the upper-level URL of the URL of a webpage to be classified;
a query module, configured to query the URL classification library according to the URL of the webpage to be classified and, if no matching URL is found, to query the URL classification library according to the upper-level URL of that URL;
a determination module, configured to determine, when the query module finds a matching URL, the category of the webpage to be classified according to the predicted category of the URL found.
10. The device of claim 9, characterized in that:
the upper-level URL generation module is specifically configured to generate the next-upper-level URL of the URL when the query module finds no matching URL;
the query module specifically queries the predicted category of the upper-level URL of the URL of the webpage to be classified by the following flow:
Step A: obtaining the next-upper-level URL of the URL, and querying whether that next-upper-level URL is recorded in the URL classification library;
Step B: if an identical URL is found recorded in the URL classification library, going to step C; otherwise, going to step A;
Step C: obtaining the predicted category of the URL found;
the determination module is specifically configured to determine the predicted category of the URL found by the query module as the category of the webpage to be classified.
11. The device of claim 9, characterized in that it further comprises a URL classification library maintenance module;
the upper-level URL generation module is specifically configured to traverse the URLs in the URL classification library and, on reaching a URL, to select that URL from the URL classification library and generate its next-upper-level URL from the selected URL;
the query module is specifically configured to query the URL classification library according to the next-upper-level URL generated by the upper-level URL generation module;
the URL classification library maintenance module is configured to determine, when the query module finds no matching URL, the predicted category of that next-upper-level URL, and to record the next-upper-level URL and its predicted category in the URL classification library.
12. The device of any one of claims 9-11, characterized in that the URL classification library maintenance module is specifically configured to determine the predicted category of URLs of every level other than the lowest level according to the predicted categories of the next-lower-level URLs of each such URL.
13. The device of claim 15, characterized in that the URL classification library maintenance module is specifically configured to: obtain from the URL classification library all URLs whose next-upper-level URL is the URL whose category is to be predicted; determine, among the URLs obtained, the number of URLs in each predicted category; and determine the predicted category with the largest number of URLs as the predicted category of the URL whose category is to be predicted.
14. The device of claim 12, characterized in that each URL in the URL classification library also has a corresponding prediction probability;
the URL classification library maintenance module is specifically configured to: obtain from the URL classification library all URLs whose next-upper-level URL is the URL whose category and probability are to be predicted; for the URLs of each predicted category, calculate the weighted mean of the prediction probabilities of the URLs in that predicted category; and determine the predicted category with the highest weighted mean as the predicted category of the URL to be predicted, and determine the mean of the prediction probabilities of the URLs of that predicted category as the prediction probability of the URL to be predicted.
15. The device of claim 12, wherein, when a new URL is added to the URL classification library:
the upper-level URL generation module is further configured to generate the upper-level URL of that URL;
the query module is specifically configured to query the URL classification library according to the upper-level URL of that URL;
the URL classification library maintenance module is specifically configured to update the predicted category of the upper-level URL if the query module finds a matching URL, and to record the upper-level URL and its corresponding predicted category in the URL classification library if the query module finds no matching URL.
16. The device of claim 9, characterized in that the upper-level URL generation module is specifically configured to divide the URL into levels according to the separators in the URL, and to take the portion of the URL that precedes the separator located a predetermined number of separators forward from the end of the URL as the next-upper-level URL of that URL.
CN201110249270.2A 2011-08-26 2011-08-26 A kind of Web page classification method and equipment Active CN102955810B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110249270.2A CN102955810B (en) 2011-08-26 2011-08-26 A kind of Web page classification method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110249270.2A CN102955810B (en) 2011-08-26 2011-08-26 A kind of Web page classification method and equipment

Publications (2)

Publication Number Publication Date
CN102955810A true CN102955810A (en) 2013-03-06
CN102955810B CN102955810B (en) 2015-12-02

Family

ID=47764622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110249270.2A Active CN102955810B (en) 2011-08-26 2011-08-26 A kind of Web page classification method and equipment

Country Status (1)

Country Link
CN (1) CN102955810B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103646119A (en) * 2013-12-26 2014-03-19 北京西塔网络科技股份有限公司 Method and device for generating user behavior record
CN105912736A (en) * 2016-06-28 2016-08-31 迈普通信技术股份有限公司 URL classifying method and device
CN106294443A (en) * 2015-05-28 2017-01-04 上海池乐信息科技有限公司 The URL classification recognition methods in a kind of knowledge based storehouse and system
CN106294442A (en) * 2015-05-28 2017-01-04 上海池乐信息科技有限公司 A kind of internet information classifying identification method based on URL and system
CN103914534B (en) * 2014-03-31 2017-03-15 郭磊 Content of text sorting technique based on specialist system URL classification knowledge base
CN106528556A (en) * 2015-09-10 2017-03-22 北京国双科技有限公司 Analysis method and device for website access data
CN107545020A (en) * 2017-05-10 2018-01-05 新华三信息安全技术有限公司 A kind of determination method and device of Web page classifying
CN110472125A (en) * 2019-08-23 2019-11-19 厦门商集网络科技有限责任公司 A kind of the cascade crawling method and equipment of the multi-interface based on web crawlers

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776825B (en) * 2016-11-24 2020-06-02 竹间智能科技(上海)有限公司 User preference entity classification method and system based on hierarchical mapping

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7565350B2 (en) * 2006-06-19 2009-07-21 Microsoft Corporation Identifying a web page as belonging to a blog
CN101630330A (en) * 2009-08-14 2010-01-20 苏州锐创通信有限责任公司 Method for webpage classification
CN1592229B (en) * 2003-08-25 2010-10-06 微软公司 Electronic communications and web pages filtering based on URL
CN101872347A (en) * 2009-04-22 2010-10-27 富士通株式会社 Method and device for judging type of webpage

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1592229B (en) * 2003-08-25 2010-10-06 微软公司 Electronic communications and web pages filtering based on URL
US7565350B2 (en) * 2006-06-19 2009-07-21 Microsoft Corporation Identifying a web page as belonging to a blog
CN101872347A (en) * 2009-04-22 2010-10-27 富士通株式会社 Method and device for judging type of webpage
CN101630330A (en) * 2009-08-14 2010-01-20 苏州锐创通信有限责任公司 Method for webpage classification

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103646119A (en) * 2013-12-26 2014-03-19 北京西塔网络科技股份有限公司 Method and device for generating user behavior record
CN103914534B (en) * 2014-03-31 2017-03-15 郭磊 Content of text sorting technique based on specialist system URL classification knowledge base
CN106294443A (en) * 2015-05-28 2017-01-04 上海池乐信息科技有限公司 The URL classification recognition methods in a kind of knowledge based storehouse and system
CN106294442A (en) * 2015-05-28 2017-01-04 上海池乐信息科技有限公司 A kind of internet information classifying identification method based on URL and system
CN106528556A (en) * 2015-09-10 2017-03-22 北京国双科技有限公司 Analysis method and device for website access data
CN106528556B (en) * 2015-09-10 2019-07-30 北京国双科技有限公司 The analysis method and device of website visitation data
CN105912736A (en) * 2016-06-28 2016-08-31 迈普通信技术股份有限公司 URL classifying method and device
CN107545020A (en) * 2017-05-10 2018-01-05 新华三信息安全技术有限公司 A kind of determination method and device of Web page classifying
CN110472125A (en) * 2019-08-23 2019-11-19 厦门商集网络科技有限责任公司 A kind of the cascade crawling method and equipment of the multi-interface based on web crawlers

Also Published As

Publication number Publication date
CN102955810B (en) 2015-12-02

Similar Documents

Publication Publication Date Title
CN102955810B (en) A kind of Web page classification method and equipment
US9449271B2 (en) Classifying resources using a deep network
US7779001B2 (en) Web page ranking with hierarchical considerations
CN107451861B (en) Method for identifying user internet access characteristics under big data
CN102117321B (en) The automatic discovery that subject areas is discussed is assembled and tissue
US9317613B2 (en) Large scale entity-specific resource classification
US7672943B2 (en) Calculating a downloading priority for the uniform resource locator in response to the domain density score, the anchor text score, the URL string score, the category need score, and the link proximity score for targeted web crawling
US20100205168A1 (en) Thread-Based Incremental Web Forum Crawling
CN102999586B (en) A kind of method and apparatus of recommendation of websites
RU2720954C1 (en) Search index construction method and system using machine learning algorithm
CN103329151A (en) Recommendations based on topic clusters
US20100030768A1 (en) Classifying documents using implicit feedback and query patterns
CN111143655B (en) Method for calculating news popularity
CN104090919A (en) Advertisement recommending method and advertisement recommending server
CN104423621A (en) Pinyin string processing method and device
US8898151B2 (en) System and method for filtering documents
JP2014515514A (en) Method and apparatus for providing suggested words
WO2009031759A1 (en) Method and system for generating search collection of query
CN103810162A (en) Method and system for recommending network information
CN101211368B (en) Method for classifying search term, device and search engine system
US7769749B2 (en) Web page categorization using graph-based term selection
CN103425767B (en) A kind of determination method and system pointing out data
US10146876B2 (en) Predicting real-time change in organic search ranking of a website
JP6680663B2 (en) Information processing apparatus, information processing method, prediction model generation apparatus, prediction model generation method, and program
CN116226494B (en) Crawler system and method for information search

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20170223

Address after: Kolding road high tech Zone of Suzhou City, Jiangsu Province, No. 78 215163

Patentee after: CHINA MOBILE (SUZHOU) SOFTWARE TECHNOLOGY CO., LTD.

Patentee after: China Mobile Communications Co., Ltd.

Patentee after: Chellona Mobile Communications Corporation Cmcc

Address before: 100032 Beijing Finance Street, No. 29, Xicheng District

Patentee before: Chellona Mobile Communications Corporation Cmcc