CN106649384B - Method and apparatus for classifying URLs - Google Patents

Method and apparatus for classifying URLs

Info

Publication number
CN106649384B
Authority
CN
China
Prior art keywords
url
user
characteristic information
webpage
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510733512.3A
Other languages
Chinese (zh)
Other versions
CN106649384A (en)
Inventor
赵钧
石屹嵘
黄磊
邱晨旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and apparatus for classifying a URL, relating to the fields of big data and Internet technology. The method comprises: obtaining the user characteristic information of each user who accesses a URL and the number of times each user accesses the URL, the user characteristic information comprising user tags determined from the user's historical Internet behavior and the weight of each user tag; determining URL characteristic information from the obtained user characteristic information of each user and the number of times each user accesses the URL, the URL characteristic information comprising the web page types of the URL and the weight of each web page type; and classifying the URL according to the URL characteristic information. The invention can improve the efficiency of URL classification.

Description

Method and apparatus for classifying URLs
Technical field
The present invention relates to the fields of big data and Internet technology, and in particular to a method and apparatus for classifying a URL (Uniform Resource Locator).
Background
At present, analysis of users' Internet behavior based on DPI (Deep Packet Inspection) data is mainly implemented by matching the addresses a user visits against a URL address library and then tagging the user accordingly.
A URL address library is generally built by classifying URLs with web page content extraction and recognition techniques. However, the inventors have found that classifying URLs in this way has the following drawbacks:
First, because a customized algorithm has to be designed for each website, classifying URLs involves a heavy workload and is inefficient.
Second, after a website is redesigned, its URLs have to be reclassified by manual inspection or re-recognition, so the URL address library cannot be updated automatically.
Summary of the invention
One technical problem to be solved by the embodiments of the present invention is the low efficiency of URL classification.
According to one aspect of the present invention, a method for classifying a URL is provided, comprising: obtaining the user characteristic information of each user who accesses a URL and the number of times each user accesses the URL, the user characteristic information comprising user tags determined from the user's historical Internet behavior and the weight of each user tag; determining URL characteristic information from the obtained user characteristic information of each user and the number of times each user accesses the URL, the URL characteristic information comprising the web page types of the URL and the weight of each web page type; and classifying the URL according to the URL characteristic information.
In one embodiment, determining the URL characteristic information from the obtained user characteristic information of each user and the number of times each user accesses the URL comprises: calculating the label vector $u_j$ of each user $j$ who accesses the URL according to $u_j = (x_{j1} \times k_{j1},\ x_{j2} \times k_{j2},\ \ldots,\ x_{jn} \times k_{jn}) \times p_j / P$, where $j$ is a positive integer, $1 \le j \le S$, $S$ is the total number of users who access the URL, $x_{jn}$ is a user tag of user $j$, $k_{jn}$ is the weight of user tag $x_{jn}$, $jn$ is a positive integer, $p_j$ is the number of times user $j$ accesses the URL, and $P$ is the total number of times all users access the URL; accumulating the weights of identical user tags across the label vectors $u_j$ of the users and sorting the user tags by their accumulated coefficients to obtain the label vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$, where [formula not reproduced in this text], $x_t$ is a user tag, and the coefficient $c_t$ of user tag $x_t$ is the sum of the weights of the user tags identical to $x_t$ in the label vectors $u_j$ of the $S$ users; and selecting from the label vector $y$ of the URL the $m$ user tags $x_1, x_2, \ldots, x_m$ with the largest coefficients as the web page types of the URL, and taking [formula not reproduced in this text] as the weight of web page type $x_i$.
In one embodiment, classifying the URL according to the URL characteristic information comprises: selecting the one or more web page types with the largest weights as the web page types of the URL, thereby classifying the URL.
In one embodiment, the method further comprises: filtering, from the collected DPI data, URLs whose total number of accesses exceeds a preset threshold as the URL to be classified.
In one embodiment, the method further comprises: collecting the web page content of the URL and identifying the web page type of the URL from that content with a specific algorithm, thereby classifying the URL; comparing this classification result with the result of classifying the URL according to the URL characteristic information; and adjusting the preset threshold according to the comparison result.
According to another aspect of the present invention, an apparatus for classifying a URL is provided, comprising: a user characteristic information obtaining module for obtaining the user characteristic information of each user who accesses a URL and the number of times each user accesses the URL, the user characteristic information comprising user tags determined from the user's historical Internet behavior and the weight of each user tag; a URL characteristic information determining module for determining URL characteristic information from the obtained user characteristic information of each user and the number of times each user accesses the URL, the URL characteristic information comprising the web page types and the weight of each web page type; and a URL classification module for classifying the URL according to the URL characteristic information.
In one embodiment, the URL characteristic information determining module comprises: a user label calculation unit for calculating the label vector $u_j$ of each user $j$ who accesses the URL according to $u_j = (x_{j1} \times k_{j1},\ x_{j2} \times k_{j2},\ \ldots,\ x_{jn} \times k_{jn}) \times p_j / P$, where $j$ is a positive integer, $1 \le j \le S$, $S$ is the total number of users who access the URL, $x_{jn}$ is a user tag of user $j$, $k_{jn}$ is the weight of user tag $x_{jn}$, $jn$ is a positive integer, $p_j$ is the number of times user $j$ accesses the URL, and $P$ is the total number of times all users access the URL; a URL label calculation unit for accumulating the weights of identical user tags across the label vectors $u_j$ of the users and sorting the user tags by their accumulated coefficients to obtain the label vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$, where [formula not reproduced in this text], $x_t$ is a user tag, and the coefficient $c_t$ of user tag $x_t$ is the sum of the weights of the user tags identical to $x_t$ in the label vectors $u_j$ of the $S$ users; and a URL characteristic information determination unit for selecting from the label vector $y$ of the URL the $m$ user tags $x_1, x_2, \ldots, x_m$ with the largest coefficients as the web page types of the URL, and taking [formula not reproduced in this text] as the weight of web page type $x_i$.
In one embodiment, the URL classification module is specifically configured to select the one or more web page types with the largest weights as the web page types of the URL, thereby classifying the URL.
In one embodiment, the apparatus further comprises: a DPI data analysis module for filtering, from the collected DPI data, URLs whose total number of accesses exceeds a preset threshold as the URL to be classified.
In one embodiment, the apparatus further comprises: a web page content acquisition module for collecting the web page content of the URL and identifying the web page type of the URL from that content with a specific algorithm, thereby classifying the URL; a comparison module for comparing this classification result with the result of classifying the URL according to the URL characteristic information; and an adjustment module for adjusting the preset threshold according to the comparison result.
By obtaining the user characteristic information of each user who accesses a URL, the present invention can determine the characteristic information of the URL and hence its web page type, and thus classify the URL. On the one hand, this approach requires no customized algorithm for each website, so classification is efficient. On the other hand, when a website is redesigned and its web page type changes, the characteristic information of the URL can still be obtained from the user characteristic information of the users who access it, so the URL can be reclassified promptly and the URL address library updated automatically.
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of one embodiment of the method for classifying a URL according to the present invention;
Fig. 2 is a schematic diagram of one example of the method for classifying a URL according to the present invention;
Fig. 3 is a structural diagram of one embodiment of the apparatus for classifying a URL according to the present invention;
Fig. 4 is a structural diagram of another embodiment of the apparatus for classifying a URL according to the present invention;
Fig. 5 is a structural diagram of yet another embodiment of the apparatus for classifying a URL according to the present invention;
Fig. 6 is a structural diagram of a further embodiment of the apparatus for classifying a URL according to the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the invention.
It should also be understood that, for ease of description, the parts shown in the drawings are not drawn to actual scale.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be regarded as part of the granted specification.
In all examples shown and discussed here, any specific value should be interpreted as merely illustrative, not as a limitation. Other examples of the exemplary embodiments may therefore use different values.
Note also that similar reference numerals and letters denote similar items in the following drawings, so once an item is defined in one drawing it need not be discussed further in subsequent drawings.
The inventors have found, on the basis of big data statistics, that when a URL is accessed by a large number of users, the content of its web page reflects the common demand of those users rather than the particular demand of any single user. It is therefore proposed to derive the characteristic information of a URL in reverse from the user characteristic information of the users who access it. The present invention can be used for telecom DPI user behavior analysis: it can quickly classify heavily accessed URLs and identify the category of newly added URLs, and it can further improve the quality and quantity of URL classification on top of existing manual review and URL feature recognition based on web page analysis.
Fig. 1 is a flow diagram of one embodiment of the method for classifying a URL according to the present invention. As shown in Fig. 1, the method includes:
Step 102: obtain the user characteristic information of each user who accesses a URL and the number of times each user accesses the URL, where the user characteristic information includes user tags determined from the user's historical Internet behavior and the weight of each user tag.
Here, the user characteristic information of each user can be obtained from the user's historical Internet behavior. For example, if a user frequently visits finance websites and sports websites, the user can be given two user tags: finance website and sports website. The weights of the two tags can be obtained from the number of times the user visits each kind of website, which yields the user characteristic information. For example, the user characteristic information may be: user tags finance website and sports website, with a weight of 20% for finance website and 80% for sports website.
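The derivation of tag weights from visit counts can be sketched as follows. This is a minimal illustration, not the patent's stated procedure: the text only says the weights are obtained from the number of visits, so normalizing each count by the user's total visits is an assumption, and the function name `user_tag_weights` and the input format are hypothetical.

```python
from collections import Counter


def user_tag_weights(visited_categories):
    """Return {tag: weight}, taking each weight as the tag's share of all visits.

    Assumption: weights are visit counts normalized by the user's total visits.
    """
    counts = Counter(visited_categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# Hypothetical browsing history: 8 visits to sports sites, 2 to finance sites.
history = ["sports"] * 8 + ["finance"] * 2
weights = user_tag_weights(history)
print(weights)
```

With this history the sports tag receives four times the weight of the finance tag, matching the 80% and 20% split in the example above.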
Furthermore it is possible to be adjusted to the quantity of the user tag in user's characteristic information, such as reduce user tag Quantity, so as to adjust the quantity of the type of webpage in final URL characteristic information.
Step 104, it is determined according to the access times that the user's characteristic information of each user got and each user access URL URL characteristic information, the URL characteristic information include the type of webpage of URL and the weight of each type of webpage.
The user's characteristic information of each user can react URL characteristic information, will provide illustrative detailed description hereinafter.
Step 106, classified according to URL characteristic information to the URL.
To get the weight of the type of webpage and each type of webpage that have arrived URL after the characteristic information for obtaining URL, one In a embodiment, web page class of the maximum one or more type of webpage as URL in the weight of each type of webpage can choose Type, to classify to URL.
The present embodiment can determine the characteristic information of URL by obtaining the user's characteristic information of each user of access URL, from And can determine the type of webpage of URL, to classify to URL.This mode classification one side, without being directed to different URL Website design personalization algorithm, classification effectiveness are high;On the other hand, after different URL website revisions, i.e., type of webpage becomes When change, since the characteristic information of URL can be obtained according to the user's characteristic information for accessing the URL, so as in time to URL weight Newly classify, automatically updates URL address base.
As a specific embodiment, step 104 shown in Fig. 1 can be implemented as follows.
First, the label vector $u_j$ of each user $j$ who accesses the URL is calculated according to $u_j = (x_{j1} \times k_{j1},\ x_{j2} \times k_{j2},\ \ldots,\ x_{jn} \times k_{jn}) \times p_j / P$, where $j$ is a positive integer, $1 \le j \le S$, $S$ is the total number of users who access the URL, $x_{jn}$ is a user tag of user $j$, $k_{jn}$ is the weight of user tag $x_{jn}$, $jn$ is a positive integer, $p_j$ is the number of times user $j$ accesses the URL, and $P$ is the total number of times all users access the URL.
Then, the weights of identical user tags across the label vectors $u_j$ of the users are accumulated, and the user tags are sorted by their accumulated coefficients, for example in ascending or descending order, to obtain the label vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$, where [formula not reproduced in this text]. If the user tags of the users are all different, then [formula not reproduced in this text]. Here $x_t$ is a user tag, and its coefficient $c_t$ is the sum of the weights of the user tags identical to $x_t$ in the label vectors $u_j$ of the $S$ users. Specifically, summing over the users $j$ whose label vector contains a tag $x_{jh} = x_t$, $c_t$ can be expressed as $c_t = \sum_{j:\, x_{jh} = x_t} k_{jh} \times p_j / P$, where $k_{jh} \in (k_{j1}, k_{j2}, \ldots, k_{jn})$ and $x_{jh} \in (x_{j1}, x_{j2}, \ldots, x_{jn})$.
Finally, the $m$ user tags $x_1, x_2, \ldots, x_m$ with the largest coefficients are selected from the label vector $y$ of the URL as the web page types of the URL, and [formula not reproduced in this text] is taken as the weight of web page type $x_i$. That is, [formulas not reproduced in this text] are the weights of web page types $x_1, x_2, \ldots, x_m$ respectively.
In this embodiment, the label vector of each user is obtained from the user characteristic information and the number of times each user accesses the URL, and the label vector of the URL is obtained from the label vectors of the users, which yields the characteristic information of the URL.
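The three stages of this embodiment can be sketched in code. This is an illustrative sketch under stated assumptions, not a definitive implementation: the function names, the dictionary representation of a label vector, and the choice of descending order are hypothetical. Because the expression for the final weight of each web page type is not available in this text, the sketch returns the raw accumulated coefficients $c_t$ rather than a weight.

```python
from collections import defaultdict


def user_label_vector(tag_weights, visits, total_visits):
    """u_j: scale each tag weight k_jn by the user's share p_j / P of all accesses."""
    return {tag: k * visits / total_visits for tag, k in tag_weights.items()}


def url_label_vector(users):
    """y: accumulate the coefficients of identical tags and sort them, largest first.

    `users` is a list of (tag_weights, visits) pairs, one pair per user j.
    """
    total_visits = sum(visits for _, visits in users)  # P
    if total_visits == 0:
        return []
    coefficients = defaultdict(float)
    for tag_weights, visits in users:
        u_j = user_label_vector(tag_weights, visits, total_visits)
        for tag, value in u_j.items():
            coefficients[tag] += value  # contributes to c_t
    return sorted(coefficients.items(), key=lambda item: item[1], reverse=True)


def top_page_types(label_vector, m):
    """Select the m tags with the largest coefficients as the web page types."""
    return label_vector[:m]


# Hypothetical input: two users with tags "a" and "b", 1 and 3 accesses.
users = [({"a": 0.5, "b": 0.5}, 1), ({"a": 1.0}, 3)]
y = url_label_vector(users)
print(y)
print(top_page_types(y, 1))
```

Keeping the per-user scaling in its own function mirrors the split between the user label calculation and the URL label calculation described above.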
It should be understood that although the above embodiment implements step 104 of Fig. 1 by means of label vectors, this is not limiting; those skilled in the art may determine the URL characteristic information from the user characteristic information of each user and the number of times each user accesses the URL in other ways.
The method for classifying a URL according to the present invention is described in detail below with an example, with reference to Fig. 2.
As shown in Fig. 2, the URL http://x.x.com is accessed $P = 10$ times in total. User A accesses the URL $p_1 = 2$ times, and user B accesses it $p_2 = 8$ times.
The user characteristic information of user A is: news, weight 0.6; shopping, weight 0.2; sports, weight 0.1.
The label vector of user A is $u_1 = (x_{11} \times k_{11},\ x_{12} \times k_{12},\ \ldots,\ x_{1n} \times k_{1n}) \times p_1 / P$ = (news × 0.6, shopping × 0.2, sports × 0.1) × 2/10 = (news × 0.12, shopping × 0.04, sports × 0.02).
The user characteristic information of user B is: shopping, weight 0.5; baby and child, weight 0.3; video, weight 0.1.
The label vector of user B is $u_2 = (x_{21} \times k_{21},\ x_{22} \times k_{22},\ \ldots,\ x_{2n} \times k_{2n}) \times p_2 / P$ = (shopping × 0.5, baby and child × 0.3, video × 0.1) × 8/10 = (shopping × 0.4, baby and child × 0.24, video × 0.08).
Adding the weights of identical user tags in the label vector $u_1$ of user A and the label vector $u_2$ of user B (that is, adding the shopping weights 0.04 + 0.4) gives the label vector of the URL http://x.x.com: $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$ = (news × 0.12, shopping × 0.44, sports × 0.02, baby and child × 0.24, video × 0.08).
The two types with the largest coefficients, shopping and baby and child, are selected as the web page types of the URL; alternatively, only the largest one, shopping, is selected as the web page type of the URL. The URL is classified accordingly.
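The arithmetic of the Fig. 2 example can be checked with a short script. It uses only the numbers given above; the variable names and the dictionary layout are hypothetical, and the coefficients are rounded to two decimals to hide floating-point noise.

```python
from collections import Counter

P = 10  # total number of accesses to http://x.x.com
users = {
    "A": ({"news": 0.6, "shopping": 0.2, "sports": 0.1}, 2),
    "B": ({"shopping": 0.5, "baby and child": 0.3, "video": 0.1}, 8),
}

# Accumulate k * p_j / P per tag across both users.
y = Counter()
for tag_weights, p_j in users.values():
    for tag, k in tag_weights.items():
        y[tag] += k * p_j / P

# Sort by accumulated coefficient, largest first.
ranked = [(tag, round(c, 2)) for tag, c in y.most_common()]
print(ranked)
print([tag for tag, _ in ranked[:2]])
```

The first two entries of `ranked` are shopping and baby and child, the same two web page types selected in the example.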
It should be understood that Fig. 2 schematically shows an example in which two users access the URL. In practice, the method provided by the present invention is particularly suitable for URLs with many accesses. In one embodiment, URLs whose total number of accesses exceeds a preset threshold may be filtered out of the collected DPI data as the URLs to be classified, to increase the accuracy of classification. For example, the number of accesses to each URL in the DPI data over a given period is counted, the URLs are sorted by that count, and those whose total number of accesses exceeds the preset threshold are selected as the URLs to be classified.
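The filtering step can be sketched as follows. The patent does not specify the layout of DPI data, so the record format of (user, URL) pairs, the function name `frequent_urls`, and the sample values are assumptions made for illustration.

```python
from collections import Counter


def frequent_urls(dpi_records, threshold):
    """Return URLs whose total access count exceeds `threshold`, most accessed first.

    Assumption: `dpi_records` is an iterable of (user_id, url) pairs,
    one pair per access observed in the chosen period.
    """
    totals = Counter(url for _, url in dpi_records)
    return [url for url, count in totals.most_common() if count > threshold]


# Hypothetical records: URL "a" is accessed three times, URL "b" once.
records = [("u1", "a"), ("u2", "a"), ("u1", "b"), ("u3", "a")]
print(frequent_urls(records, 2))
```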
In addition, in one embodiment, to verify the correctness of the classification results, the method for classifying a URL may further include the following steps:
Step S1: collect the web page content of the URL, and classify the URL according to that content with a specific algorithm.
For example, the web page content of the URL is collected by manual review or web crawling, and the web page type of the URL is identified from that content by a text mining algorithm, thereby classifying the URL. Here, the text mining algorithm has to be adapted to each different URL.
Step S2: compare the classification result obtained in step S1 with the result of classifying the URL according to the URL characteristic information.
Step S3: adjust the preset threshold according to the comparison result.
If the two results are inconsistent, the preset threshold can be raised, so that classifying the URL according to the URL characteristic information gives a more accurate result. If the two results are consistent, the preset threshold need not be adjusted.
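Steps S2 and S3 can be sketched as a single function. The patent says only that the threshold is raised when the two results disagree; the fixed increment `step`, the function name, and the reading of "consistent" as the content-based type appearing among the types derived from user characteristic information are all assumptions.

```python
def adjust_threshold(threshold, content_based_type, user_based_types, step=100):
    """Raise the access-count threshold when the two classification results disagree.

    Assumptions: the results agree if the content-based type is one of the
    user-based types, and the threshold grows by a fixed hypothetical `step`.
    """
    if content_based_type in user_based_types:
        return threshold  # consistent results: leave the threshold unchanged
    return threshold + step  # inconsistent results: require more accesses


print(adjust_threshold(1000, "shopping", ["shopping", "baby and child"]))
print(adjust_threshold(1000, "news", ["shopping", "baby and child"]))
```

A larger threshold restricts classification to URLs with more accesses, which is the direction of adjustment the text describes for improving accuracy.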
By comparing the two classification results, this embodiment can verify the correctness of the method of the present invention for classifying URLs, and the preset threshold can be adjusted promptly according to the verification result, further improving the reliability of the classification results.
The method for classifying a URL provided by the present invention applies equally to classifying APP addresses.
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be cross-referenced. The apparatus embodiments correspond substantially to the method embodiments, so they are described briefly; for related details, refer to the description of the method embodiments.
Fig. 3 is a structural diagram of one embodiment of the apparatus for classifying a URL according to the present invention. As shown in Fig. 3, the apparatus includes:
A user characteristic information obtaining module 301, for obtaining the user characteristic information of each user who accesses a URL and the number of times each user accesses the URL, where the user characteristic information includes user tags determined from the user's historical Internet behavior and the weight of each user tag;
A URL characteristic information determining module 302, for determining URL characteristic information from the obtained user characteristic information of each user and the number of times each user accesses the URL, where the URL characteristic information includes the web page types and the weight of each web page type;
A URL classification module 303, for classifying the URL according to the URL characteristic information.
Illustratively, the URL classification module 303 is specifically configured to select the one or more web page types with the largest weights as the web page types of the URL, thereby classifying the URL.
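The way modules 301 to 303 feed one another can be sketched as a small pipeline. This is a structural illustration only: the class name `UrlClassifier`, the callable-based wiring, and the three demo functions are hypothetical, and the demo data reuses the Fig. 2 numbers in place of a real data source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

UserInfo = Tuple[Dict[str, float], int]  # (tag weights, access count) for one user


@dataclass
class UrlClassifier:
    acquire: Callable[[str], List[UserInfo]]                 # role of module 301
    determine: Callable[[List[UserInfo]], Dict[str, float]]  # role of module 302
    classify: Callable[[Dict[str, float]], List[str]]        # role of module 303

    def run(self, url: str) -> List[str]:
        return self.classify(self.determine(self.acquire(url)))


def demo_acquire(url):
    # Stand-in for a lookup in DPI data: returns the Fig. 2 users for any URL.
    return [
        ({"news": 0.6, "shopping": 0.2, "sports": 0.1}, 2),
        ({"shopping": 0.5, "baby and child": 0.3, "video": 0.1}, 8),
    ]


def demo_determine(users):
    # Accumulate k * p_j / P per tag across all users.
    total = sum(visits for _, visits in users)
    info = {}
    for tag_weights, visits in users:
        for tag, k in tag_weights.items():
            info[tag] = info.get(tag, 0.0) + k * visits / total
    return info


def demo_classify(info):
    # Keep only the single type with the largest accumulated coefficient.
    return sorted(info, key=info.get, reverse=True)[:1]


classifier = UrlClassifier(demo_acquire, demo_determine, demo_classify)
print(classifier.run("http://x.x.com"))
```

Passing the three stages in as callables keeps each module replaceable, for instance swapping `demo_classify` for a version that keeps several types.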
By obtaining the user characteristic information of each user who accesses a URL, this embodiment can determine the characteristic information of the URL and hence its web page type, and thus classify the URL. On the one hand, this approach requires no customized algorithm for each website, so classification is efficient. On the other hand, when a website is redesigned and its web page type changes, the characteristic information of the URL can still be obtained from the user characteristic information of the users who access it, so the URL can be reclassified promptly and the URL address library updated automatically.
Fig. 4 is a structural diagram of another embodiment of the apparatus for classifying a URL according to the present invention. As shown in Fig. 4, the URL characteristic information determining module 302 in this embodiment may include:
A user label calculation unit 311, for calculating the label vector $u_j$ of each user $j$ who accesses the URL according to $u_j = (x_{j1} \times k_{j1},\ x_{j2} \times k_{j2},\ \ldots,\ x_{jn} \times k_{jn}) \times p_j / P$, where $j$ is a positive integer, $1 \le j \le S$, $S$ is the total number of users who access the URL, $x_{jn}$ is a user tag of user $j$, $k_{jn}$ is the weight of user tag $x_{jn}$, $jn$ is a positive integer, $p_j$ is the number of times user $j$ accesses the URL, and $P$ is the total number of times all users access the URL;
A URL label calculation unit 321, for accumulating the weights of identical user tags across the label vectors $u_j$ of the users and sorting the user tags by their accumulated coefficients to obtain the label vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$, where [formula not reproduced in this text], $x_t$ is a user tag, and the coefficient $c_t$ of user tag $x_t$ is the sum of the weights of the user tags identical to $x_t$ in the label vectors $u_j$ of the $S$ users;
A URL characteristic information determination unit 331, for selecting from the label vector $y$ of the URL the $m$ user tags $x_1, x_2, \ldots, x_m$ with the largest coefficients as the web page types of the URL, and taking [formula not reproduced in this text] as the weight of web page type $x_i$.
In this embodiment, the label vector of each user is obtained from the user characteristic information and the number of times each user accesses the URL, and the label vector of the URL is obtained from the label vectors of the users, which yields the characteristic information of the URL.
Fig. 5 is a structural diagram of yet another embodiment of the apparatus for classifying a URL according to the present invention. As shown in Fig. 5, to improve the accuracy of classification, the apparatus may further include:
A DPI data analysis module 501, for filtering, from the collected DPI data, URLs whose total number of accesses exceeds a preset threshold as the URL to be classified.
Fig. 6 is a structural diagram of a further embodiment of the apparatus for classifying a URL according to the present invention. As shown in Fig. 6, the apparatus may further include:
A web page content acquisition module 601, for collecting the web page content of the URL and identifying the web page type of the URL from that content with a specific algorithm, thereby classifying the URL;
A comparison module 602, for comparing this classification result with the result of classifying the URL according to the URL characteristic information;
An adjustment module 603, for adjusting the preset threshold according to the comparison result.
By comparing the two classification results, this embodiment can verify the correctness of the method of the present invention for classifying URLs, and the preset threshold can be adjusted promptly according to the verification result, further improving the reliability of the classification results.
Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
The description of the present invention is given for the purposes of illustration and description and is not intended to be exhaustive or to limit the invention to the disclosed form. Many modifications and variations will be obvious to those of ordinary skill in the art. The embodiments were chosen and described to better explain the principles and practical application of the invention, and to enable those of ordinary skill in the art to understand the invention and design various embodiments, with various modifications, suited to particular uses.

Claims (10)

1. the method that a kind of couple of URL classifies characterized by comprising
The user's characteristic information and each user that obtain each user of access URL access the access times of the URL, the user characteristics Information includes the weight of the user tag and each user tag that are determined based on user's history internet behavior;
URL characteristic information is determined according to the access times that the user's characteristic information of each user got and each user access URL, The URL characteristic information includes the type of webpage of URL and the weight of each type of webpage;
Classified according to the URL characteristic information to the URL;
Wherein, the access times that the user's characteristic information for each user that the basis is got and each user access URL determine URL Characteristic information includes:
According to uj=(xj1×kj1, xj2×kj2... xjn×kjn)×pj/ P calculates the label vector for accessing each user j of the URL uj, wherein j is positive integer, and 1≤j≤S, S are the total number of users for accessing the URL, xjnFor the user tag of user j, kjnFor user Label xjnWeight, jn is positive integer, pjThe access times of the URL are accessed for user j, P is that all users access the total of the URL Access times;
By the label vector u of each user jjThe weight of middle same subscriber label is cumulative, obtains the label vector y=(x of the URL1× c1, x2×c2..., xt×ct), whereinxtFor user tag, user tag xtCoefficient ctIt is used for S The label vector u at familyjIn with xtThe sum of the weight of identical user tag;
The maximum preceding m user tag x of coefficient of user tag is selected from the label vector y of URL1, x2... xmAs the URL Type of webpage, and willAs type of webpage xiWeight.
2. The method according to claim 1, characterized in that:
the weights of identical user tags in the label vectors $u_j$ of the users are accumulated, and the user tags are sorted by the size of their accumulated coefficients, to obtain the label vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$.
3. The method according to claim 1 or 2, characterized in that classifying the URL according to the URL characteristic information comprises:
selecting the one or more webpage types with the largest weights among the webpage types as the webpage type of the URL, so as to classify the URL.
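The selection in claim 3 amounts to taking the maximum over the type weights. A minimal sketch, assuming the weights are held in a dictionary; the function name and the `tolerance` parameter (used here to admit more than one type) are hypothetical and not part of the claim.

```python
def classify_url(type_weights, tolerance=0.0):
    """Return the webpage type(s) whose weight is largest.

    `tolerance` is a hypothetical knob: any type whose weight is within
    `tolerance` of the maximum is also returned, so that a URL can be
    assigned one or more types.
    """
    if not type_weights:
        return []
    best = max(type_weights.values())
    return sorted(t for t, w in type_weights.items() if best - w <= tolerance)


single = classify_url({"sports": 0.6, "news": 0.4})
tied = classify_url({"sports": 0.5, "news": 0.5})
loose = classify_url({"sports": 0.6, "news": 0.4, "finance": 0.1}, tolerance=0.25)
```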
4. The method according to claim 1, characterized by further comprising:
selecting, from collected DPI data, a URL whose total number of accesses is greater than a preset threshold as the URL.
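The screening step of claim 4 can be illustrated as a simple aggregation. The record layout `(user_id, url, visits)` and the function name are assumptions made for this sketch; the claim does not specify how the DPI data is structured.

```python
from collections import Counter


def frequent_urls(dpi_records, threshold):
    """Keep URLs whose total access count exceeds the preset threshold.

    `dpi_records` is assumed to be an iterable of (user_id, url, visits)
    tuples; this layout is hypothetical.
    """
    totals = Counter()
    for _user_id, url, visits in dpi_records:
        totals[url] += visits
    return {url for url, total in totals.items() if total > threshold}


records = [
    ("u1", "http://example.com/a", 3),
    ("u2", "http://example.com/a", 4),
    ("u1", "http://example.com/b", 2),
]
selected = frequent_urls(records, threshold=5)
```

Only URLs visited often enough are classified this way, presumably because a label vector built from very few visits would be unreliable.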
5. The method according to claim 4, characterized by further comprising:
collecting the webpage content of the URL, and identifying the webpage type of the URL according to the webpage content of the URL and a specific algorithm, so as to classify the URL;
comparing that classification result with the result of classifying the URL according to the URL characteristic information;
adjusting the size of the preset threshold according to the comparison result.
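The claim does not say how the comparison drives the adjustment, so the following is only one possible reading, offered as a sketch. It assumes that the comparison is summarized as an agreement rate between the two classifications, and that a low rate calls for a higher threshold (more visits per URL) while a high rate allows a lower one. The names `agreement_rate` and `adjust_threshold`, and the `target` and `step` parameters, are all hypothetical.

```python
def agreement_rate(content_based, feature_based):
    """Fraction of shared URLs on which the two classifications agree.

    Both arguments map a URL to its assigned webpage type.
    """
    shared = content_based.keys() & feature_based.keys()
    if not shared:
        return 0.0
    matches = sum(1 for url in shared if content_based[url] == feature_based[url])
    return matches / len(shared)


def adjust_threshold(threshold, rate, target=0.9, step=10):
    """ASSUMED rule: raise the threshold when agreement is below target,
    otherwise lower it, never going below zero."""
    if rate < target:
        return threshold + step
    return max(0, threshold - step)


content_based = {"url_a": "news", "url_b": "sports"}
feature_based = {"url_a": "news", "url_b": "finance"}
rate = agreement_rate(content_based, feature_based)
new_threshold = adjust_threshold(100, rate)
```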
6. A device for classifying a URL, characterized by comprising:
a user characteristic information obtaining module, configured to obtain user characteristic information of each user who accesses the URL and the number of times each user accesses the URL, wherein the user characteristic information includes user tags determined from the user's historical Internet behavior and a weight for each user tag;
a URL characteristic information determining module, configured to determine URL characteristic information according to the obtained user characteristic information of each user and the number of times each user accesses the URL, wherein the URL characteristic information includes webpage types and a weight for each webpage type;
a URL classification module, configured to classify the URL according to the URL characteristic information;
wherein the URL characteristic information determining module comprises:
a user tag computing unit, configured to calculate a label vector $u_j$ for each user $j$ who accesses the URL according to $u_j = (x_{j1} \times k_{j1},\ x_{j2} \times k_{j2},\ \ldots,\ x_{jn} \times k_{jn}) \times p_j / P$, where $j$ is a positive integer, $1 \le j \le S$, $S$ is the total number of users who access the URL, $x_{jn}$ is a user tag of user $j$, $k_{jn}$ is the weight of user tag $x_{jn}$, $jn$ is a positive integer, $p_j$ is the number of times user $j$ accesses the URL, and $P$ is the total number of times all users access the URL;
a URL tag calculation unit, configured to accumulate the weights of identical user tags across the label vectors $u_j$ of the users to obtain the label vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$, where [expression omitted in the source text] $x_t$ is a user tag, and the coefficient $c_t$ of user tag $x_t$ is the sum of the weights of the user tags identical to $x_t$ in the label vectors $u_j$ of the $S$ users;
a URL characteristic information determination unit, configured to select, from the label vector $y$ of the URL, the $m$ user tags $x_1, x_2, \ldots, x_m$ with the largest coefficients as the webpage types of the URL, and to use [expression omitted in the source text] as the weight of webpage type $x_i$.
7. The device according to claim 6, characterized in that:
the URL tag calculation unit is configured to accumulate the weights of identical user tags in the label vectors $u_j$ of the users, and to sort the user tags by the size of their accumulated coefficients, to obtain the label vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$.
8. The device according to claim 6 or 7, characterized in that:
the URL classification module is specifically configured to select the one or more webpage types with the largest weights among the webpage types as the webpage type of the URL, so as to classify the URL.
9. The device according to claim 6, characterized by further comprising:
a DPI data analysis module, configured to select, from collected DPI data, a URL whose total number of accesses is greater than a preset threshold as the URL.
10. The device according to claim 9, characterized by further comprising:
a webpage content acquisition module, configured to collect the webpage content of the URL, and to identify the webpage type of the URL according to the webpage content of the URL and a specific algorithm, so as to classify the URL;
a comparison module, configured to compare that classification result with the result of classifying the URL according to the URL characteristic information;
an adjustment module, configured to adjust the size of the preset threshold according to the comparison result.
CN201510733512.3A 2015-11-03 2015-11-03 The method and apparatus classified to URL Active CN106649384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510733512.3A CN106649384B (en) 2015-11-03 2015-11-03 The method and apparatus classified to URL

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510733512.3A CN106649384B (en) 2015-11-03 2015-11-03 The method and apparatus classified to URL

Publications (2)

Publication Number Publication Date
CN106649384A CN106649384A (en) 2017-05-10
CN106649384B true CN106649384B (en) 2019-07-09

Family

ID=58810876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510733512.3A Active CN106649384B (en) 2015-11-03 2015-11-03 The method and apparatus classified to URL

Country Status (1)

Country Link
CN (1) CN106649384B (en)

Also Published As

Publication number Publication date
CN106649384A (en) 2017-05-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant