CN106649384B - Method and apparatus for classifying URLs - Google Patents
Method and apparatus for classifying URLs
- Publication number
- CN106649384B (application CN201510733512.3A)
- Authority
- CN
- China
- Prior art keywords
- url
- user
- characteristic information
- webpage
- weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and apparatus for classifying URLs, and relates to the fields of big data and Internet technology. The method includes: obtaining the user feature information of each user who accesses a URL and the number of times each user accesses the URL, where the user feature information includes user tags determined from the user's historical Internet behavior and the weight of each user tag; determining URL feature information from the obtained user feature information of each user and the number of times each user accesses the URL, where the URL feature information includes the web page types of the URL and the weight of each web page type; and classifying the URL according to the URL feature information. The invention can improve the efficiency of URL classification.
Description
Technical field
The present invention relates to the fields of big data and Internet technology, and in particular to a method and apparatus for classifying URLs (Uniform Resource Locators).
Background art
At present, analysis of users' Internet behavior based on DPI (Deep Packet Inspection) data is mainly carried out by matching the addresses a user visits against a URL address library and then tagging the user accordingly.
The URL address library is generally built by classifying URLs with web page content extraction and recognition techniques. However, the inventors have found that classifying URLs in this way has the following drawbacks:
First, because a customized algorithm must be designed for each website, classifying URLs involves a heavy workload and is inefficient.
Second, after a website is redesigned, its URLs must be re-classified through manual inspection or re-recognition, so the URL address library cannot be updated automatically.
Summary of the invention
One technical problem to be solved by embodiments of the present invention is the low efficiency of URL classification.
According to one aspect of the present invention, a method for classifying URLs is provided, including: obtaining the user feature information of each user who accesses a URL and the number of times each user accesses the URL, where the user feature information includes user tags determined from the user's historical Internet behavior and the weight of each user tag; determining URL feature information from the obtained user feature information of each user and the number of times each user accesses the URL, where the URL feature information includes the web page types of the URL and the weight of each web page type; and classifying the URL according to the URL feature information.
In one embodiment, determining the URL feature information from the obtained user feature information of each user and the number of times each user accesses the URL includes: computing the tag vector $u_j$ of each user j who accesses the URL according to $u_j = (x_{j1} \times k_{j1},\ x_{j2} \times k_{j2},\ \ldots,\ x_{jn} \times k_{jn}) \times p_j / P$, where j is a positive integer, $1 \le j \le S$, S is the total number of users who access the URL, $x_{jn}$ is a user tag of user j, $k_{jn}$ is the weight of user tag $x_{jn}$, jn is a positive integer, $p_j$ is the number of times user j accesses the URL, and P is the total number of times all users access the URL; accumulating the weights of identical user tags across the tag vectors $u_j$ of the users, and sorting the user tags by their accumulated coefficients, to obtain the tag vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$, where t is at most the total number of user tags of the S users, $x_t$ is a user tag, and the coefficient $c_t$ of user tag $x_t$ is the sum of the weights of the user tags identical to $x_t$ in the tag vectors $u_j$ of the S users; and selecting from the tag vector y of the URL the m user tags $x_1, x_2, \ldots, x_m$ with the largest coefficients as the web page types of the URL, and taking [formula not reproduced] as the weight of web page type $x_i$.
In one embodiment, classifying the URL according to the URL feature information includes: selecting the one or more web page types with the largest weights as the web page type(s) of the URL, thereby classifying the URL.
In one embodiment, the method further includes: selecting, from the collected DPI data, URLs whose total number of accesses exceeds a preset threshold as the URLs to be classified.
In one embodiment, the method further includes: collecting the web page content of the URL, and identifying the web page type of the URL from that content using a specific algorithm, thereby classifying the URL; comparing this classification result with the result of classifying the URL according to the URL feature information; and adjusting the preset threshold according to the comparison result.
According to another aspect of the present invention, an apparatus for classifying URLs is provided, including: a user feature information obtaining module, configured to obtain the user feature information of each user who accesses a URL and the number of times each user accesses the URL, where the user feature information includes user tags determined from the user's historical Internet behavior and the weight of each user tag; a URL feature information determining module, configured to determine URL feature information from the obtained user feature information of each user and the number of times each user accesses the URL, where the URL feature information includes web page types and the weight of each web page type; and a URL classification module, configured to classify the URL according to the URL feature information.
In one embodiment, the URL feature information determining module includes: a user tag computing unit, configured to compute the tag vector $u_j$ of each user j who accesses the URL according to $u_j = (x_{j1} \times k_{j1},\ x_{j2} \times k_{j2},\ \ldots,\ x_{jn} \times k_{jn}) \times p_j / P$, where j is a positive integer, $1 \le j \le S$, S is the total number of users who access the URL, $x_{jn}$ is a user tag of user j, $k_{jn}$ is the weight of user tag $x_{jn}$, jn is a positive integer, $p_j$ is the number of times user j accesses the URL, and P is the total number of times all users access the URL; a URL tag computing unit, configured to accumulate the weights of identical user tags across the tag vectors $u_j$ of the users, and to sort the user tags by their accumulated coefficients, to obtain the tag vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$, where t is at most the total number of user tags of the S users, $x_t$ is a user tag, and the coefficient $c_t$ of user tag $x_t$ is the sum of the weights of the user tags identical to $x_t$ in the tag vectors $u_j$ of the S users; and a URL feature information determining unit, configured to select from the tag vector y of the URL the m user tags $x_1, x_2, \ldots, x_m$ with the largest coefficients as the web page types of the URL, and to take [formula not reproduced] as the weight of web page type $x_i$.
In one embodiment, the URL classification module is specifically configured to select the one or more web page types with the largest weights as the web page type(s) of the URL, thereby classifying the URL.
In one embodiment, the apparatus further includes: a DPI data analysis module, configured to select, from the collected DPI data, URLs whose total number of accesses exceeds a preset threshold as the URLs to be classified.
In one embodiment, the apparatus further includes: a web page content collection module, configured to collect the web page content of the URL and to identify the web page type of the URL from that content using a specific algorithm, thereby classifying the URL; a comparison module, configured to compare this classification result with the result of classifying the URL according to the URL feature information; and an adjustment module, configured to adjust the preset threshold according to the comparison result.
By obtaining the user feature information of each user who accesses a URL, the present invention can determine the feature information of the URL, and hence the web page type of the URL, so as to classify the URL. On the one hand, this approach requires no customized algorithm for each website, so classification is efficient. On the other hand, when a website is redesigned and its web page type changes, the feature information of the URL can still be obtained from the user feature information of the users who access it, so the URL can be re-classified promptly and the URL address library updated automatically.
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of one embodiment of the method for classifying URLs according to the present invention;
Fig. 2 is a schematic diagram of one example of the method for classifying URLs according to the present invention;
Fig. 3 is a structural diagram of one embodiment of the apparatus for classifying URLs according to the present invention;
Fig. 4 is a structural diagram of another embodiment of the apparatus for classifying URLs according to the present invention;
Fig. 5 is a structural diagram of yet another embodiment of the apparatus for classifying URLs according to the present invention;
Fig. 6 is a structural diagram of a further embodiment of the apparatus for classifying URLs according to the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the invention.
It should also be understood that, for ease of description, the parts shown in the drawings are not drawn to actual scale.
Techniques, methods, and devices known to a person of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be regarded as part of the granted specification.
In all examples shown and discussed here, any specific value should be interpreted as merely illustrative, not limiting. Other examples of the exemplary embodiments may therefore have different values.
Note also that similar reference numerals and letters denote similar items in the following drawings, so once an item is defined in one drawing it need not be discussed further in subsequent drawings.
The inventors have found that, in big data statistics, when a URL is accessed by a large number of users, the content of the page reflects the common needs of those users rather than the specific needs of any single user. It is therefore proposed to label the feature information of a URL in reverse, from the user feature information of the users who access it. The present invention can be used for telecom DPI user behavior analysis: it can quickly classify heavily accessed URLs and identify the categories of newly added URLs, and, on top of existing manual review and URL feature recognition based on web page analysis, it can further improve the quality and quantity of URL classification.
Fig. 1 is a flowchart of one embodiment of the method for classifying URLs according to the present invention. As shown in Fig. 1, the method includes:
Step 102: obtain the user feature information of each user who accesses a URL and the number of times each user accesses the URL, where the user feature information includes user tags determined from the user's historical Internet behavior and the weight of each user tag.
The user feature information of each user can be obtained from the user's historical Internet behavior. For example, if a user frequently visits finance websites and sports websites, the user can be given two user tags, one for finance websites and one for sports websites. The weights of the two user tags can be derived from the number of times the user visits each kind of website, which yields the user feature information. For example, the user feature information may consist of the user tags "finance website" and "sports website", with a weight of 20% for finance websites and 80% for sports websites.
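As a minimal sketch of how such tag weights might be derived, the snippet below assumes each weight is simply the tag's share of the user's visits. The source only says the weights come from the visit counts, so this normalization, the function name `tag_weights`, and the input format are all assumptions made for illustration.

```python
from collections import Counter


def tag_weights(visited_categories):
    """Return {tag: weight} for one user.

    visited_categories: iterable with one category label per visit in the
    user's browsing history (hypothetical input format).
    Assumption: a tag's weight is its share of all visits.
    """
    counts = Counter(visited_categories)
    total = sum(counts.values())
    if total == 0:
        return {}  # no history, so no tags
    return {tag: n / total for tag, n in counts.items()}


# A user with 2 visits to finance sites and 8 visits to sports sites.
history = ["finance website"] * 2 + ["sports website"] * 8
weights = tag_weights(history)
print(weights)  # {'finance website': 0.2, 'sports website': 0.8}
```

With this choice the weights of one user always sum to 1; the worked example later in the description uses weights that sum to less than 1, so other weighting schemes are clearly possible.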
In addition, the number of user tags in the user feature information can be adjusted, for example reduced, in order to adjust the number of web page types in the final URL feature information.
Step 104: determine URL feature information from the obtained user feature information of each user and the number of times each user accesses the URL, where the URL feature information includes the web page types of the URL and the weight of each web page type.
The user feature information of the users reflects the URL feature information; an illustrative detailed description is given later.
Step 106: classify the URL according to the URL feature information.
Once the feature information of the URL is obtained, the web page types of the URL and the weight of each web page type are known. In one embodiment, the one or more web page types with the largest weights can be selected as the web page type(s) of the URL, thereby classifying the URL.
By obtaining the user feature information of each user who accesses a URL, this embodiment can determine the feature information of the URL, and hence the web page type of the URL, so as to classify the URL. On the one hand, this approach requires no customized algorithm for each website, so classification is efficient. On the other hand, when a website is redesigned and its web page type changes, the feature information of the URL can still be obtained from the user feature information of the users who access it, so the URL can be re-classified promptly and the URL address library updated automatically.
As a specific embodiment, step 104 shown in Fig. 1 can be implemented as follows:
First, compute the tag vector $u_j$ of each user j who accesses the URL according to $u_j = (x_{j1} \times k_{j1},\ x_{j2} \times k_{j2},\ \ldots,\ x_{jn} \times k_{jn}) \times p_j / P$, where j is a positive integer, $1 \le j \le S$, S is the total number of users who access the URL, $x_{jn}$ is a user tag of user j, $k_{jn}$ is the weight of user tag $x_{jn}$, jn is a positive integer, $p_j$ is the number of times user j accesses the URL, and P is the total number of times all users access the URL.
Then, accumulate the weights of identical user tags across the tag vectors $u_j$ of the users, and sort the user tags by their accumulated coefficients, for example in ascending or descending order, to obtain the tag vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$, where t is at most the total number of user tags of the S users; if no two users share a user tag, t equals that total. Here $x_t$ is a user tag, and the coefficient $c_t$ of user tag $x_t$ is the sum of the weights of the user tags identical to $x_t$ in the tag vectors $u_j$ of the S users. Specifically, when $x_t = x_{jh}$, $c_t$ can be expressed as:
$c_t = \sum_{j=1}^{S} k_{jh} \times p_j / P$
where the sum runs over the users j whose tag vector contains a tag $x_{jh} = x_t$, $k_{jh} \in (k_{j1}, k_{j2}, \ldots, k_{jn})$, and $x_{jh} \in (x_{j1}, x_{j2}, \ldots, x_{jn})$.
Finally, select from the tag vector y of the URL the m user tags $x_1, x_2, \ldots, x_m$ with the largest coefficients as the web page types of the URL, and take [formula not reproduced] as the weight of web page type $x_i$, so that each of the web page types $x_1, x_2, \ldots, x_m$ has its own weight.
In this embodiment, the tag vector of each user is obtained from the user feature information and the number of times each user accesses the URL, and the tag vector of the URL is obtained from the users' tag vectors, which yields the feature information of the URL.
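The three steps of this embodiment can be sketched in a few lines of Python. This is an illustrative reading, not the patented implementation: the function name `url_feature_info` and the input format are hypothetical, and because the weight formula for the selected web page types is not reproduced in this text, the sketch simply returns the accumulated coefficients $c_t$ as the weights.

```python
from collections import defaultdict


def url_feature_info(users, m):
    """Return the m tags with the largest coefficients for one URL.

    users: list of (tag_weights, visits) pairs, one per user j, where
           tag_weights maps tag x_jn -> weight k_jn and visits is p_j
           (hypothetical input format).
    Assumption: the returned weights are the raw coefficients c_t.
    """
    total_visits = sum(visits for _, visits in users)  # P
    if total_visits == 0:
        return {}
    coeff = defaultdict(float)  # c_t for each distinct tag x_t
    for tag_weights, visits in users:
        share = visits / total_visits  # p_j / P
        for tag, weight in tag_weights.items():
            coeff[tag] += weight * share  # accumulate k_jh * p_j / P
    ranked = sorted(coeff.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:m])


# Two hypothetical users: one visit and three visits to the same URL.
users = [
    ({"news": 0.5, "sports": 0.5}, 1),
    ({"sports": 1.0}, 3),
]
info = url_feature_info(users, m=1)
print(info)  # {'sports': 0.875}
```

Weighting each user by $p_j / P$ means that a user responsible for most of the traffic to a URL dominates its tag vector, which is why the description recommends the method for URLs with many accesses.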
It should be understood that, although the above embodiment implements step 104 of Fig. 1 using tag vectors, this is not limiting; a person skilled in the art may determine the URL feature information from the user feature information of each user and the number of times each user accesses the URL in other ways.
The method for classifying URLs according to the present invention is described in detail below using an example, with reference to Fig. 2:
As shown in Fig. 2, the total number of accesses to the URL http://x.x.com is P = 10, of which user A accounts for $p_1 = 2$ accesses and user B for $p_2 = 8$ accesses.
The user feature information of user A is: news, weight 0.6; shopping, weight 0.2; sports, weight 0.1.
The tag vector of user A is $u_1 = (x_{11} \times k_{11},\ x_{12} \times k_{12},\ \ldots,\ x_{1n} \times k_{1n}) \times p_1 / P$ = (news × 0.6, shopping × 0.2, sports × 0.1) × 2/10 = (news × 0.12, shopping × 0.04, sports × 0.02).
The user feature information of user B is: shopping, weight 0.5; mother and baby, weight 0.3; video, weight 0.1.
The tag vector of user B is $u_2 = (x_{21} \times k_{21},\ x_{22} \times k_{22},\ \ldots,\ x_{2n} \times k_{2n}) \times p_2 / P$ = (shopping × 0.5, mother and baby × 0.3, video × 0.1) × 8/10 = (shopping × 0.4, mother and baby × 0.24, video × 0.08).
Adding the weights of identical user tags in the tag vector $u_1$ of user A and the tag vector $u_2$ of user B (that is, adding the shopping weights 0.04 + 0.4) gives the tag vector of the URL http://x.x.com: $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$ = (news × 0.12, shopping × 0.44, sports × 0.02, mother and baby × 0.24, video × 0.08).
Select the two tags with the largest coefficients, shopping and mother and baby, as the web page types of the URL, or select only the largest one, shopping, as the web page type of the URL, thereby classifying the URL.
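The arithmetic of the Fig. 2 example can be checked with a short script. The numbers come from the example above; the variable names and the use of `collections.Counter` are choices made for this sketch only.

```python
from collections import Counter

P = 10  # total accesses to http://x.x.com

# (tag weights k, access count p) for each user, as given in the example.
users = {
    "A": ({"news": 0.6, "shopping": 0.2, "sports": 0.1}, 2),
    "B": ({"shopping": 0.5, "mother and baby": 0.3, "video": 0.1}, 8),
}

# Per-user tag vectors u_j: every weight is scaled by p_j / P.
u = {
    name: {tag: k * p / P for tag, k in tags.items()}
    for name, (tags, p) in users.items()
}

# URL tag vector y: weights of identical tags are added together.
y = Counter()
for vector in u.values():
    y.update(vector)

ranked = y.most_common()  # sorted by coefficient, largest first
for tag, c in ranked:
    print(f"{tag}: {c:.2f}")
# shopping: 0.44
# mother and baby: 0.24
# news: 0.12
# video: 0.08
# sports: 0.02

top_two = [tag for tag, _ in ranked[:2]]
print(top_two)  # ['shopping', 'mother and baby']
```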
It should be understood that Fig. 2 schematically shows an example in which two users access the URL. In practice, the method provided by the present invention is particularly suitable for URLs with many accesses. In one embodiment, URLs whose total number of accesses exceeds a preset threshold can be selected from the collected DPI data as the URLs to be classified, in order to increase classification accuracy. For example, the number of accesses to each URL in the DPI data over a given period is counted and sorted, and the URLs whose total number of accesses exceeds the preset threshold are selected as the URLs to be classified.
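The screening step might look like the following sketch. The record format (one `(user_id, url)` pair per access), the function name `frequent_urls`, and the second URL in the sample data are hypothetical; real DPI records carry many more fields.

```python
from collections import Counter


def frequent_urls(dpi_records, threshold):
    """Return URLs whose total access count exceeds the threshold.

    dpi_records: iterable of (user_id, url) pairs, one per access
                 (hypothetical record format).
    The result is ordered by access count, largest first.
    """
    totals = Counter(url for _user, url in dpi_records)
    return [url for url, n in totals.most_common() if n > threshold]


records = [
    ("A", "http://x.x.com"),
    ("B", "http://x.x.com"),
    ("A", "http://y.y.com"),
]
print(frequent_urls(records, threshold=1))  # ['http://x.x.com']
```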
In addition, in one embodiment, to verify the correctness of the classification results, the method for classifying URLs may further include the following steps:
Step S1: collect the web page content of the URL, and classify the URL according to that content using a specific algorithm.
For example, the web page content of the URL is collected through manual review or web crawling, and the web page type of the URL is identified from that content by a text mining algorithm, thereby classifying the URL. Here, the text mining algorithm needs to be adjusted for different URLs.
Step S2: compare the classification result obtained in step S1 with the result of classifying the URL according to the URL feature information.
Step S3: adjust the preset threshold according to the comparison result.
If the two results are inconsistent, the preset threshold can be raised so that classification according to the URL feature information becomes more accurate. If the two results are consistent, the preset threshold need not be adjusted.
By comparing the two classification results, this embodiment can verify the correctness of the method for classifying URLs according to the present invention, and can adjust the preset threshold promptly according to the verification result, further improving the reliability of the classification results.
The method for classifying URLs provided by the present invention is equally applicable to classifying APP addresses.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and the embodiments can be cross-referenced for their identical or similar parts. Because the apparatus embodiments correspond closely to the method embodiments, they are described more briefly; for related details, refer to the description of the method embodiments.
Fig. 3 is a structural diagram of one embodiment of the apparatus for classifying URLs according to the present invention. As shown in Fig. 3, the apparatus includes:
a user feature information obtaining module 301, configured to obtain the user feature information of each user who accesses a URL and the number of times each user accesses the URL, where the user feature information includes user tags determined from the user's historical Internet behavior and the weight of each user tag;
a URL feature information determining module 302, configured to determine URL feature information from the obtained user feature information of each user and the number of times each user accesses the URL, where the URL feature information includes web page types and the weight of each web page type;
a URL classification module 303, configured to classify the URL according to the URL feature information.
For example, the URL classification module 303 is specifically configured to select the one or more web page types with the largest weights as the web page type(s) of the URL, thereby classifying the URL.
By obtaining the user feature information of each user who accesses a URL, this embodiment can determine the feature information of the URL, and hence the web page type of the URL, so as to classify the URL. On the one hand, this approach requires no customized algorithm for each website, so classification is efficient. On the other hand, when a website is redesigned and its web page type changes, the feature information of the URL can still be obtained from the user feature information of the users who access it, so the URL can be re-classified promptly and the URL address library updated automatically.
Fig. 4 is a structural diagram of another embodiment of the apparatus for classifying URLs according to the present invention. As shown in Fig. 4, the URL feature information determining module 302 in this embodiment may include:
a user tag computing unit 311, configured to compute the tag vector $u_j$ of each user j who accesses the URL according to $u_j = (x_{j1} \times k_{j1},\ x_{j2} \times k_{j2},\ \ldots,\ x_{jn} \times k_{jn}) \times p_j / P$, where j is a positive integer, $1 \le j \le S$, S is the total number of users who access the URL, $x_{jn}$ is a user tag of user j, $k_{jn}$ is the weight of user tag $x_{jn}$, jn is a positive integer, $p_j$ is the number of times user j accesses the URL, and P is the total number of times all users access the URL;
a URL tag computing unit 321, configured to accumulate the weights of identical user tags across the tag vectors $u_j$ of the users, and to sort the user tags by their accumulated coefficients, to obtain the tag vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$, where t is at most the total number of user tags of the S users, $x_t$ is a user tag, and the coefficient $c_t$ of user tag $x_t$ is the sum of the weights of the user tags identical to $x_t$ in the tag vectors $u_j$ of the S users;
a URL feature information determining unit 331, configured to select from the tag vector y of the URL the m user tags $x_1, x_2, \ldots, x_m$ with the largest coefficients as the web page types of the URL, and to take [formula not reproduced] as the weight of web page type $x_i$.
In this embodiment, the tag vector of each user is obtained from the user feature information and the number of times each user accesses the URL, and the tag vector of the URL is obtained from the users' tag vectors, which yields the feature information of the URL.
Fig. 5 is a structural diagram of yet another embodiment of the apparatus for classifying URLs according to the present invention. As shown in Fig. 5, to improve classification accuracy, the apparatus may further include:
a DPI data analysis module 501, configured to select, from the collected DPI data, URLs whose total number of accesses exceeds a preset threshold as the URLs to be classified.
Fig. 6 is a structural diagram of a further embodiment of the apparatus for classifying URLs according to the present invention. As shown in Fig. 6, the apparatus may further include:
a web page content collection module 601, configured to collect the web page content of the URL and to identify the web page type of the URL from that content using a specific algorithm, thereby classifying the URL;
a comparison module 602, configured to compare this classification result with the result of classifying the URL according to the URL feature information;
an adjustment module 603, configured to adjust the preset threshold according to the comparison result.
By comparing the two classification results, this embodiment can verify the correctness of the method for classifying URLs according to the present invention, and can adjust the preset threshold promptly according to the verification result, further improving the reliability of the classification results.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
The description of the present invention is given for purposes of illustration and description, and is not intended to be exhaustive or to limit the invention to the disclosed form. Many modifications and variations will be apparent to a person of ordinary skill in the art. The embodiments were chosen and described to better explain the principles and practical application of the invention, and to enable a person of ordinary skill in the art to understand the invention and design various embodiments, with various modifications, suited to particular uses.
Claims (10)
1. A method for classifying URLs, characterized by comprising:
obtaining the user feature information of each user who accesses a URL and the number of times each user accesses the URL, where the user feature information includes user tags determined from the user's historical Internet behavior and the weight of each user tag;
determining URL feature information from the obtained user feature information of each user and the number of times each user accesses the URL, where the URL feature information includes the web page types of the URL and the weight of each web page type;
classifying the URL according to the URL feature information;
wherein determining the URL feature information from the obtained user feature information of each user and the number of times each user accesses the URL includes:
computing the tag vector $u_j$ of each user j who accesses the URL according to $u_j = (x_{j1} \times k_{j1},\ x_{j2} \times k_{j2},\ \ldots,\ x_{jn} \times k_{jn}) \times p_j / P$, where j is a positive integer, $1 \le j \le S$, S is the total number of users who access the URL, $x_{jn}$ is a user tag of user j, $k_{jn}$ is the weight of user tag $x_{jn}$, jn is a positive integer, $p_j$ is the number of times user j accesses the URL, and P is the total number of times all users access the URL;
accumulating the weights of identical user tags across the tag vectors $u_j$ of the users to obtain the tag vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$, where t is at most the total number of user tags of the S users, $x_t$ is a user tag, and the coefficient $c_t$ of user tag $x_t$ is the sum of the weights of the user tags identical to $x_t$ in the tag vectors $u_j$ of the S users;
selecting from the tag vector y of the URL the m user tags $x_1, x_2, \ldots, x_m$ with the largest coefficients as the web page types of the URL, and taking [formula not reproduced] as the weight of web page type $x_i$.
2. The method according to claim 1, characterized in that
the weights of identical user tags in the tag vectors $u_j$ of the users are accumulated, and the user tags are sorted by their accumulated coefficients, to obtain the tag vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$.
3. The method according to claim 1 or 2, characterized in that classifying the URL according to the URL feature information includes:
selecting the one or more web page types with the largest weights as the web page type(s) of the URL, thereby classifying the URL.
4. The method according to claim 1, characterized by further comprising:
selecting, from the collected DPI data, URLs whose total number of accesses exceeds a preset threshold as the URLs to be classified.
5. The method according to claim 4, further comprising:
acquiring the web page content of the URL, and identifying the web page type of the URL according to the web page content of the URL and a specific algorithm, so as to classify the URL;
comparing this classification result with the classification result obtained by classifying the URL according to the URL characteristic information;
adjusting the size of the preset threshold according to the comparison result.
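The feedback loop in claim 5 can be sketched as below. The claim does not state how the comparison is scored or in which direction the threshold moves, so everything here is an assumption: agreement is measured as the fraction of URLs on which the two classifications match, the threshold is raised when agreement falls below a hypothetical `target`, and lowered otherwise. The names `target` and `step` and their defaults are illustrative only.

```python
def agreement_rate(content_labels, feature_labels):
    """Fraction of URLs, present in both results, that received the same type."""
    shared = content_labels.keys() & feature_labels.keys()
    if not shared:
        return 0.0
    matches = sum(1 for url in shared if content_labels[url] == feature_labels[url])
    return matches / len(shared)


def adjust_threshold(threshold, rate, target=0.8, step=10):
    """Raise the access-count threshold on low agreement, otherwise lower it.

    The adjustment rule is an assumed policy, not taken from the claim.
    """
    if rate < target:
        return threshold + step
    return max(0, threshold - step)


# Hypothetical classification results keyed by URL.
by_content = {"a.example/x": "sports", "b.example/y": "news"}
by_features = {"a.example/x": "sports", "b.example/y": "shopping"}
rate = agreement_rate(by_content, by_features)
new_threshold = adjust_threshold(100, rate)
```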
6. A device for classifying a URL, comprising:
a user characteristic information obtaining module, configured to obtain the user characteristic information of each user who accessed the URL and the number of times each user accessed the URL, the user characteristic information including user tags determined based on the user's historical internet behavior and the weight of each user tag;
a URL characteristic information determining module, configured to determine URL characteristic information according to the obtained user characteristic information of each user and the number of times each user accessed the URL, the URL characteristic information including web page types and the weight of each web page type;
a URL classification module, configured to classify the URL according to the URL characteristic information;
wherein the URL characteristic information determining module comprises:
a user tag calculation unit, configured to calculate the label vector $u_j$ of each user $j$ who accessed the URL according to $u_j = (x_{j1} \times k_{j1},\ x_{j2} \times k_{j2},\ \ldots,\ x_{jn} \times k_{jn}) \times p_j / P$, where $j$ is a positive integer, $1 \le j \le S$, $S$ is the total number of users who accessed the URL, $x_{jn}$ is a user tag of user $j$, $k_{jn}$ is the weight of user tag $x_{jn}$, $jn$ is a positive integer, $p_j$ is the number of times user $j$ accessed the URL, and $P$ is the total number of times all users accessed the URL;
a URL tag calculation unit, configured to accumulate the weights of identical user tags across the label vectors $u_j$ of the users to obtain the label vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$, where [formula missing in source], $x_t$ is a user tag, and the coefficient $c_t$ of user tag $x_t$ is the sum of the weights of the user tags identical to $x_t$ in the label vectors $u_j$ of the $S$ users;
a URL characteristic information determination unit, configured to select, from the label vector $y$ of the URL, the first $m$ user tags $x_1, x_2, \ldots, x_m$ with the largest coefficients as the web page types of the URL, and to take [formula missing in source] as the weight of web page type $x_i$.
7. The device according to claim 6, wherein
the URL tag calculation unit is configured to accumulate the weights of identical user tags in the label vectors $u_j$ of the users, and to sort the user tags by the size of their accumulated coefficients, to obtain the label vector of the URL, $y = (x_1 \times c_1,\ x_2 \times c_2,\ \ldots,\ x_t \times c_t)$.
8. The device according to claim 6 or 7, wherein
the URL classification module is specifically configured to select the one or more web page types with the largest weights among the web page types as the web page type of the URL, so as to classify the URL.
9. The device according to claim 6, further comprising:
a DPI data analysis module, configured to filter out, from the acquired DPI data, a URL whose total number of accesses is greater than a preset threshold as the URL.
10. The device according to claim 9, further comprising:
a web page content acquisition module, configured to acquire the web page content of the URL, and to identify the web page type of the URL according to the web page content of the URL and a specific algorithm, so as to classify the URL;
a comparison module, configured to compare this classification result with the classification result obtained by classifying the URL according to the URL characteristic information;
an adjustment module, configured to adjust the size of the preset threshold according to the comparison result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510733512.3A CN106649384B (en) | 2015-11-03 | 2015-11-03 | The method and apparatus classified to URL |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106649384A (en) | 2017-05-10 |
CN106649384B (en) | 2019-07-09 |
Family ID: 58810876
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510733512.3A Active CN106649384B (en) | 2015-11-03 | 2015-11-03 | The method and apparatus classified to URL |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649384B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107992416B (en) * | 2017-11-28 | 2021-02-23 | 中国联合网络通信集团有限公司 | Method and device for determining webpage time delay |
CN111325495B (en) * | 2018-12-17 | 2023-12-01 | 顺丰科技有限公司 | Abnormal part classification method and system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101038596A (en) * | 2007-04-29 | 2007-09-19 | 北京搜狗科技发展有限公司 | Method and system for classifying website |
CN102567494A (en) * | 2011-12-22 | 2012-07-11 | 北京亿赞普网络技术有限公司 | Website classification method and device |
WO2014182748A1 (en) * | 2013-05-08 | 2014-11-13 | Microsoft Corporation | Cross-lingual automatic query annotation |
CN104408175A (en) * | 2014-12-12 | 2015-03-11 | 北京奇虎科技有限公司 | Method and device for identifying page type |
CN104424308A (en) * | 2013-09-04 | 2015-03-18 | 中兴通讯股份有限公司 | Web page classification standard acquisition method and device and web page classification method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130066814A1 (en) * | 2011-09-12 | 2013-03-14 | Volker Bosch | System and Method for Automated Classification of Web pages and Domains |
CN104391860B (en) * | 2014-10-22 | 2018-03-02 | 安一恒通(北京)科技有限公司 | content type detection method and device |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101038596A (en) * | 2007-04-29 | 2007-09-19 | 北京搜狗科技发展有限公司 | Method and system for classifying website |
CN102567494A (en) * | 2011-12-22 | 2012-07-11 | 北京亿赞普网络技术有限公司 | Website classification method and device |
WO2014182748A1 (en) * | 2013-05-08 | 2014-11-13 | Microsoft Corporation | Cross-lingual automatic query annotation |
CN104424308A (en) * | 2013-09-04 | 2015-03-18 | 中兴通讯股份有限公司 | Web page classification standard acquisition method and device and web page classification method and device |
CN104408175A (en) * | 2014-12-12 | 2015-03-11 | 北京奇虎科技有限公司 | Method and device for identifying page type |
Also Published As
Publication number | Publication date |
---|---|
CN106649384A (en) | 2017-05-10 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
KR101770683B1 (en) | Method, apparatus, server, program and computer-readable recording medium of dispalying social network information flow | |
CN104424296B (en) | Query word sorting technique and device | |
CN102929939B (en) | The offer method and device of customized information | |
US11907644B2 (en) | Detecting compatible layouts for content-based native ads | |
CN104133870B (en) | A kind of webpage similarity calculating method and device | |
CN105488023B (en) | A kind of text similarity appraisal procedure and device | |
WO2020224128A1 (en) | News recommendation method and apparatus based on short-term interest of user, and electronic device and medium | |
CN103198069A (en) | Method and device for extracting relational table | |
CN109325179A (en) | A kind of method and device that content is promoted | |
US10073918B2 (en) | Classifying URLs | |
US8990684B2 (en) | System and method for recommending fonts | |
CN104699837B (en) | Method, device and server for selecting illustrated pictures of web pages | |
CN107402932A (en) | Extension processing method, the text of user tag recommend method and apparatus | |
CN106649384B (en) | The method and apparatus classified to URL | |
CN105117434A (en) | Webpage classification method and webpage classification system | |
CN104123321B (en) | A kind of determining method and device for recommending picture | |
CN106445907A (en) | Domain lexicon generation method and apparatus | |
CN106776910A (en) | The display methods and device of a kind of Search Results | |
CN110134812A (en) | A kind of face searching method and its device | |
CN103744920A (en) | Commodity attribute name-value pair extraction method and system | |
CN108319606A (en) | The construction method and device of specialized database | |
CN103514237B (en) | A kind of method and system obtaining user and Document personalization feature | |
CN109460555A (en) | Official document determination method, device and electronic equipment | |
US10606875B2 (en) | Search support apparatus and method | |
CN104778251B (en) | A kind of acquisition methods and device of document temperature |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||