CN108287831A - A kind of URL classification method and system, data processing method and system - Google Patents

Publication number
CN108287831A
Authority
CN
China
Prior art keywords
url
sorted
field
mark data
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710012795.1A
Other languages
Chinese (zh)
Other versions
CN108287831B (en)
Inventor
郭家龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201710012795.1A
Publication of CN108287831A
Application granted
Publication of CN108287831B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Abstract

This application provides a URL classification method and system and a data processing method and system. The URL classification method includes: determining whether a uniform resource locator (URL) to be classified contains a query parameter name field; if there is no query parameter name field, using the path and file name in the URL to be classified as the identification data of that URL; if there is a query parameter name field, using the query parameter names and file name in the URL to be classified as the identification data of that URL; and classifying the URL to be classified according to the identification data. The technical solution provided by the embodiments of this application addresses the technical problem in the prior art that analyzing URLs involves too much repeated work and has low processing efficiency, and achieves the technical effect of improving URL processing efficiency.

Description

URL classification method and system, data processing method and system
Technical field
This application belongs to the technical field of data processing, and in particular relates to a URL classification method and system and a data processing method and system.
Background
With the continuing development of network technology, people use the Internet more and more, and processing Internet data has become correspondingly more cumbersome. Network traffic, for example, includes both normal traffic (such as ordinary user access) and abnormal traffic (such as illegal logins and failed requests).
Analyzing and processing network traffic plays an important role in keeping the Internet secure and running in an orderly way. A uniform resource locator (URL) is a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the address of a standard resource on the Internet. Every file on the Internet has a unique URL, and the information it contains indicates where the file is located and how a browser should handle it.
A great deal of network information can therefore be learned by analyzing URLs. For example, analyzing and organizing URLs can reveal which URLs are dangerous and which are safe. Examining the information carried in URLs can also reveal a website's traffic, browsing behavior, and so on.
However, existing approaches to analyzing and organizing URLs usually traverse them one at a time. That is, every URL is analyzed and processed individually, which greatly increases the workload of the analysis and reduces its efficiency.
No effective solution to this problem has yet been proposed.
Summary of the invention
The purpose of this application is to provide a URL classification method and system and a data processing method and system that enable efficient processing of URLs.
The URL classification method and system and the data processing method and system provided by this application are implemented as follows:
A URL classification method, the method including:
determining whether a uniform resource locator (URL) to be classified contains a query parameter name field;
if there is no query parameter name field, using the path and file name in the URL to be classified as the identification data of the URL to be classified;
if there is a query parameter name field, using the query parameter names and file name in the URL to be classified as the identification data of the URL to be classified;
classifying the URL to be classified according to the identification data.
A URL classification method, the method including:
extracting fields from a URL to be classified according to a preset field extraction rule;
using the extracted fields as the identification data of the URL to be classified, where the identification data characterizes the processing logic of the URL to be classified;
classifying the URL to be classified according to the identification data.
A data processing method, the method including:
dividing the URLs in a website traffic log to be audited into multiple categories, where the URLs in the same category correspond to the same set of processing logic;
for the multiple URLs in the same category, extracting only one for analysis.
A URL classification system, the system including:
a determining module, configured to determine whether a URL to be classified contains a query parameter name field;
a first generation module, configured to use the path and file name in the URL to be classified as the identification data of the URL to be classified when it is determined that there is no query parameter name field;
a second generation module, configured to use the query parameter names and file name in the URL to be classified as the identification data of the URL to be classified when it is determined that there is a query parameter name field;
a division module, configured to classify the URL to be classified according to the identification data.
A URL classification system, the system including:
an extraction module, configured to extract fields from a URL to be classified according to a preset field extraction rule;
a generation module, configured to use the extracted fields as the identification data of the URL to be classified, where the identification data characterizes the processing logic of the URL to be classified;
a division module, configured to classify the URL to be classified according to the identification data.
A data processing system, the system including:
a division module, configured to divide the URLs in a website traffic log to be audited into multiple categories, where the URLs in the same category correspond to the same set of processing logic;
a processing module, configured to extract only one of the multiple URLs in the same category for analysis.
The URL classification method and system and the data processing method and system provided by this application extract from each URL the identification data that characterizes its processing logic and divide URLs into categories according to that identification data, so that the URLs in one category share the same processing logic. This improves the efficiency of URL classification, and the classification also reduces repeated work when the URLs are subsequently processed together. It thereby addresses the technical problem in the prior art that analyzing URLs involves too much repeated work and has low processing efficiency, and achieves the technical effect of improving URL processing efficiency.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of one embodiment of the URL classification method provided by this application;
Fig. 2 is a flowchart of another embodiment of the URL classification method provided by this application;
Fig. 3 is a schematic diagram of URL feature extraction provided by this application;
Fig. 4 is a schematic diagram of the principle of network data security analysis provided by this application;
Fig. 5 is a schematic diagram of the hardware structure of the URL classification device provided by this application;
Fig. 6 is a schematic structural diagram of the URL classification apparatus provided by this application;
Fig. 7 is a schematic diagram of an application scenario of the data processing system provided by this application.
Detailed description
To help those skilled in the art better understand the technical solutions in this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this application without creative effort fall within the scope of protection of this application.
The inventors observed that the web pages identified by existing URLs are essentially text and are highly similar to one another, so many different URLs often use the same set of processing logic. For example, URLs for the same topic on the same website often correspond to the same processing logic. If each of these URLs is analyzed separately, much work is repeated and resources are wasted. Classifying URLs is therefore necessary.
To this end, an embodiment of the present invention provides a URL classification method which, as shown in Fig. 1, may include the following steps:
Step 101: Extract the identification data of the URL to be classified, where the identification data characterizes the processing logic of the URL to be classified;
The URL to be classified may be a URL extracted from real-time network traffic or a URL extracted from a network traffic log. Because URLs are generated according to a predetermined format, the URL to be classified may be a single URL or a set of multiple URLs.
A URL may include the following components, and the naming or generation rule of each component is fixed. For example, a URL is generated according to the following standard: protocol://domain name:port/path/file name?query parameter name[array index]=query parameter value. The content carried by a URL can therefore identify many of its attributes and much information about it. To classify URLs so that URLs governed by the same set of processing logic fall into the same category, identification data that characterizes the processing logic of the URL to be classified can be extracted from the URL itself. The identification data may be one or several complete fields of the URL, or fields obtained by processing one or several fields of the URL.
In this example, the identification data of the URL to be classified can be extracted according to a preset field extraction rule. For example, one or more fields with identifying power can be selected from the structure of the URL and used as the identification data of the URL. Once the identifying fields have been set, the identification data of the URL can be extracted as follows:
S1: Determine whether the URL to be classified contains the first field;
S2: If there is no first field, extract the second field and the third field from the URL to be classified as the identification data of the URL to be classified;
S3: If there is a first field, determine whether the first field contains characters used to pass variable values;
S4: If there are characters used to pass variable values, remove them from the first field, and use the first field with those characters removed, together with the third field, as the identification data of the URL to be classified;
S5: If there are no characters used to pass variable values, use the first field and the third field as the identification data of the URL to be classified.
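Steps S1 to S5 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it assumes the first field is the query parameter name field, the second the path field, and the third the file name field, and that the characters used to pass variable values are a bracketed index such as `[9090]`. The function name `extract_identification_data` and the tuple output are hypothetical choices; `urlsplit` and `parse_qsl` come from the Python standard library.

```python
import re
from typing import Tuple
from urllib.parse import parse_qsl, urlsplit

# Assumption: variable values are passed as a bracketed index, e.g. "aid[9090]".
_BRACKETED = re.compile(r"\[[^\]]*\]")


def extract_identification_data(url: str) -> Tuple[str, ...]:
    """Return the identification ("mark") data of a URL as a tuple of strings."""
    parts = urlsplit(url)
    directory, _, filename = parts.path.rpartition("/")

    # S1/S2: no query parameter name field, so use path + file name.
    if not parts.query:
        return tuple(p for p in (directory.strip("/"), filename) if p)

    # S3-S5: keep each parameter name, dropping any bracketed index it carries.
    names = [
        _BRACKETED.sub("", name)
        for name, _value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return (filename, *names)
```

Under these assumptions, `http://example.com/index.php?aid[9090]=x&act=DOWN` yields `("index.php", "aid", "act")`, and `http://example.com/news/index.php` yields `("news", "index.php")`.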
That is, fields with identifying power are selected from the multiple fields of the URL. For example, the first field (such as the query parameter name field), the second field (such as the path field), and the third field (such as the file name field) can identify the processing logic of the URL, so the identification data can be generated from these fields. Further, the query parameter name field together with the file name field can identify the processing logic, and so can the file name field together with the path field, although identification by the query parameter name field is more accurate. It can therefore first be determined whether the URL has a query parameter name field: if not, path + file name is used directly as the identification data (which may be called the generalized feature); if so, file name + parameter names can be used as the identification data.
The query parameter name field stores the key-value pairs of the conditions needed for the query; it may include multiple key-value pairs, and each key serves as a parameter name. The path field characterizes the path of the URL, and the file name field characterizes the name of the file the URL points to. In an implementation, the file name field can be determined as follows: scanning the current URL from left to right, find the last "/" character, then look for the first "?" character starting from that position. If there is no "?" character, the whole remaining string is the file name; if there is a "?" character, the string between "/" and "?" is the file name.
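The file name rule just described (last "/", then the first "?" after it) can be sketched as below. The function name `filename_field` is hypothetical. The sketch follows the description literally, so it assumes the query part itself contains no "/" character; a URL whose parameter values contain "/" would need the "?" to be located first.

```python
def filename_field(url: str) -> str:
    """Return the text between the last "/" and the first "?" that follows it."""
    # Everything after the last "/" in the URL.
    tail = url[url.rfind("/") + 1:]
    # Cut at the first "?" if there is one; otherwise the whole tail is the name.
    question = tail.find("?")
    return tail if question == -1 else tail[:question]
```

For instance, `filename_field("http://example.com/news/index.php?cid=1111&action=DOWN")` gives `"index.php"`.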
Generalization here means that, when a machine processes an input signal, representative features can be extracted from the new input signal so as to associate it with existing input signals, and input signals with the same features are treated as one kind of input signal. Because the purpose of the identification data is precisely to enable classification, so that a URL is associated with existing URLs of the same processing logic, the identification data may be called a generalized feature.
The above uses the parameter name field, path field, and file name field only as examples. In an actual implementation, different identification data can be generated according to the actual data analysis requirements. This application does not limit this.
Some fields may contain characters used to pass variable values; these characters are simply passed into the processing logic as variable values. If the other parts of the identifying field are identical and only the characters used to pass variable values differ, the URLs can still be regarded as belonging to the same set of processing logic. To avoid URLs that should belong to the same processing logic being placed in different categories because of these characters, it can first be determined whether the identifying field contains characters used to pass variable values; if so, those characters are deleted, and the field with them deleted is used as the identification data.
Taking the parameter name field as the identifying field, the characters used to pass variable values are generally those in []. The [] in the parameter name field can therefore be deleted before the identification data is generated. Note that this explanation takes the parameter name field of the URL as its example, so the characters used to pass variable values are []; for other fields, the corresponding characters may be different and can be chosen to suit the situation. This application does not limit this.
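Because the delimiters that carry variable values may differ from field to field, one way to express the deletion is a small helper with configurable delimiters. This is only a sketch; the name `strip_variable_chars` and its `opener`/`closer` parameters are assumptions for illustration, with `[` and `]` as defaults to match the parameter name case above.

```python
import re


def strip_variable_chars(field: str, opener: str = "[", closer: str = "]") -> str:
    """Delete every opener...closer span (delimiters included) from a field."""
    # Non-greedy match so that "a[1][2]" loses both spans separately.
    pattern = re.escape(opener) + ".*?" + re.escape(closer)
    return re.sub(pattern, "", field)
```

With this helper, `aid[9090]` and `aid[1]` both reduce to `aid`, so URLs that differ only in the index end up with the same identification data.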
Step 102: Classify the URL to be classified according to the identification data.
Specifically, the URL to be classified can be assigned to the URL category that has the same identification data, or to a specified URL category. That is, after the identification data of the URL has been extracted, it can be matched so that the URL is assigned to the corresponding category. For example, if the extracted identification data is index.php, cid, and action, a search can check whether an existing URL category has the identification data index.php, cid, and action. If so, the URL is matched directly to that category; if not, a new URL category is created with index.php, cid, and action as its identification data.
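The match-or-create behavior of Step 102 can be sketched with a dictionary keyed by the identification ("mark") data. The class name `UrlClassRegistry` and its method are hypothetical, and treating the identification data as an ordered tuple is an assumption; the text leaves the storage form open.

```python
from typing import Dict, Iterable, List, Tuple


class UrlClassRegistry:
    """Group URLs by identification data, creating a class on first sight."""

    def __init__(self) -> None:
        self.classes: Dict[Tuple[str, ...], List[str]] = {}

    def classify(self, url: str, identification_data: Iterable[str]) -> bool:
        """Add the URL to its class; return True if a new class was created."""
        key = tuple(identification_data)
        is_new = key not in self.classes
        # setdefault creates the class when no class has this identification data.
        self.classes.setdefault(key, []).append(url)
        return is_new
```

A second URL with identification data `("index.php", "cid", "action")` joins the existing class and `classify` returns `False`, which a caller could use to skip repeated analysis.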
In this example, the URL to be classified may be a URL from a website traffic log awaiting security audit; that is, a URL can be extracted from such a log as the URL to be classified. After the classification above, the URL can be processed. For example, if a security audit (network data security analysis) of the URL is required and it is determined that a URL in the same category has already been analyzed, this URL does not need to be analyzed. In this way, analyzing one URL achieves the analysis of a whole batch of URLs with the same processing logic.
If the URL is one used for statistical data analysis, this way of classifying can likewise place URLs that belong to the same processing logic in the same category, which simplifies subsequent data processing.
After the URL to be classified has been assigned to the URL category with the same identification data, the processing logic corresponding to that category can be used to determine whether the URL is a safe network request. For example, if a certain category of URLs consists of unsafe requests, then it is only necessary to determine that the current URL belongs to that category to conclude that it is an unsafe URL. Further, statistics can be collected after the URL has been assigned to its category. For example, the URLs of website pages of the same category may correspond to the same identification data, so classifying the URLs in a log makes it possible to count visits to a given category of website pages, or to compare, analyze, and tally different categories of pages. That is, URL classification enables simple and efficient analysis of the data.
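The two uses described above, inheriting a safety verdict from the category and counting visits per category, could be sketched as follows. The function name `summarize_log`, the `(url, class_key)` input shape, and the `unsafe_classes` set are assumptions made for illustration; the class key stands for whatever identification ("mark") data was extracted for the URL.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Set, Tuple


def summarize_log(
    entries: Iterable[Tuple[str, Hashable]],
    unsafe_classes: Set[Hashable],
) -> Tuple[Counter, List[str]]:
    """Count visits per class and flag URLs whose class is known to be unsafe."""
    visits: Counter = Counter()
    flagged: List[str] = []
    for url, class_key in entries:
        visits[class_key] += 1
        if class_key in unsafe_classes:
            # The verdict comes from the class, not from analyzing this URL.
            flagged.append(url)
    return visits, flagged
```

The counter gives per-category access counts directly, and the flagged list contains every URL that falls into a category already judged unsafe.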
The URL classification method above is described below with reference to a specific embodiment. Note, however, that this specific embodiment is only intended to better illustrate the invention and does not unduly limit it:
As shown in Fig. 2, the URL classification method may include: determining whether the uniform resource locator (URL) to be classified contains a query parameter name field; if there is no query parameter name field, using the path and file name in the URL to be classified as its identification data; if there is a query parameter name field, determining whether the query parameter name field contains characters used to pass variable values; and if so, using the file name in the URL to be classified, the query parameter names from which those characters have been removed, and the query parameter names that contain no such characters as the identification data of the URL to be classified.
Based on the URL classification method shown in Fig. 2, several types of URL are described below as examples:
1) A path field, a file name field, and a query parameter name field are present, and there is no []
For such a URL, the identification data obtained with the extraction approach above is the set of the file name and the parameter names.
For example: http://example.com/news/index.php?cid=1111&action=DOWN
Where:
http is the protocol field;
example.com is the domain name field;
news is the path field;
index.php is the file name field;
cid=1111&action=DOWN is the query parameter name field.
As can be seen, the query parameter name field consists of multiple key-value pairs. The query parameter name field in this example includes the following key-value pairs:
cid=1111
action=DOWN
The query parameter name field therefore carries two query parameters: cid and action.
The identification data generated for this URL under the rule above is thus the file name field plus the query parameter names, that is: index.php, cid, and action. These fields can be stored as a sequence, an array, or a similar form to serve as the identification data of the URL. The form used can be chosen according to the actual situation, and this application does not limit it.
2) No path field is present, a file name field and a query parameter name field are present, and there is a []
For such a URL, the identification data obtained with the extraction approach above is the set of the file name and the parameter names with the [] removed.
For example:
http://example.com/index.php?aid[9090]=WyJmbV8yIixbIjIiLCIiLHsiZklkIJoiMiIsI&act=DOWN
The corresponding identification data can be: index.php, aid, and act.
3) A path field and a file name field are present, and no query parameter name field is present
For such a URL, the identification data obtained with the extraction approach above is the path and the file name.
For example: http://example.com/news/index.php
The corresponding identification data can be: news and index.php.
4) No path field is present, a file name field is present, and no query parameter name field is present
For such a URL, because neither a path field nor a query parameter name field is present, only the file name needs to be extracted as the identification data.
For example: http://example.com/index.php
The corresponding identification data can be: index.php.
Note that the above only lists a few example URLs; other types of URL are possible. The identification data for each URL may also differ according to the field selection rule chosen, and identification data may be generated in other ways. This application does not specifically limit which types of URL are handled or which fields or methods are used to generate the identification data.
Several URLs from the same website are used as examples below. Of the four URLs, the first three are the addresses of item pages linked from the same activity page on the same shopping website, and the fourth is the address of an item page on another activity page:
https://chaoshi.detail.tmall.com/item.htm?spm=A3204.7844270.2739258534.3.d831jb&id=529105235717&acm=lb-zebra-39172-923071.1003.1.1070468&aldid=3yfWqKCp&scm=1003.1.lb-zebra-39172-923071.null_529105235717_1070468&pos=3
https://chaoshi.detail.tmall.com/item.htm?spm=A3204.7844270.2739258534.4.d831jb&id=35303479646&acm=lb-zebra-39172-923071.1003.1.1070468&aldid=3yfWqKCp&scm=1003.1.lb-zebra-39172-923071.null_35303479646_1070468&pos=4
https://chaoshi.detail.tmall.com/item.htm?spm=A3204.7844270.2739258534.8.d831jb&id=525692750271&acm=lb-zebra-39172-923071.1003.1.1070468&aldid=3yfWqKCp&scm=1003.1.lb-zebra-39172-923071.null_525692750271_1070468&pos=8
https://chaoshi.detail.tmall.com/item.htm?spm=A3204.7933263.0.0.njGxsP&id=42158669826&rewcatid=50512009
According to the division approach of Fig. 2, the identification data of the first URL is item.htm, spm, id, acm, aldid, scm, and pos; that of the second URL is item.htm, spm, id, acm, aldid, scm, and pos; that of the third URL is item.htm, spm, id, acm, aldid, scm, and pos; and that of the fourth URL is item.htm, spm, id, and rewcatid.
As can be seen, the first, second, and third URLs belong to one category and the fourth URL to another, because only the identification data of the fourth URL differs from that of the others. This shows that this way of determining identification data is workable.
Further, this example does not generate the identification data from the query parameter name field alone. Different websites or different processing logic can sometimes share the same query parameter names while having different file names; that is, even if the query parameter names are identical, different file names may mean different processing logic. For this reason, this example does not rely on the query parameter names alone: when a query parameter name field is determined to be present, both the query parameter names and the file name are used as the identification data. For example:
https://chaoshi.detail.tmall.com/item.htm?spm=A3204.7844270.2739258534.8.d831jb&id=525692750271&acm=lb-zebra-39172-923071.1003.1.1070468&aldid=3yfWqKCp&scm=1003.1.lb-zebra-39172-923071.null_525692750271_1070468&pos=8
https://list.tmall.com/search_product.htm?spm=a3204.8.d831jb&id=525692750271&acm=lb-zebra-39172-923071.1003.1.1070468&aldid=3yfWqKCp&scm=1003.1.lb-zebra-39172.null_525692750271_1070468&pos=8
Although the two URLs have the same set of query parameter names, their file names differ, so they are placed in two categories; that is, the two URLs use different processing logic.
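The point that the file name must be part of the key can be illustrated with a short sketch. It treats the parameter names as an unordered set, which is one reading of "query parameter name set" and is an assumption here; the function name `class_key` is hypothetical, and removal of bracketed indexes is left out to keep the example small.

```python
from typing import FrozenSet, Tuple
from urllib.parse import parse_qsl, urlsplit


def class_key(url: str) -> Tuple[str, FrozenSet[str]]:
    """Build a class key from the file name and the set of parameter names."""
    parts = urlsplit(url)
    # Last path segment is taken as the file name.
    filename = parts.path.rsplit("/", 1)[-1]
    names = frozenset(
        name for name, _value in parse_qsl(parts.query, keep_blank_values=True)
    )
    return filename, names
```

Two URLs with the same parameter names but the file names `item.htm` and `search_product.htm` produce different keys, while reordering the parameters of one URL leaves its key unchanged.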
Which part is the query parameter name field and which part is the file name field can be judged simply and directly from the positions of marker characters, for example, as shown in Fig. 3:
The URL for men's short-sleeved T-shirts on a shopping website:
https://s.shopping.com/list?spm=a219r.lm895.a214d6t-static.2.tp5fhe&q=%E7%9F%AD%E8%A2%96T&cat=50344007&style=grid&seller_type=shopping
The URL for men's long-sleeved T-shirts on a shopping website:
https://s.shopping.com/list?spm=a219r.lm895.a214d6t-static.3.Fhw3a1&q=%E9%95%BF%E8%A2%96T&cat=50344007&style=grid&seller_type=shopping
Here, the transfer protocol used by the URLs is the Hypertext Transfer Protocol (HTTP), s.shopping.com is the domain name:port part, list is the file name, and spm, q, cat, style, and seller_type are the query parameter names; in the first URL, a219r.lm895.a214d6t-static.2.tp5fhe, %E7%9F%AD%E8%A2%96T, 50344007, grid, and shopping are the query parameter values. In this example, the first identification information is spm, q, cat, style, and seller_type, and the file name is list.
Whether the URL has query parameter names can be detected as follows:
1) Detect whether the URL includes the separator "?"; as can be seen, both URLs above contain the separator "?";
2) Determine whether any characters follow the separator "?"; as can be seen, in both URLs above the separator "?" is followed by characters, respectively:
spm=a219r.lm895.a214d6t-static.2.tp5fhe&q=%E7%9F%AD%E8%A2%96T&cat=50344007&style=grid&seller_type=shopping;
spm=a219r.lm895.a214d6t-static.3.Fhw3a1&q=%E9%95%BF%E8%A2%96T&cat=50344007&style=grid&seller_type=shopping.
That is, the separator "?" marks the starting position of the query parameter name field: if characters follow "?", there is a query parameter name field; if no characters follow "?", there is no query parameter name field.
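The two-step check can be written compactly with `str.partition`, which splits a string around the first occurrence of a separator. The function name `has_query_parameter_field` is a hypothetical choice for this sketch.

```python
def has_query_parameter_field(url: str) -> bool:
    """True when the URL contains "?" and at least one character follows it."""
    _before, separator, after = url.partition("?")
    # separator is "" when no "?" exists; after is "" when nothing follows it.
    return bool(separator) and bool(after)
```

A URL ending in a bare "?" is treated the same as a URL with no "?" at all, matching the rule stated above.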
Sometimes the file name may be empty, and the query parameter names may also be empty. Therefore, if the URL does not include query parameter names, the path information (that is, the path and file name) can be extracted directly as the representative generalized feature and used as the identification data of the URL, for example:
The home page for women's bags on a shopping website has the URL:
https://www.shopping.com/market/nvbao/shouye.php
As can be seen, the URL has no characters after a "?" separator, and the path information that precedes where the separator would be is market/nvbao/shouye.php. That is, the URL does not include query parameter names, so market/nvbao/shouye.php can be extracted as the identification data of the URL.
Character of the query argument file-name field all without removal for transmitting variate-value in illustrated example listed above, however, In view of in a practical situation, the character for transmitting variate-value can be carried in query argument file-name field sometimes, in order to avoid this Influence of the presence of a little characters to classification results, by these character deletions, certainly, can be deleted when generating mark data Except when, be not only to delete [], [] intermediate character is also to delete, that is, only retain in query argument file-name field [] it Outer character.
In the examples above, URLs are classified by extracting their mark data. The mark data can be regarded as a generalized feature, and classification can be regarded as generalization: the path information and the query interface names of a URL serve as generalized features from which a generalized expression of the URL is formed. URLs that share a generalized expression have the same path information and the same query interface names and therefore execute the same set of computer processing logic, so it suffices to analyze any one of the URLs that share a generalized expression. Applied to routine network data security analysis, this allows the request principle of many URLs to be learned from the analysis of a single URL, which reduces repeated work in tasks such as the security analysis of request logs and improves the efficiency of data analysis.
Further, when forming the generalized expression of a URL, the array index in each query interface name can first be removed, and the generalized expression is then formed from the query interface names without the array index. The purpose is to keep the generalized expression accurate: an array index is generally passed into the computer processing logic only as a variable value, and a generalized feature that includes it would not identify a single set of processing logic. The array index can therefore be discarded and left out of the generalized features.
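As a hedged illustration of the generalized expression, the sketch below combines the path information with the query interface names after the array index has been removed. The tuple layout, the de-duplication of repeated names, the function name `generalized_expression`, and the sample URLs are all assumptions made for this example.

```python
import re
from urllib.parse import parse_qsl, urlsplit

_ARRAY_INDEX = re.compile(r"\[[^\]]*\]")


def generalized_expression(url: str) -> tuple:
    """Return (path, query interface names) with array indexes removed."""
    parts = urlsplit(url)
    names = [
        _ARRAY_INDEX.sub("", name)
        for name, _value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # Assumption: a name repeated only because of different array indexes
    # counts once; dict.fromkeys keeps the first-seen order.
    return (parts.path.lstrip("/"), tuple(dict.fromkeys(names)))


# Hypothetical URLs that differ only in the variable values they carry.
a = generalized_expression("https://www.shopping.com/market/search.php?ids[0]=5&ids[1]=7&page=2")
b = generalized_expression("https://www.shopping.com/market/search.php?ids[0]=9&page=40")
print(a == b)
```

Both URLs yield the same expression because only values and array indexes differ, which is the property the text relies on.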
The above URL classification method can be applied to routine security audit work. As shown in Fig. 4, the input to network data security analysis is usually a request log. Such a log is normally very large and contains both normal and abnormal traffic, which are mixed together and hard to tell apart. In this example, the massive number of URLs in the request log are generalized by URL classification, so that URLs executing the same set of computer processing logic are clustered together. When the URL requests in the request log are analyzed, only one URL request in each cluster needs to be analyzed to learn the request principle of the whole batch of URLs, which reduces repeated work in log analysis and improves its efficiency.
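The clustering step can be sketched as grouping log URLs under a shared key and then taking one URL per group. The key used here (path plus query parameter names) is a simplified stand-in for the mark data, and the names `cluster_key`, `cluster_urls`, `representatives`, and the sample log are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def cluster_key(url: str) -> tuple:
    """Simplified stand-in for mark data: path plus query parameter names."""
    parts = urlsplit(url)
    names = dict.fromkeys(
        name for name, _value in parse_qsl(parts.query, keep_blank_values=True)
    )
    return (parts.path, tuple(names))


def cluster_urls(urls):
    """Group URLs that share the same key into one cluster."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[cluster_key(url)].append(url)
    return clusters


def representatives(clusters):
    """Pick the first URL of every cluster for analysis."""
    return [members[0] for members in clusters.values()]


log = [
    "https://www.shopping.com/market/search.php?q=bag&page=1",
    "https://www.shopping.com/market/search.php?q=shoe&page=9",
    "https://www.shopping.com/market/nvbao/shouye.php",
]
picked = representatives(cluster_urls(log))
print(picked)
```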
Fig. 5 shows a schematic structural diagram of a server-side classification device according to an exemplary embodiment of the application. Referring to Fig. 5, at the hardware level the device includes a processor, an internal bus, a network interface, memory, and non-volatile storage, and may of course include other hardware required by the service. The processor reads the corresponding computer program from the non-volatile storage into memory and runs it, forming the URL classification device at the logical level. Besides a software implementation, the application does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the executing entity of the following process flow is not limited to logic units and may also be hardware or a logic device.
Referring to Fig. 6, in a software implementation the URL classification device can be applied in an analysis and processing server, which may be a single server or a server cluster. The device may include an extraction module, a generation module, and a division module, wherein:
the extraction module is configured to extract fields from a URL to be classified according to a preset field extraction rule;
the generation module is configured to use the extracted fields as the mark data of the URL to be classified, wherein the mark data characterizes the processing logic of the URL to be classified;
the division module is configured to assign the URL to be classified to the URL category that has the same mark data.
In one embodiment, the extraction module may specifically determine whether the URL to be classified has a first field. If there is no first field, a second field and a third field are extracted from the URL to be classified as its mark data. If there is a first field, it is determined whether the first field contains a character used to pass a variable value. If it does, that character is removed from the first field, and the first field after removal together with the third field is used as the mark data of the URL to be classified. If it does not, the first field and the third field are used as the mark data of the URL to be classified.
The first field may be the query parameter name field, the second field may be the path field, and the third field may be the file name field; the character used to pass a variable value may be an array index.
In one embodiment, the extraction module may use the query parameter names and the file name in the URL to be classified, in the form of a sequence, as the mark data of the URL to be classified.
In one embodiment, the device may further include an acquisition module configured to extract, before the mark data of the URL to be classified is extracted, one URL from a website traffic log awaiting security audit as the URL to be classified.
In a specific embodiment, the above URL classification device may classify URLs as follows: determine whether the uniform resource locator (URL) to be classified has a query parameter name field; if it is determined that there is no query parameter name field, use the path and file name in the URL to be classified as its mark data; if it is determined that there is a query parameter name field, use the query parameter names and the file name in the URL to be classified as its mark data; and assign the URL to be classified to the URL category that has the same mark data. Specifically, it may also be determined whether the query parameter name field contains a character used to pass a variable value; if so, the file name in the URL to be classified, the query parameter names from which that character has been removed, and the query parameter names that contain no such character are used together as the mark data of the URL to be classified.
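The flow of this specific embodiment can be sketched end to end as below. The sketch is illustrative only: representing the mark data as a tuple, keeping parameter names in order of first appearance, and the names `mark_data` and `classify` are assumptions, and the sample URLs with query strings are hypothetical.

```python
import posixpath
import re
from urllib.parse import parse_qsl, urlsplit

_ARRAY_INDEX = re.compile(r"\[[^\]]*\]")


def mark_data(url: str) -> tuple:
    """Build mark data following the branches described in the embodiment."""
    parts = urlsplit(url)
    path = parts.path.lstrip("/")
    names = [n for n, _v in parse_qsl(parts.query, keep_blank_values=True)]
    if not names:
        # No query parameter name field: path and file name.
        return (path,)
    # Query parameter name field present: file name plus parameter names,
    # with characters used to pass variable values (array indexes) removed.
    file_name = posixpath.basename(path)
    cleaned = [_ARRAY_INDEX.sub("", n) for n in names]
    return (file_name, *dict.fromkeys(cleaned))


def classify(url: str, categories: dict) -> tuple:
    """Place the URL into the category that has the same mark data."""
    key = mark_data(url)
    categories.setdefault(key, []).append(url)
    return key


categories = {}
k1 = classify("https://www.shopping.com/market/nvbao/shouye.php", categories)
k2 = classify("https://www.shopping.com/market/search.php?ids[0]=5&page=2", categories)
k3 = classify("https://www.shopping.com/market/search.php?ids[0]=9&ids[1]=4&page=7", categories)
print(len(categories))
```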
This example also provides a data processing system that includes the above URL classification device. The URL classification device divides the URLs in the website traffic log to be audited into multiple categories, where the URLs in one category correspond to the same set of processing logic; of the multiple URLs in one category, only one is then extracted for analysis.
The functions and specific operations of the URL classification device may be carried out as described above; they are not repeated here, and the application places no limitation on them.
The data processing system can be applied in the scenario shown in Fig. 7. A request log is obtained, which may record the accesses of multiple users to a certain website. Because the request log stores multiple URL access records, the URL classification device can classify the URLs as needed. In actual operation, the classification may be performed on every URL in the obtained request log. Alternatively, the URLs may be handled one by one: for the current URL, it is determined whether its corresponding processing logic has already been processed; if so, this URL can be skipped and the next URL handled, which effectively reduces repeated operations.
The above access records may be generated by a client, which may be a terminal device or software used by a visitor. Specifically, the client may be a terminal device such as a smartphone, tablet computer, laptop, desktop computer, smartwatch, or other wearable device. The client may also be software that runs on such a terminal device, for example application software such as Mobile Taobao, Alipay, or a browser.
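The one-by-one variant described above can be sketched with a set that remembers which processing logic has already been handled. The key function is a simplified stand-in for the mark data, and `logic_key`, `audit_log`, the `analyze` callback, and the sample log are hypothetical names introduced for this sketch.

```python
from urllib.parse import parse_qsl, urlsplit


def logic_key(url: str) -> tuple:
    """Simplified stand-in for mark data: path plus query parameter names."""
    parts = urlsplit(url)
    names = dict.fromkeys(
        name for name, _value in parse_qsl(parts.query, keep_blank_values=True)
    )
    return (parts.path, tuple(names))


def audit_log(urls, analyze) -> int:
    """Analyze one URL per processing logic and skip the rest.

    Returns the number of URLs that were actually analyzed.
    """
    processed = set()
    analyzed = 0
    for url in urls:
        key = logic_key(url)
        if key in processed:
            continue  # this processing logic has already been handled
        processed.add(key)
        analyze(url)
        analyzed += 1
    return analyzed


log = [
    "https://www.shopping.com/market/search.php?q=bag",
    "https://www.shopping.com/market/search.php?q=shoe",
    "https://www.shopping.com/market/nvbao/shouye.php",
]
seen = []
count = audit_log(log, seen.append)
print(count, seen)
```

Keeping only keys in the set means memory grows with the number of distinct processing logics rather than with the size of the log.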
It should be noted that the terms "first", "second", and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and not to describe a particular order or sequence. It should be understood that data so used can be interchanged where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described. In addition, the terms "comprise" and "have" and any variations of them are intended to cover a non-exclusive inclusion: for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
The URL classification method and system and the data processing method and system provided by the application extract, according to differences in the processing logic of URLs, the mark data that characterizes the processing logic of each URL, and divide the URLs into different categories according to the mark data. The URLs in one category correspond to the same processing logic, which effectively improves the efficiency of URL classification. Classifying URLs also reduces repeated operations when the URLs are subsequently processed together, and so improves the efficiency of that processing.
The descriptions of the embodiments in the application are merely applications of some of its embodiments; slightly modified implementations based on certain standards, models, and methods can also carry out the solutions of the above embodiments. Of course, other non-inventive variations that conform to the method steps described in the above embodiments can still achieve the same application, and they are not described again here.
Although the application provides method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included on the basis of conventional or non-inventive effort. The order of steps enumerated in the embodiments is only one of many possible execution orders and does not represent the only one. When an actual device or client product executes, the steps may be executed in the order shown in the embodiments or drawings, or in parallel (for example, in an environment with parallel processors or multithreaded processing).
The devices or modules set forth in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. For convenience of description, the above device is described as divided into modules by function. When the application is implemented, the functions of the modules may be realized in one or more pieces of software and/or hardware. A module that realizes a certain function may also be implemented as a combination of multiple submodules or subunits.
The method, device, or modules described in the application may be implemented by a controller in the form of computer-readable program code, and the controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, or the form of logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, besides implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller realizes the same functions in the form of logic gates, switches, ASICs, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the devices it contains for realizing various functions can be regarded as structures within the hardware component. Indeed, the devices for realizing various functions can be regarded both as software modules that implement the method and as structures within the hardware component.
Some of the modules in the device described in the application may be described in the general context of computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, and the like that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
From the above description of the embodiments, those skilled in the art can clearly understand that the application can be implemented by software plus the necessary hardware. Based on this understanding, the technical solution of the application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product, and can also be embodied in the implementation of a data migration. The computer software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, mobile terminal, server, network device, or the like) to execute the methods described in the embodiments of the application or in parts of the embodiments.
The embodiments in this specification are described progressively; for parts that are the same or similar between embodiments, reference can be made from one to another, and each embodiment focuses on its differences from the others. The application, in whole or in part, can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
Although the application has been described by way of embodiments, those of ordinary skill in the art will appreciate that many variations and changes are possible without departing from its spirit, and the appended claims are intended to cover such variations and changes.

Claims (19)

1. A URL classification method, characterized in that the method comprises:
determining whether a uniform resource locator (URL) to be classified has a query parameter name field;
if there is no query parameter name field, using the path and file name in the URL to be classified as the mark data of the URL to be classified;
if there is a query parameter name field, using the query parameter name and file name in the URL to be classified as the mark data of the URL to be classified;
classifying the URL to be classified according to the mark data.
2. The method according to claim 1, characterized in that using the query parameter name and file name in the URL to be classified as the mark data of the URL to be classified comprises:
determining whether the query parameter name field contains a character used to pass a variable value;
if so, using the file name in the URL to be classified, the query parameter name from which the character used to pass a variable value has been removed, and the query parameter name without any character used to pass a variable value, as the mark data of the URL to be classified.
3. The method according to claim 1, characterized in that, after the URL to be classified is classified according to the mark data, the method further comprises:
determining, by the processing logic corresponding to the URL category to which the URL is assigned, whether the URL to be classified is a safe network request.
4. A URL classification method, characterized in that the method comprises:
extracting fields from a URL to be classified according to a preset field extraction rule;
using the extracted fields as the mark data of the URL to be classified, wherein the mark data characterizes the processing logic of the URL to be classified;
classifying the URL to be classified according to the mark data.
5. The method according to claim 4, characterized in that extracting fields from the URL to be classified according to the preset field extraction rule comprises:
determining whether the URL to be classified has a first field;
if there is no first field, extracting a second field and a third field from the URL to be classified as the mark data of the URL to be classified;
if there is a first field, determining whether the first field contains a character used to pass a variable value;
if there is a character used to pass a variable value, removing that character from the first field, and using the first field after removal together with the third field as the mark data of the URL to be classified;
if there is no character used to pass a variable value, using the first field and the third field as the mark data of the URL to be classified.
6. The method according to claim 5, characterized in that the first field is a query parameter name field, the second field is a path field, and the third field is a file name field.
7. The method according to claim 5, characterized in that the character used to pass a variable value is an array index.
8. The method according to claim 5, characterized in that using the first field and the third field as the mark data of the URL to be classified comprises:
using the query parameter name and file name in the URL to be classified, in the form of a sequence, as the mark data of the URL to be classified.
9. The method according to any one of claims 4 to 8, characterized in that, before the mark data of the URL to be classified is extracted, the method further comprises:
extracting one URL from a website traffic log awaiting security audit as the URL to be classified.
10. The method according to any one of claims 4 to 8, characterized in that, after the URL to be classified is classified according to the mark data, the method further comprises:
processing the URL to be classified by the processing logic corresponding to the URL category to which the URL is assigned.
11. The method according to claim 10, characterized in that processing the URL to be classified by the processing logic corresponding to the URL category to which the URL is assigned comprises:
determining, by the processing logic corresponding to the URL category to which the URL is assigned, whether the URL to be classified is a safe network request.
12. A data processing method, characterized in that the method comprises:
dividing the URLs in a website traffic log to be audited into multiple categories, wherein the URLs in one category correspond to the same set of processing logic;
for the multiple URLs in one category, extracting only one for analysis.
13. The method according to claim 12, characterized in that dividing the URLs in the website traffic log to be audited into multiple categories comprises:
determining the category of a URL as follows:
determining whether the URL to be classified has a query parameter name field;
if there is no query parameter name field, using the path and file name in the URL to be classified as the mark data of the URL to be classified;
if there is a query parameter name field, using the query parameter name and file name in the URL to be classified as the mark data of the URL to be classified;
classifying the URL to be classified according to the mark data.
14. The method according to claim 13, characterized in that using the query parameter name and file name in the URL to be classified as the mark data of the URL to be classified comprises:
determining whether the query parameter name field contains a character used to pass a variable value;
if so, using the file name in the URL to be classified, the query parameter name from which the character used to pass a variable value has been removed, and the query parameter name without any character used to pass a variable value, as the mark data of the URL to be classified.
15. A URL classification system, characterized in that the system comprises:
a determining module, configured to determine whether a URL to be classified has a query parameter name field;
a first generation module, configured to use, when it is determined that there is no query parameter name field, the path and file name in the URL to be classified as the mark data of the URL to be classified;
a second generation module, configured to use, when it is determined that there is a query parameter name field, the query parameter name and file name in the URL to be classified as the mark data of the URL to be classified;
a division module, configured to classify the URL to be classified according to the mark data.
16. The system according to claim 15, characterized in that the second generation module comprises:
a determination unit, configured to determine whether the query parameter name field contains a character used to pass a variable value;
a generation unit, configured to use, when it is determined that there is a character used to pass a variable value, the file name in the URL to be classified, the query parameter name from which that character has been removed, and the query parameter name without any character used to pass a variable value, as the mark data of the URL to be classified.
17. The system according to claim 15, characterized in that the system further comprises:
a determining module, configured to determine, after the URL to be classified is classified according to the mark data and by the processing logic corresponding to the URL category to which the URL is assigned, whether the URL to be classified is a safe network request.
18. A URL classification system, characterized in that the system comprises:
an extraction module, configured to extract fields from a URL to be classified according to a preset field extraction rule;
a generation module, configured to use the extracted fields as the mark data of the URL to be classified, wherein the mark data characterizes the processing logic of the URL to be classified;
a division module, configured to classify the URL to be classified according to the mark data.
19. A data processing system, characterized in that the system comprises:
a division module, configured to divide the URLs in a website traffic log to be audited into multiple categories, wherein the URLs in one category correspond to the same set of processing logic;
a processing module, configured to extract, for the multiple URLs in one category, only one for analysis.
CN201710012795.1A 2017-01-09 2017-01-09 URL classification method and system and data processing method and system Active CN108287831B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710012795.1A CN108287831B (en) 2017-01-09 2017-01-09 URL classification method and system and data processing method and system

Publications (2)

Publication Number Publication Date
CN108287831A true CN108287831A (en) 2018-07-17
CN108287831B CN108287831B (en) 2022-08-05

Family

ID=62819154

Country Status (1)

Country Link
CN (1) CN108287831B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant