Summary of the Invention
The technical problem addressed by this application is to provide a method for analyzing URLs in network logs that reduces the amount of regular expression matching computation and thereby lowers the computing cost.
This application also provides an apparatus for analyzing URLs in network logs, to ensure that the above method can be applied and implemented in practice.
To solve the above problem, this application discloses a method for analyzing URLs in network logs, comprising:
extracting the URLs from a web log;
deduplicating the URLs;
applying a plurality of preset regular expressions in turn to the deduplicated URLs, and extracting the numbers of the regular expressions that match each deduplicated URL;
for each URL before deduplication, copying the regular expression numbers of the identical deduplicated URL as its corresponding regular expression numbers;
counting the different regular expression numbers corresponding to the URLs before deduplication.
Preferably, the URLs before deduplication and after deduplication are stored as columns in a first table and a second table, respectively; the regular expression numbers corresponding to the deduplicated URLs are stored correspondingly in the second table.
Preferably, the step of finding, for every URL before deduplication, the regular expression numbers of the identical URL among the deduplicated URLs and using them as its corresponding regular expression numbers comprises:
converting the rows of the second table into columns;
performing an equi-join between the first table and the second table on the URL column, so that every URL before deduplication finds its corresponding regular expression numbers.
Preferably, the regular expression numbers corresponding to the URLs before deduplication are added to the first table.
Preferably, the regular expression numbers corresponding to the URLs before deduplication replace the corresponding URLs in the first table.
Preferably, the step of counting the different regular expression numbers corresponding to the URLs before deduplication consists of calculating, for each different regular expression number, the number of times it occurs among all URLs before deduplication.
Preferably, the number of a regular expression is the number of the business category to which it belongs.
This application also provides an apparatus for analyzing URLs in network logs, comprising:
a URL extraction module, for extracting the URLs from a web log;
a URL deduplication module, for deduplicating the URLs;
a regular expression matching module, for applying a plurality of preset regular expressions in turn to the deduplicated URLs and extracting the numbers of the regular expressions that match each deduplicated URL;
a matching result copying module, for copying, for each URL before deduplication, the regular expression numbers of the identical deduplicated URL as its corresponding regular expression numbers;
a statistics module, for counting the different regular expression numbers corresponding to the URLs before deduplication.
Preferably, the URLs before deduplication and after deduplication are stored as columns in a first table and a second table, respectively; the regular expression numbers corresponding to the deduplicated URLs are stored correspondingly in the second table.
Preferably, the matching result copying module comprises:
a row-to-column submodule, for converting the rows of the second table into columns;
an equi-join submodule, for performing an equi-join between the first table and the second table on the URL column, so that every URL before deduplication finds its corresponding regular expression numbers.
Compared with the prior art, this application has the following advantages:
According to this application, for the URLs in a massive web log, duplicate URLs are removed first and regular expression matching is performed only on the deduplicated URLs. Because URLs are accessed repeatedly very often in a massive log, after deduplication identical URLs incur the cost of regular expression matching only once, and the matching result of a deduplicated URL yields the regular expression numbers for all identical URLs. The computing cost of URL regular expression matching can therefore be kept to a minimum very effectively.
By storing the URLs before and after deduplication in tables and performing an equi-join on the URL columns, this application finds the correspondence between every URL before deduplication and its regular expressions. Compared with the non-equi join implied by regular expression matching, this reduces the computing cost. Moreover, when the equi-join is performed, the regular expression numbers can optionally replace the corresponding URLs in the table, so that the displayed result contains only the regular expression numbers; compared with keeping the URLs, this considerably narrows the table columns and consumes fewer resources.
Detailed Description
To make the above objects, features and advantages of this application clearer and easier to understand, the application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Referring to Fig. 1, a flowchart of an embodiment of the method for analyzing URLs in network logs according to this application is shown, which may specifically comprise the following steps:
Step 101: extract the URLs from the web log.
A web log is a file ending in .log that records raw information such as the requests received and processed by a web server and its runtime errors; strictly speaking, it is a server log. The web log contains the web addresses (URLs) that visitors request.
A URL consists of three parts: protocol, domain name and request address. A complete URL uniquely identifies one requested resource, which may be a page, a content module, a file, a multimedia resource, and so on. For a website, the purpose of a URL is to locate a resource uniquely, and this can be done in many ways: with a unique description of the resource (its name or abbreviation), a unique identifier of the resource (an ID, a numeric tag, etc.), or dynamic parameters. Therefore, the information extracted from a URL reveals which web content a visitor accessed, and analyzing the URLs in massive logs reveals how the various web resources are accessed, for example the number and frequency of visits.
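By way of illustration only, step 101 might be sketched in Python as follows. The log layout is not specified in this application, so the sketch assumes that each log line carries the requested URL as a token beginning with http:// or https://. The names URL_PATTERN and extract_urls and the sample log lines are hypothetical.

```python
import re

# Assumption: the requested URL appears in each log line as a token that
# starts with http:// or https:// and runs to the next whitespace or quote.
URL_PATTERN = re.compile(r"https?://[^\s\"']+")


def extract_urls(log_lines):
    """Return every URL found in the log lines, in order, duplicates kept."""
    urls = []
    for line in log_lines:
        urls.extend(URL_PATTERN.findall(line))
    return urls


# Hypothetical log lines, for illustration only.
sample_log = [
    '10.0.0.1 [01/Jan] "GET http://men.taobao.com/123456" 200',
    '10.0.0.2 [01/Jan] "GET http://women.taobao.com/123456" 200',
    '10.0.0.1 [01/Jan] "GET http://men.taobao.com/123456" 200',
]
urls = extract_urls(sample_log)
```

Duplicates are deliberately kept at this stage, since the later statistics are computed over all URLs before deduplication.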
Step 102: deduplicate the URLs.
A single URL may be accessed many times in one day, so a massive network log contains a large number of repeated URLs. Deduplication removes the repeated addresses from the web log so that the retained URLs are all different. Deduplication can be done by extracting the distinct URLs from all URLs, or by placing the URLs into a table one by one and, before storing each URL, checking whether the same address already exists in the table: if it does not, the URL is added; if it does, it is not added.
In a preferred embodiment of this application, the URLs before and after deduplication can be stored as columns in a first table and a second table, respectively, as shown in the following example.
The first table is:
A |
http://men.taobao.com/123456 |
http://men.taobao.com/123456 |
http://men.taobao.com/123456 |
http://women.taobao.com/123456 |
http://women.taobao.com/123456 |
http://women.taobao.com/123456 |
Here the URL http://men.taobao.com/123456 appears 3 times and the URL http://women.taobao.com/123456 also appears 3 times, so the second table obtained after deduplication is:
D |
http://men.taobao.com/123456 |
http://women.taobao.com/123456 |
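The second deduplication approach described above (check whether the address is already stored, and add it only if it is not) can be sketched as follows. This is a minimal illustration under the assumption that the URLs fit in memory; the names deduplicate, first_table and second_table are hypothetical.

```python
def deduplicate(urls):
    """Return the distinct URLs, keeping the order of first appearance."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:  # the "is it already in the table?" check
            seen.add(url)
            unique.append(url)
    return unique


# Column A of the example: six log records, two distinct URLs.
first_table = [
    "http://men.taobao.com/123456",
    "http://men.taobao.com/123456",
    "http://men.taobao.com/123456",
    "http://women.taobao.com/123456",
    "http://women.taobao.com/123456",
    "http://women.taobao.com/123456",
]
second_table = deduplicate(first_table)
```

The set is used only for the membership test; the list preserves the order in which distinct URLs were first seen.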
Step 103: apply a plurality of preset regular expressions in turn to the deduplicated URLs, and extract the numbers of the regular expressions that match each deduplicated URL.
As is well known, a regular expression is a tool for text matching, usually composed of ordinary characters and metacharacters. Ordinary characters include upper- and lower-case letters and digits, while metacharacters have special meanings. Matching a regular expression can be understood as finding, in a given string, the parts that conform to the given regular expression. A string may contain more than one such part, and each of them is called a match. Here, each URL is matched against preset regular expressions containing keywords: a match indicates that the URL contains the keyword in the regular expression, and no match indicates that it does not. Matching a URL against multiple regular expressions reveals the information contained in the URL or the categories of that information.
In a preferred embodiment of this application, the regular expression numbers corresponding to the deduplicated URLs can be stored correspondingly in the second table. Specifically, the number of a regular expression can be the number of the business category to which it belongs.
Continuing the example above, the result after matching is shown in the following table:
D | E |
http://men.taobao.com/123456 | men |
http://women.taobao.com/123456 | men |
http://women.taobao.com/123456 | women |
Matching http://men.taobao.com/123456 against the preset regular expressions shows that it matches the regular expression numbered men. http://women.taobao.com/123456 contains both the keyword men and the keyword women, so it matches the regular expressions numbered men and women.
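The matching in step 103 can be sketched as follows. The actual preset regular expressions are not given in this application, so the sketch assumes that each expression is a bare keyword and that its number is the category label used in the example (men, women). The names PRESET_PATTERNS and match_numbers are hypothetical.

```python
import re

# Assumption: each preset regular expression is a bare keyword, and its
# "number" is the business-category label used in the example tables.
PRESET_PATTERNS = [
    ("men", re.compile(r"men")),
    ("women", re.compile(r"women")),
]


def match_numbers(url):
    """Apply every preset expression in turn; return the numbers that match."""
    return [number for number, pattern in PRESET_PATTERNS if pattern.search(url)]


# Column D of the example: the deduplicated URLs.
second_table = [
    "http://men.taobao.com/123456",
    "http://women.taobao.com/123456",
]

# One (URL, number) row per match, as in columns D and E of the example.
matches = [
    (url, number)
    for url in second_table
    for number in match_numbers(url)
]
```

Because the string women contains men, the second URL matches both expressions, which reproduces the three rows of the example table.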
Step 104: for each URL before deduplication, copy the regular expression numbers of the identical deduplicated URL as its corresponding regular expression numbers.
Every URL before deduplication has an identical URL among the deduplicated URLs, so each URL before deduplication can take the regular expression numbers of that identical URL as its own. Because this application performs regular expression matching only on the deduplicated URLs, the workload is greatly reduced compared with the prior art, which matches every URL one by one. In the example above, the prior art would have to match 6 URLs one by one, whereas after deduplication only 2 URLs need to be matched, and the matching results are then mapped back to the 6 URLs.
In a specific implementation, after the URLs before and after deduplication and the matching results have been placed in tables, step 104 may comprise:
Sub-step S11: convert the rows of the second table into columns.
Sub-step S12: perform an equi-join between the first table and the second table on the URL column, so that every URL before deduplication finds its corresponding regular expression numbers.
After regular expression matching, the regular expression numbers corresponding to each URL are stored in column form. Through row-to-column conversion, the numbers of the regular expressions corresponding to one URL can be stored in a single row, ordered by value, as follows:
D | E | F |
http://men.taobao.com/123456 | men | |
http://women.taobao.com/123456 | men | women |
Then an equi-join is performed on column A and column D, which associates the regular expression numbers in columns E and F (and in further columns such as G, where present) with the URLs in column A.
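Sub-steps S11 and S12 can be sketched in Python as follows. This is an in-memory illustration in which a dictionary lookup stands in for the equi-join; a data warehouse would express the same operations as a pivot and a join. The names rows_to_columns, equi_join and numbers_only are hypothetical.

```python
from collections import defaultdict

MEN_URL = "http://men.taobao.com/123456"
WOMEN_URL = "http://women.taobao.com/123456"

# Second table after matching: one (URL, number) row per match.
long_rows = [(MEN_URL, "men"), (WOMEN_URL, "men"), (WOMEN_URL, "women")]

# First table: all URLs before deduplication (column A).
first_table = [MEN_URL] * 3 + [WOMEN_URL] * 3


def rows_to_columns(rows):
    """S11: gather all numbers of one URL into a single sorted entry."""
    wide = defaultdict(list)
    for url, number in rows:
        wide[url].append(number)
    return {url: sorted(numbers) for url, numbers in wide.items()}


def equi_join(urls, wide):
    """S12: equality join on the URL column via an exact-key lookup."""
    return [(url, wide.get(url, [])) for url in urls]


wide_table = rows_to_columns(long_rows)
joined = equi_join(first_table, wide_table)

# Variant in which the numbers replace the URL instead of sitting beside it.
numbers_only = [numbers for _, numbers in joined]
```

The lookup compares whole URLs for equality, so no regular expression is evaluated during the join; that is the contrast with a non-equi join drawn in this application.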
In a preferred embodiment of this application, the regular expression numbers corresponding to the URLs before deduplication can be added to the first table; for example, the regular expression numbers are added in the columns to the right of the URL in the first table, in correspondence with the URL.
In another preferred embodiment of this application, the regular expression numbers corresponding to the URLs before deduplication can replace the corresponding URLs in the first table; that is, for each URL in the first table, the regular expression numbers of the identical URL in the second table are added to the first table and replace the original URL.
For massive numbers of URLs, this application performs regular expression matching only on the distinct URLs, and can therefore very effectively keep the computing cost of URL regular expression matching to a minimum.
Step 105: count the different regular expression numbers corresponding to the URLs before deduplication.
In a specific implementation, step 105 may calculate, for each different regular expression number, the number of times it occurs among all URLs before deduplication. Based on the keywords or categories corresponding to the different regular expressions, statistics can then be compiled on the various kinds of information about visits to the website.
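The counting in step 105 can be sketched with the standard library as follows. The input is assumed to be the result of step 104 for the six-URL example, written out literally so that the sketch is self-contained; the names joined and counts are hypothetical.

```python
from collections import Counter

MEN_URL = "http://men.taobao.com/123456"
WOMEN_URL = "http://women.taobao.com/123456"

# Assumed result of step 104: every URL before deduplication paired with
# the regular expression numbers copied from its deduplicated twin.
joined = (
    [(MEN_URL, ["men"])] * 3
    + [(WOMEN_URL, ["men", "women"])] * 3
)

# Count how often each number occurs across all URLs before deduplication.
counts = Counter(number for _, numbers in joined for number in numbers)
```

For this example the number men occurs 6 times and the number women occurs 3 times, because every record of the second URL contributes to both categories.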
In a specific implementation, this application can be implemented on data warehouse platforms such as Hadoop or Hive.
In summary, according to this application, for the URLs in a massive web log, duplicate URLs are removed first and regular expression matching is performed only on the deduplicated URLs. Because URLs are accessed repeatedly very often in a massive web log, after deduplication identical URLs incur the cost of regular expression matching only once, and the matching result of a deduplicated URL yields the regular expression numbers for all identical URLs. The computing cost of URL regular expression matching can therefore be kept to a minimum very effectively.
By storing the URLs before and after deduplication in tables and performing an equi-join on the URL columns, this application finds the correspondence between every URL before deduplication and its regular expressions. Compared with the non-equi join implied by regular expression matching, this reduces the computing cost. Moreover, when the equi-join is performed, the regular expression numbers can optionally replace the corresponding URLs in the table, so that the displayed result contains only the regular expression numbers; compared with keeping the URLs, this considerably narrows the table columns and consumes fewer resources.
For simplicity of description, the method embodiment is expressed as a series of combined actions, but those skilled in the art should understand that this application is not limited by the described order of actions, because according to this application some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by this application.
Referring to Fig. 2, a structural block diagram of an embodiment of the apparatus for analyzing URLs in network logs according to this application is shown, which may specifically comprise the following modules:
a URL extraction module 201, for extracting the URLs from a web log;
a URL deduplication module 202, for deduplicating the URLs;
a regular expression matching module 203, for applying a plurality of preset regular expressions in turn to the deduplicated URLs and extracting the numbers of the regular expressions that match each deduplicated URL;
a matching result copying module 204, for copying, for each URL before deduplication, the regular expression numbers of the identical deduplicated URL as its corresponding regular expression numbers;
a statistics module 205, for counting the different regular expression numbers corresponding to the URLs before deduplication.
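One possible way to picture how modules 201 to 205 cooperate is to write them as methods of a single class. The sketch below is illustrative only and is not the structure of Fig. 2: the class name UrlLogAnalyzer, its method names, the log layout and the keyword expressions are all assumptions.

```python
import re
from collections import Counter


class UrlLogAnalyzer:
    """Hypothetical grouping of modules 201-205 as methods of one class."""

    def __init__(self, patterns):
        # patterns: (number, regular expression source) pairs; assumed format.
        self.patterns = [(number, re.compile(src)) for number, src in patterns]

    def extract(self, log_lines):  # URL extraction module 201
        found = []
        for line in log_lines:
            found.extend(re.findall(r"https?://\S+", line))
        return found

    def deduplicate(self, urls):  # URL deduplication module 202
        return list(dict.fromkeys(urls))

    def match(self, unique_urls):  # regular expression matching module 203
        return {
            url: [n for n, p in self.patterns if p.search(url)]
            for url in unique_urls
        }

    def copy_results(self, urls, matched):  # matching result copying module 204
        return [(url, matched[url]) for url in urls]

    def count(self, joined):  # statistics module 205
        return Counter(n for _, numbers in joined for n in numbers)

    def run(self, log_lines):
        urls = self.extract(log_lines)
        matched = self.match(self.deduplicate(urls))
        return self.count(self.copy_results(urls, matched))


analyzer = UrlLogAnalyzer([("men", r"men"), ("women", r"women")])
# Hypothetical log lines, for illustration only.
log = [
    "GET http://men.taobao.com/123456",
    "GET http://women.taobao.com/123456",
    "GET http://men.taobao.com/123456",
]
result = analyzer.run(log)
```

Only the match method evaluates regular expressions, and it sees each distinct URL once; the other methods work on whole-URL equality.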
In a preferred embodiment of this application, the URLs before and after deduplication can be stored as columns in a first table and a second table, respectively; the regular expression numbers corresponding to the deduplicated URLs can be stored correspondingly in the second table.
In a preferred embodiment of this application, the matching result copying module may comprise:
a row-to-column submodule, for converting the rows of the second table into columns;
an equi-join submodule, for performing an equi-join between the first table and the second table on the URL column, so that every URL before deduplication finds its corresponding regular expression numbers.
In a preferred embodiment of this application, the regular expression numbers corresponding to the URLs before deduplication can be added to the first table.
In a preferred embodiment of this application, the regular expression numbers corresponding to the URLs before deduplication can replace the corresponding URLs in the first table.
In a preferred embodiment of this application, the statistics module may be a calculation module, for calculating, for each different regular expression number, the number of times it occurs among all URLs before deduplication.
In a preferred embodiment of this application, the number of a regular expression can be the number of the business category to which it belongs.
Because the apparatus embodiment corresponds substantially to the method embodiment shown in the foregoing Fig. 1 and Fig. 2, for anything not described in detail in this embodiment, reference may be made to the relevant description in the foregoing embodiment, which is not repeated here.
This application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
This application can be described in the general context of computer-executable instructions, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. This application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
In this document, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.
The method and the apparatus for analyzing URLs in network logs provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is intended only to help in understanding the method of this application and its core idea. At the same time, those of ordinary skill in the art may, in accordance with the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.