Summary of the invention
The technical problem to be solved by the present application is to provide a method for analyzing URLs in web logs that reduces the amount of regular-expression matching computation and thereby lowers the computational cost.
The present application also provides an apparatus for analyzing URLs in web logs, to ensure that the above method can be applied and implemented in practice.
To solve the above problem, the present application discloses a method for analyzing URLs in web logs, comprising:
extracting the URLs from a web log;
deduplicating the URLs;
applying a plurality of preset regular expressions in turn to the deduplicated URLs to perform regular-expression matching, and extracting the numbers of the regular expressions that match each deduplicated URL;
for each URL before deduplication, copying the regular expression numbers of the deduplicated URL identical to it as its corresponding regular expression numbers;
counting the different regular expression numbers corresponding to the URLs before deduplication.
Preferably, the URLs before deduplication and the URLs after deduplication are stored as columns in a first table and a second table, respectively; the regular expression numbers corresponding to the deduplicated URLs are stored correspondingly in the second table.
Preferably, the step of finding, for every URL before deduplication, the regular expression numbers of the identical URL among the deduplicated URLs and taking them as its corresponding regular expression numbers comprises:
performing a row-to-column conversion on the data of the second table;
performing an equi-join on the URL columns of the first table and the second table, so that every URL before deduplication finds its corresponding regular expression numbers.
Preferably, the regular expression numbers corresponding to the URLs before deduplication are added correspondingly to the first table.
Preferably, the regular expression numbers corresponding to the URLs before deduplication replace the corresponding URLs in the first table.
Preferably, the step of counting the different regular expression numbers corresponding to the URLs before deduplication comprises calculating, for each distinct regular expression number, the number of times it occurs among all URLs before deduplication.
Preferably, the number of a regular expression is the number of the business category to which it belongs.
The present application also provides an apparatus for analyzing URLs in web logs, comprising:
a URL extraction module, configured to extract the URLs from a web log;
a URL deduplication module, configured to deduplicate the URLs;
a regular-expression matching module, configured to apply a plurality of preset regular expressions in turn to the deduplicated URLs to perform regular-expression matching, and to extract the numbers of the regular expressions that match each deduplicated URL;
a matching result copying module, configured to copy, for each URL before deduplication, the regular expression numbers of the deduplicated URL identical to it as its corresponding regular expression numbers;
a statistical module, configured to count the different regular expression numbers corresponding to the URLs before deduplication.
Preferably, the URLs before deduplication and the URLs after deduplication are stored as columns in a first table and a second table, respectively; the regular expression numbers corresponding to the deduplicated URLs are stored correspondingly in the second table.
Preferably, the matching result copying module comprises:
a row-to-column submodule, configured to perform a row-to-column conversion on the data of the second table;
an equi-join submodule, configured to perform an equi-join on the URL columns of the first table and the second table, so that every URL before deduplication finds its corresponding regular expression numbers.
Compared with the prior art, the present application has the following advantages:
According to the present application, for the URLs in a massive web log, the duplicate URLs are removed first, and regular-expression matching is then performed on the deduplicated URLs. Because URLs are accessed repeatedly many times in a massive web log, after deduplication the regular-expression matching cost for identical URLs is incurred only once, and the matching result of a deduplicated URL yields the regular expressions corresponding to all URLs identical to it. The computational cost of URL regular-expression matching can therefore be very effectively minimized.
The present application may store the URLs before and after deduplication in tables and perform an equi-join on the URL columns before and after deduplication to find the correspondence between every URL before deduplication and its regular expressions; compared with the non-equi-join that regular-expression matching amounts to, this reduces the computational cost. Moreover, when performing the equi-join, the regular expression numbers may be chosen to replace the corresponding URLs in the table, so that the result contains only the numbers of the matched regular expressions; compared with keeping the URLs, this greatly reduces the column width of the table and uses fewer resources.
Embodiment
To make the above objects, features and advantages of the present application clearer and easier to understand, the present application is described in further detail below with reference to the drawings and specific embodiments.
Referring to Figure 1, a flowchart of an embodiment of a method for analyzing URLs in web logs according to the present application is shown, which may specifically include the following steps:
Step 101: extract the URLs from the web log.
A web log is a file with the .log extension that records raw information such as the requests received and processed by the web server and runtime errors; more precisely, it should be called a server log. The web log contains the web page addresses (URLs) that visitors requested.
A URL consists of three parts: protocol, domain name and request address. A complete URL uniquely identifies a requested resource, which may be a page, a content module, a file, a multimedia resource, and so on. For a website, the purpose of a URL is to locate a resource uniquely, so it can take many forms: a unique description of the resource (a resource name or abbreviation), a unique identifier of the resource (an ID, a numeric tag, and so on), or dynamic parameters. Therefore, extracting the information in a URL reveals which web content a visitor accessed, and analyzing the URLs in massive logs reveals how the various web resources are accessed, for example the number and frequency of visits.
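Step 101 could be sketched as follows. This is only an illustration: the log line layout, the timestamps, the name `SAMPLE_LOG` and the pattern `URL_PATTERN` are assumptions, since the text does not specify the log format.

```python
import re

# Hypothetical log lines; the real log layout is not specified in the text.
SAMPLE_LOG = """\
2012-01-01 10:00:01 GET http://men.taobao.com/123456 200
2012-01-01 10:00:02 GET http://women.taobao.com/123456 200
2012-01-01 10:00:03 ERROR worker timeout
2012-01-01 10:00:04 GET http://men.taobao.com/123456 200
"""

# Assumed pattern: "http://" or "https://" followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(log_text):
    """Return every URL found in the log text, in order, duplicates kept."""
    urls = []
    for line in log_text.splitlines():
        urls.extend(URL_PATTERN.findall(line))
    return urls


urls = extract_urls(SAMPLE_LOG)
print(urls)
```

Lines without a URL, such as the error record, simply contribute nothing, and repeated URLs are kept because deduplication is a separate step.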
Step 102: deduplicate the URLs.
A URL may be accessed many times in a single day, so a massive web log contains a large number of repeated URLs. Deduplication removes the repeated addresses from the web log so that the retained URLs are all distinct. During deduplication, the distinct URLs may be extracted from all the URLs; alternatively, the URLs may be put into a table one by one, checking before each insertion whether an identical address already exists in the table: if it does not, the URL is added; if it does, the URL is not added.
In a preferred embodiment of the present application, the URLs before and after deduplication may be stored as columns in a first table and a second table, respectively, as in the following example.
The first table is:
A |
http://men.taobao.com/123456 |
http://men.taobao.com/123456 |
http://men.taobao.com/123456 |
http://women.taobao.com/123456 |
http://women.taobao.com/123456 |
http://women.taobao.com/123456 |
Here the URL http://men.taobao.com/123456 appears 3 times, and the URL http://women.taobao.com/123456 also appears 3 times; therefore, the second table obtained after deduplication is:
D |
http://men.taobao.com/123456 |
http://women.taobao.com/123456 |
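The second deduplication approach described above (check the table before adding) can be sketched as follows. The function name `deduplicate` and the use of Python lists for the two tables are assumptions made for illustration.

```python
def deduplicate(urls):
    """Keep the first occurrence of each URL, preserving the original order."""
    seen = set()
    unique = []
    for url in urls:
        # Check whether an identical address is already in the table.
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


# Column A of the first table (URLs before deduplication).
first_table = [
    "http://men.taobao.com/123456",
    "http://men.taobao.com/123456",
    "http://men.taobao.com/123456",
    "http://women.taobao.com/123456",
    "http://women.taobao.com/123456",
    "http://women.taobao.com/123456",
]

# Column D of the second table (URLs after deduplication).
second_table = deduplicate(first_table)
print(second_table)
```

The set gives a constant-time membership check on average, so the whole pass is linear in the number of log entries.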
Step 103: apply a plurality of preset regular expressions in turn to the deduplicated URLs to perform regular-expression matching, and extract the numbers of the regular expressions that match each deduplicated URL.
As is well known, a regular expression is a tool for text matching, usually composed of ordinary characters and metacharacters. Ordinary characters include upper- and lower-case letters and digits, while metacharacters have special meanings. Regular-expression matching can be understood as searching a given string for the parts that match a given regular expression. More than one part of the string may satisfy the regular expression, in which case each such part is called a match. Here, each URL is matched against preset regular expressions that contain keywords: a match indicates that the URL contains the keyword in the regular expression, and a failure to match indicates that it does not. By matching a URL against a plurality of regular expressions, the information contained in the URL, or the category of that information, can be learned.
In a preferred embodiment of the present application, the regular expression numbers corresponding to the deduplicated URLs may be stored correspondingly in the second table. Specifically, the number of a regular expression may be the number of the business category to which it belongs.
Continuing the above example, the result after matching is shown in the following table:
D | E |
http://men.taobao.com/123456 | men |
http://women.taobao.com/123456 | men |
http://women.taobao.com/123456 | women |
Matching http://men.taobao.com/123456 against the preset regular expressions shows that it matches the regular expression numbered men. http://women.taobao.com/123456 contains both the keyword men and the keyword women, so it matches the regular expressions numbered men and women.
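The matching in step 103 can be sketched as follows. The rule list `PRESET_RULES` and its two keyword patterns are hypothetical; the text only says that the preset regular expressions contain keywords and carry numbers such as men and women.

```python
import re

# Hypothetical preset rules: (regular expression number, compiled pattern).
PRESET_RULES = [
    ("men", re.compile(r"men")),
    ("women", re.compile(r"women")),
]


def match_numbers(url, rules=PRESET_RULES):
    """Apply each preset rule in turn and return the numbers of the rules that match."""
    return [number for number, pattern in rules if pattern.search(url)]


# Column D: URLs after deduplication.
deduplicated = [
    "http://men.taobao.com/123456",
    "http://women.taobao.com/123456",
]

# One (URL, number) row per match, i.e. columns D and E in row form.
match_rows = [(url, number) for url in deduplicated for number in match_numbers(url)]
for row in match_rows:
    print(row)
```

Because the string "women" contains "men", the second URL matches both rules, which is why it produces two rows.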
Step 104: for each URL before deduplication, copy the regular expression numbers of the deduplicated URL identical to it as its corresponding regular expression numbers.
Every URL before deduplication has an identical URL among the deduplicated URLs; therefore, a URL before deduplication can take the regular expression numbers of the identical deduplicated URL as its own corresponding regular expression numbers. Because the present application performs regular-expression matching only on the deduplicated URLs, the workload is greatly reduced compared with the prior art, which matches every URL one by one. In the above example, the prior art needs to match 6 URLs one by one, whereas after deduplication only 2 URLs need to be matched, and the matching results then only need to be associated with the 6 URLs.
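The saving in the example can be made concrete by counting pattern evaluations. The sketch below assumes two hypothetical keyword rules (`RULES`) and counts one evaluation per URL per rule; the helper `count_evaluations` is illustrative only.

```python
import re

# Hypothetical preset rules, as in the men/women example.
RULES = [("men", re.compile(r"men")), ("women", re.compile(r"women"))]


def count_evaluations(urls, rules=RULES):
    """Run every rule against every URL and count how many evaluations were made."""
    evaluations = 0
    for url in urls:
        for _, pattern in rules:
            pattern.search(url)
            evaluations += 1
    return evaluations


all_urls = ["http://men.taobao.com/123456"] * 3 + ["http://women.taobao.com/123456"] * 3

# Prior art: match every URL one by one.
naive = count_evaluations(all_urls)

# Proposed approach: match only the distinct URLs (dict.fromkeys keeps first occurrences).
deduplicated = count_evaluations(list(dict.fromkeys(all_urls)))

print(naive, deduplicated)
```

With 6 URLs and 2 rules the one-by-one approach performs 12 evaluations, while matching the 2 distinct URLs performs 4; the gap grows with the repetition rate of the log.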
In a specific implementation, after the URLs before and after deduplication and the regular-expression matching results have been put into tables, step 104 may comprise:
Substep S11: perform a row-to-column conversion on the data of the second table.
Substep S12: perform an equi-join on the URL columns of the first table and the second table, so that every URL before deduplication finds its corresponding regular expression numbers.
After the regular-expression matching, the regular expression numbers corresponding to each URL are stored in the form of columns: the numbers corresponding to a URL can be stored into the columns in order of magnitude, as follows:
D | E | F |
http://men.taobao.com/123456 | men | |
http://women.taobao.com/123456 | men | women |
Then an equi-join is performed between column A and column D, so that the regular expression numbers in columns E, F and G are associated with the URLs in column A.
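Substeps S11 and S12 can be sketched with plain Python containers standing in for the two tables. The helper names `rows_to_columns` and `equi_join` are assumptions; on a data warehouse platform the same steps would normally be expressed as a pivot and a join on the URL column.

```python
from collections import defaultdict

MEN_URL = "http://men.taobao.com/123456"
WOMEN_URL = "http://women.taobao.com/123456"

# Column A of the first table (URLs before deduplication).
first_table = [MEN_URL] * 3 + [WOMEN_URL] * 3

# Matching results in row form: one (URL, number) row per match.
match_rows = [(MEN_URL, "men"), (WOMEN_URL, "men"), (WOMEN_URL, "women")]


def rows_to_columns(rows):
    """S11: pivot (URL, number) rows into one entry per URL with sorted numbers."""
    grouped = defaultdict(list)
    for url, number in rows:
        grouped[url].append(number)
    return {url: sorted(numbers) for url, numbers in grouped.items()}


def equi_join(urls, pivoted):
    """S12: join on URL equality; a hash lookup, with no regular expression evaluated.

    URLs without any match get an empty list (left-join behaviour, an assumption).
    """
    return [(url, pivoted.get(url, [])) for url in urls]


second_table = rows_to_columns(match_rows)
joined = equi_join(first_table, second_table)
for url, numbers in joined:
    print(url, numbers)
```

The join compares URLs only for equality, which is what allows it to be cheaper than running the regular expressions against every row of the first table.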
In a preferred embodiment of the present application, the regular expression numbers corresponding to the URLs before deduplication may be added correspondingly to the first table; for example, the regular expression numbers are added to the columns to the right of the URL column in the first table and associated with the URLs.
In another preferred embodiment of the present application, the regular expression numbers corresponding to the URLs before deduplication may replace the corresponding URLs in the first table; that is, for each URL in the first table, the regular expression numbers of the identical URL in the second table are added to the first table in place of the original URL.
For a massive number of URLs, the present application performs regular-expression matching only on the distinct URLs, and can therefore very effectively minimize the computational cost of URL regular-expression matching.
Step 105: count the different regular expression numbers corresponding to the URLs before deduplication.
In a specific implementation, step 105 may be to calculate, for each distinct regular expression number, the number of times it occurs among all URLs before deduplication; based on the keywords or categories corresponding to the different regular expressions, various information about visitors' access to the website can then be compiled.
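The counting in step 105 can be sketched as follows, assuming the joined result is available as a list of (URL, numbers) pairs; the variable name `joined` and this in-memory representation are illustrative assumptions.

```python
from collections import Counter

MEN_URL = "http://men.taobao.com/123456"
WOMEN_URL = "http://women.taobao.com/123456"

# Every URL before deduplication, paired with its regular expression numbers.
joined = [(MEN_URL, ["men"])] * 3 + [(WOMEN_URL, ["men", "women"])] * 3

# Count how often each regular expression number occurs among all URLs.
counts = Counter(number for _, numbers in joined for number in numbers)
print(counts)
```

For the running example the number men occurs 6 times (3 visits to each URL) and the number women occurs 3 times.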
In a specific implementation, the present application may be implemented on data warehouse platforms such as Hadoop or Hive.
In summary, according to the present application, for the URLs in a massive web log, the duplicate URLs are removed first, and regular-expression matching is then performed on the deduplicated URLs. Because URLs are accessed repeatedly many times in a massive web log, after deduplication the regular-expression matching cost for identical URLs is incurred only once, and the matching result of a deduplicated URL yields the regular expressions corresponding to all URLs identical to it. The computational cost of URL regular-expression matching can therefore be very effectively minimized.
The present application may store the URLs before and after deduplication in tables and perform an equi-join on the URL columns before and after deduplication to find the correspondence between every URL before deduplication and its regular expressions; compared with the non-equi-join that regular-expression matching amounts to, this reduces the computational cost. Moreover, when performing the equi-join, the regular expression numbers may be chosen to replace the corresponding URLs in the table, so that the result contains only the numbers of the matched regular expressions; compared with keeping the URLs, this greatly reduces the column width of the table and uses fewer resources.
The method embodiment is, for simplicity of description, expressed as a series of combined actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
Referring to Figure 2, a structural block diagram of an embodiment of an apparatus for analyzing URLs in web logs according to the present application is shown, which may specifically include the following modules:
a URL extraction module 201, configured to extract the URLs from a web log;
a URL deduplication module 202, configured to deduplicate the URLs;
a regular-expression matching module 203, configured to apply a plurality of preset regular expressions in turn to the deduplicated URLs to perform regular-expression matching, and to extract the numbers of the regular expressions that match each deduplicated URL;
a matching result copying module 204, configured to copy, for each URL before deduplication, the regular expression numbers of the deduplicated URL identical to it as its corresponding regular expression numbers;
a statistical module 205, configured to count the different regular expression numbers corresponding to the URLs before deduplication.
In a preferred embodiment of the present application, the URLs before deduplication and the URLs after deduplication may be stored as columns in a first table and a second table, respectively; the regular expression numbers corresponding to the deduplicated URLs may be stored correspondingly in the second table.
In a preferred embodiment of the present application, the matching result copying module may comprise:
a row-to-column submodule, configured to perform a row-to-column conversion on the data of the second table;
an equi-join submodule, configured to perform an equi-join on the URL columns of the first table and the second table, so that every URL before deduplication finds its corresponding regular expression numbers.
In a preferred embodiment of the present application, the regular expression numbers corresponding to the URLs before deduplication may be added correspondingly to the first table.
In a preferred embodiment of the present application, the regular expression numbers corresponding to the URLs before deduplication may replace the corresponding URLs in the first table.
In a preferred embodiment of the present application, the statistical module may be:
a computing module, configured to calculate, for each distinct regular expression number, the number of times it occurs among all URLs before deduplication.
In a preferred embodiment of the present application, the number of a regular expression may be the number of the business category to which it belongs.
Because the apparatus embodiment substantially corresponds to the foregoing method embodiment shown in Figures 1 and 2, parts not described in detail in this embodiment may be found in the related description of the foregoing embodiment and are not repeated here.
The present application can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
In this document, the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that comprises the element.
The method and the apparatus for analyzing URLs in web logs provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific embodiments and the scope of application in accordance with the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.