CN105005600B - Preprocessing method of URL (Uniform Resource Locator) in access log - Google Patents

Publication number
CN105005600B
Authority
CN
China
Prior art keywords
url
referer
request
rule
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510383588.8A
Other languages
Chinese (zh)
Other versions
CN105005600A (en
Inventor
陈静
房鹏展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Focus Technology Co Ltd
Original Assignee
Focus Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Focus Technology Co Ltd
Priority to CN201510383588.8A
Publication of CN105005600A
Application granted
Publication of CN105005600B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method for preprocessing URLs (Uniform Resource Locators) in a website access log. The method comprises the following steps: Step 11, website URL collection: organizing and summarizing the website's URL address system; Step 12, URL configuration and storage: configuring the website URLs collected in Step 11 and storing them in a URL rule storage table, wherein the table comprises the fields URL unique code, URL identification rule, URL name and URL matching order; Step 13, retrieving the information in the URL rule storage table obtained in Step 12 and sorting it by URL matching order so that each parent URL (a rule string that contains another rule) is placed before its child URL; Step 14, obtaining the access log records, each of which includes the visitor IP, the access time, the REFERER and the REQUEST; Step 15, matching the REFERER and the REQUEST of each access log record from Step 14 against the URL identification rules in the URL rule storage table, in the order obtained in Step 13; and Step 16, obtaining the REFERER and REQUEST values that were not matched in Step 15, i.e. the records whose REFERER code or REQUEST code is -1 or a null value.

Description

Preprocessing method for URLs in an access log
Technical field
The present invention relates to the field of website analytics, and in particular to a method and device for preprocessing URLs in website access logs.
Background
Analysis of visitor access paths provides important data to support and guide the optimization of a website's structure and page layout and the understanding of visitors' behavioral preferences. The basic data for path analysis comes from the website's access log, which records the visitor's IP address, the access time, the REFERER (the previously visited page), the REQUEST (the page currently being accessed) and similar information. Of these, REFERER and REQUEST are the key information for building the set of visited pages and the access paths.
The REFERER and REQUEST recorded in the access log take the form of URL addresses. For example, the URL (Uniform Resource Locator, i.e. the address of a web page) of the home page of Made-in-China.com (hereinafter "MIC") is
"www.made-in-china.com". Path analysis based on the raw REFERER and REQUEST values recorded in the access log runs into one problem: the values are too detailed, which hinders subsequent statistical analysis and the extraction of access paths. For example, MIC visitors mainly arrive at the MIC search list page from GOOGLE, and the URL of the search list page differs for each search word or search condition. A search for "led" yields the search list page URL
"www.made-in-china.com/productdirectory.do?word=led&subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1".
A search for "led light" yields the search list page URL
"www.made-in-china.com/productdirectory.do?subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1&word=led+light". For access-path analysis, specific URLs such as the two above need to be generalized and classified, for example by identifying both as the "MIC search list page"; only then can the access paths of visitors across the whole website be analyzed.
Current research on access paths concentrates mainly on collecting each visitor's set of visited pages and building the access paths. How to preprocess the REFERER and REQUEST in the access log is seldom addressed, even though this step is an important prerequisite for building access paths.
The HTTP Referer is part of the request header. When a browser sends a request to a web server, it generally includes the Referer, which tells the server which page the request was linked from (that is, the previously visited page); the server can use this information in its processing. The REQUEST is the request line that a client (typically a browser) sends when issuing a request to a web server, and it contains the URL of the requested page.
Summary of the invention
Object of the invention: The present invention provides a method and device for preprocessing the raw URLs recorded in a website access log, solving the problem of preprocessing the REFERER and REQUEST data recorded in the access log for path analysis.
A method for preprocessing URLs in a website access log comprises the following steps:
S11: Collect website URLs, i.e. organize and summarize the website's URL address system. Collect the URLs of the website's main or important pages and confirm the basic information of these URLs, including the URL recognition rule and the URL name. A URL recognition rule is the structural feature shared by the URLs of a class of pages, obtained by analyzing and generalizing the URLs of the original pages; a URL recognition rule can be described with a regular expression;
S12: Configure and store URLs. Configure the website URLs collected in S11 and store them in a URL rule storage table. The table contains the following fields: URL unique code, URL recognition rule, URL name and URL matching order. The "URL unique code" marks the unique identity of each URL recognition rule and is generated automatically by the database; the "URL recognition rule" and "URL name" come from step S11; the "URL matching order" controls the order in which rules are matched;
The "URL matching order" is determined as follows: if B_URL is a substring of A_URL, then A_URL and B_URL are said to have a string containment relationship, in which A_URL is the parent URL and B_URL is the child URL. The matching order then places A_URL before B_URL, i.e. the parent URL precedes the child URL;
S13: Retrieve the information in the URL rule storage table obtained in S12 and sort it by "URL matching order", ensuring that each parent URL precedes its child URLs;
S14: Obtain the access log records, each including the visitor's IP, the access time, the REFERER (the previously visited page) and the REQUEST (the page being accessed);
S15: Match the REFERER and REQUEST of each access log record from S14 against the URL recognition rules in the URL rule storage table, in the order obtained in S13. If a match succeeds, record the URL unique code of the matching rule as the REFERER code or REQUEST code. If the REFERER or REQUEST matches no URL recognition rule, use -1 or a null value as the REFERER code or REQUEST code;
Preferably, matching the REFERER and REQUEST of each access log record from step S14 against the URL recognition rules in the URL rule storage table in step S15, in the order from S13, includes the following:
When the REFERER or REQUEST has a string containment relationship with a URL recognition rule in the URL rule storage table, i.e. the REFERER or REQUEST is the parent URL of that rule (it contains the rule as a substring), the REFERER or REQUEST matches that rule;
If the REFERER or REQUEST matches several URL recognition rules in the URL rule storage table, the rule ranked first in the order from S13 is taken;
S16: Obtain the REFERER and REQUEST values that were not matched successfully in S15, i.e. the records whose REFERER code or REQUEST code is -1 or a null value, and merge all unmatched REFERER and REQUEST values to obtain the unmatched URL set;
S17: Perform statistical analysis on the unmatched URL set from S16 to find the URLs that occur most often in it. These unmatched URLs are then added to the URL rule configuration table (optionally in combination with manual judgment and monitoring), so that the URL recognition rules in the table are continually improved.
The present invention also provides a device for preprocessing URLs in an access log, comprising:
URL collection unit: collects the URLs of the website and determines the URL recognition rule and URL name of each. A URL recognition rule is the structural feature shared by the URLs of a class of pages, obtained by analyzing and generalizing the URLs of the original pages; a URL recognition rule can be described with a regular expression;
URL rule storage unit: stores the URL recognition rules and related information, including the URL unique code, URL recognition rule, URL name and URL matching order. The "URL unique code" marks the unique identity of each URL recognition rule; the "URL recognition rule" and "URL name" come from the URL collection unit; the "URL matching order" controls the order in which rules are matched.
Preferably, the URL rule storage unit comprises:
URL rule configuration module: determines the "URL unique code" and the "URL matching order". The "URL unique code" can be generated automatically by the database or assigned manually, as long as URL unique codes and URL recognition rules are in a one-to-one relationship. The "URL matching order" is determined as follows: if B_URL is a substring of A_URL (for example, A_URL is "abcd" and B_URL is "abc"), then A_URL and B_URL are said to have a string containment relationship, in which A_URL is the parent URL and B_URL is the child URL. The matching order then places A_URL before B_URL, i.e. the parent URL precedes the child URL.
URL rule storage module: stores the URL rule storage table, i.e. the URL recognition rules and related information, including the URL unique code, URL recognition rule, URL name and URL matching order;
URL rule unit: sorts the URL recognition rules by URL matching order and retrieves the URL recognition rules and URL unique codes in that order;
Log record acquisition unit: obtains each record in the access log, including the visitor's IP, the access time, the REFERER (the previously visited page), the REQUEST (the page being accessed) and similar information;
URL matching unit: matches the REFERER and REQUEST of each access log record against the URL recognition rules. It takes one log record and matches its REFERER or REQUEST against the URL recognition rules one by one, in the order in which the rules were retrieved. If the REFERER or REQUEST is the parent URL of a URL recognition rule (it contains the rule as a substring), the match succeeds and the URL unique code of that rule is taken as the REFERER code or REQUEST code. If the REFERER or REQUEST has no string containment relationship with any URL recognition rule, the REFERER code or REQUEST code is given a special mark, such as "-1" or a null value. This completes the matching of the log record. The next log record is then taken and matched in the same way, until all log records have been matched;
Matching result set storage unit: stores the results of matching the access log against the URL recognition rules, including the original information in the access log (IP, access time, REFERER, REQUEST and so on) together with the REFERER code and REQUEST code described above;
Unmatched URL monitoring unit, comprising:
Unmatched data acquisition unit: obtains the REFERER and REQUEST values in the matching result set that were not matched successfully and merges them into the unmatched URL set;
Unmatched data statistics module: counts the number of records for each URL in the unmatched URL set;
Unmatched data monitoring module: sorts the URLs in the unmatched URL set in descending order of record count, producing the collected set of unmatched URLs. In light of actual business needs, it can then be decided whether to configure these URLs in the URL rule storage table; if so, the process returns to the URL collection unit and is executed again as described above, until every URL that needs to be analyzed has been added to the URL rule storage table.
The beneficial effects of the invention are as follows. The invention provides a method for preprocessing the raw URLs recorded in a website access log, which solves the problem of preprocessing the REFERER and REQUEST data recorded in the access log for path analysis:
1) By collecting the website's page URLs into a URL rule storage table and matching the REFERER and REQUEST recorded in the raw access log against the URL recognition rules in that table, each REFERER and REQUEST is encoded and named. The raw URL address format of REFERER and REQUEST is thereby converted into codes and business names that are convenient for subsequent statistical analysis and application.
2) By monitoring and analyzing the unmatched URL set, the URL rule storage table can be continually improved until it gradually covers all pages of the website, so that as many access log records as possible obtain a REFERER code and REQUEST code by matching. This provides well-prepared preprocessed data for subsequent analysis based on the access log.
Brief description of the drawings
Fig. 1 is a flow chart of a method for preprocessing URLs in a website access log according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a device for preprocessing URLs in a website access log according to an embodiment of the present invention.
Detailed description
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The described embodiments are only some, not all, of the embodiments of the invention. Changes or equivalent variations made on the basis of the embodiments of this application and the technical spirit of the claims still fall within the scope of protection of this application.
Referring to Fig. 1, the implementation steps of this application are as follows:
S11: Collect website URLs, i.e. organize and summarize the website's URL address system. In the initial stage, collection can be done manually: the URLs of the website's main or important pages are collected by hand, and the basic information of these URLs is confirmed, including the URL recognition rule and URL name. A URL recognition rule is the structural feature shared by the URLs of a class of pages, obtained by analyzing and generalizing the URLs of the original pages.
For example, the URLs of the product search list page of Made-in-China.com all begin with "www.made-in-china.com/productdirectory.do", so the recognition rule of the product search list page is "www.made-in-china.com/productdirectory.do". URL recognition rules can also be described with regular expressions.
The URL of a product search list page has the following form:
www.made-in-china.com/productdirectory.do?word=led&subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1
Its characteristic feature is that it begins with "www.made-in-china.com/productdirectory.do"; the parameters that follow, such as "word", record information such as the search word used. Accordingly, if a URL begins with "www.made-in-china.com/productdirectory.do", that URL is a product search list page.
The home page URL of Made-in-China.com is: www.made-in-china.com.
The special-topic home page URL of Made-in-China.com is: www.made-in-china.com/special.
The URL of a special-topic detail page of Made-in-China.com (for example the magic-show topic) is:
www.made-in-china.com/special/magic-show/.
The URL recognition rules and URL names of the four pages above can then be, respectively:
"www.made-in-china.com/productdirectory.do", "product search list page";
"www.made-in-china.com$", "MIC home page";
"www.made-in-china.com/special", "special-topic home page";
"www.made-in-china.com/special/", "special-topic detail page".
Here, the "$" in the MIC home page rule is regular-expression notation meaning that the string ends with the characters preceding "$"; in this case it denotes all strings ending with "www.made-in-china.com".
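As a minimal sketch of how the four recognition rules above behave when treated as regular expressions, the following Python snippet tests each rule with `re.search`. The `RULES` dictionary, the function name `matching_names` and the choice of `re.search` are illustrative assumptions, not part of the patent text. Note that an unescaped "." in these rules matches any character, which is harmless for this example.

```python
import re

# Recognition rule -> URL name, as listed above (the dict itself is illustrative).
RULES = {
    r"www.made-in-china.com/productdirectory.do": "product search list page",
    r"www.made-in-china.com$": "MIC home page",
    r"www.made-in-china.com/special": "special-topic home page",
    r"www.made-in-china.com/special/": "special-topic detail page",
}


def matching_names(url):
    """Return the names of every rule that is found somewhere in the URL."""
    return [name for rule, name in RULES.items() if re.search(rule, url)]


print(matching_names("www.made-in-china.com"))
print(matching_names("www.made-in-china.com/special/magic-show/"))
```

The second URL satisfies both special-topic rules, which is why the rules need an explicit matching order before they are applied.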
S12: Configure and store URLs. Configure the website URLs collected in S11 and store them in a URL rule storage table. The table contains the following fields: URL unique code, URL recognition rule, URL name and URL matching order. The "URL unique code" marks the unique identity of each URL recognition rule and can be generated automatically by the database; the "URL recognition rule" and "URL name" come from step S11; the "URL matching order" controls the order in which rules are matched.
The "URL matching order" is determined as follows: if B_URL is a substring of A_URL (for example, A_URL is "abcd" and B_URL is "abc"), then A_URL and B_URL are said to have a string containment relationship, in which A_URL is the parent URL and B_URL is the child URL. The matching order then places A_URL before B_URL, i.e. the parent URL precedes the child URL.
Specifically, suppose the first page URL is being configured in the URL rule storage table, taking the product search list page of Made-in-China.com as an example. The URL unique code, URL recognition rule, URL name and URL matching order are then, respectively:
"1001", "www.made-in-china.com/productdirectory.do", "product search list page", "product search list page". Note that at this point no other URL recognition rule has been configured in the URL rule storage table, so the value of "URL matching order" can be arbitrary; here the URL name is used as the value of the URL matching order.
The URL unique codes, URL recognition rules, URL names and URL matching orders of the four pages above are, respectively:
"1002", "www.made-in-china.com/productdirectory.do", "product search list page", "product search list page";
"1003", "www.made-in-china.com$", "MIC home page", "MIC home page";
"1004", "www.made-in-china.com/special", "special-topic home page", "special-topic page 2";
"1005", "www.made-in-china.com/special/", "special-topic detail page", "special-topic page 1".
Here, the recognition rule of the special-topic detail page (www.made-in-china.com/special/) is the parent URL of the recognition rule of the special-topic home page (www.made-in-china.com/special). The URL matching orders of the special-topic detail page and the special-topic home page are therefore "special-topic page 1" and "special-topic page 2" respectively, which ensures that when the rules are sorted in ascending order of URL matching order, the special-topic detail page comes before the special-topic home page.
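The storage table and the ordering constraint can be sketched with Python's built-in `sqlite3` module. The table name `url_rule`, its column names, the English matching-order strings and the helper functions are assumptions made for illustration; the patent only names the four fields. The check in `parents_first` confirms that no rule is placed before a longer rule that contains it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE url_rule (
        url_code    INTEGER PRIMARY KEY,   -- URL unique code
        url_rule    TEXT NOT NULL UNIQUE,  -- URL recognition rule
        url_name    TEXT NOT NULL,         -- URL name
        match_order TEXT NOT NULL          -- URL matching order
    )
    """
)
conn.executemany(
    "INSERT INTO url_rule VALUES (?, ?, ?, ?)",
    [
        (1002, "www.made-in-china.com/productdirectory.do",
         "product search list page", "product search list page"),
        (1003, "www.made-in-china.com$", "MIC home page", "MIC home page"),
        (1004, "www.made-in-china.com/special",
         "special-topic home page", "special-topic page 2"),
        (1005, "www.made-in-china.com/special/",
         "special-topic detail page", "special-topic page 1"),
    ],
)


def load_ordered_rules(connection):
    """Read (code, rule) pairs in ascending URL matching order."""
    return connection.execute(
        "SELECT url_code, url_rule FROM url_rule ORDER BY match_order"
    ).fetchall()


def parents_first(ordered):
    """True if no rule appears before a longer rule that contains it."""
    rules = [rule for _, rule in ordered]
    return not any(
        earlier in later
        for i, earlier in enumerate(rules)
        for later in rules[i + 1:]
    )


ordered = load_ordered_rules(conn)
print([code for code, _ in ordered])
print(parents_first(ordered))
```

With SQLite's default binary collation, sorting these particular English strings happens to give the order 1003, 1002, 1005, 1004; a numeric matching-order column would be a more robust choice in practice.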
S13: Retrieve the information in the URL rule storage table obtained in S12 and sort it by "URL matching order", ensuring that each parent URL precedes its child URLs.
Specifically, retrieving the information in the URL rule storage table above and sorting it in ascending order of "URL matching order" gives:
"1003", "www.made-in-china.com$", "MIC home page", "MIC home page";
"1002", "www.made-in-china.com/productdirectory.do", "product search list page", "product search list page";
"1005", "www.made-in-china.com/special/", "special-topic detail page", "special-topic page 1";
"1004", "www.made-in-china.com/special", "special-topic home page", "special-topic page 2".
S14: Obtain the access log records, each including the visitor's IP, the access time, the REFERER (the previously visited page), the REQUEST (the page being accessed) and similar information. Specifically, the records in the access log can take the following form:
192.168.1.1, 2015-01-01 12:01:00, www.made-in-china.com, www.google.com;
192.168.1.1, 2015-01-01 12:01:30, www.made-in-china.com/special/vacuum-pump/, www.made-in-china.com;
192.168.1.1, 2015-01-01 12:01:30, sourcing.made-in-china.com/suppliers.html, www.made-in-china.com/special/vacuum-pump/;
192.168.2.1, 2015-01-01 12:02:10, www.made-in-china.com, www.google.com;
192.168.2.1, 2015-01-01 12:03:10, http://www.made-in-china.com/productdirectory.do?word=led&subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1, www.made-in-china.com;
Here, "192.168.1.1" and "192.168.2.1" are the visitors' IP addresses. The time next to the IP address is the time at which the visitor accessed the page. The URL next to the access time is the URL of the page the visitor is currently accessing, i.e. the REQUEST, such as www.made-in-china.com in the first record. The URL after the current page URL is the URL of the page the visitor accessed previously, i.e. the REFERER, such as www.google.com in the first record. In other words, the visitor jumped from the previous page (REFERER) to the current page (REQUEST); in the first record, the visitor jumped from www.google.com to www.made-in-china.com.
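A record in the illustrative format above can be split into its four fields as in the following sketch. The `LogRecord` type, the `parse_record` function and the timestamp layout `YYYY-MM-DD HH:MM:SS` are assumptions; real server logs use other formats, and this simple comma split would not handle URLs that themselves contain commas.

```python
from datetime import datetime
from typing import NamedTuple


class LogRecord(NamedTuple):
    ip: str
    access_time: datetime
    request: str  # REQUEST: the page currently being accessed
    referer: str  # REFERER: the previously visited page


def parse_record(line: str) -> LogRecord:
    """Split one 'IP, time, REQUEST, REFERER;' line into typed fields."""
    ip, time_text, request, referer = (
        part.strip() for part in line.strip().rstrip(";").split(",")
    )
    access_time = datetime.strptime(time_text, "%Y-%m-%d %H:%M:%S")
    return LogRecord(ip, access_time, request, referer)


record = parse_record(
    "192.168.1.1, 2015-01-01 12:01:00, www.made-in-china.com, www.google.com;"
)
print(record.referer, "->", record.request)
```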
S15: Match the REFERER and REQUEST of each access log record from S14 against the URL recognition rules in the URL rule storage table, in the order obtained in S13. If a match succeeds, record the URL unique code of the matching rule as the REFERER code or REQUEST code. If the REFERER or REQUEST matches no URL recognition rule, use -1 or a null value as the REFERER code or REQUEST code.
Preferably, in this application, matching the REFERER and REQUEST of each access log record from S14 against the URL recognition rules in S15, in the order from S13, includes the following:
When the REFERER or REQUEST has a string containment relationship with a URL recognition rule in the URL rule storage table, i.e. the REFERER or REQUEST is the parent URL of that rule (it contains the rule as a substring), the REFERER or REQUEST matches that rule.
If the REFERER or REQUEST matches several URL recognition rules in the URL rule storage table, the rule ranked first in the order from S13 is taken.
Specifically, the log records listed in S14 are matched against the URL rule storage table from S13.
Take the first record:
192.168.1.1, 2015-01-01 12:01:00, www.made-in-china.com, www.google.com;
The REQUEST is www.made-in-china.com, which matches the MIC home page rule in the URL rule storage table from S13, so the URL unique code "1003" of the MIC home page is taken as the REQUEST code of this record. The REFERER is www.google.com, which matches none of the URL recognition rules in the table, so "-1" is set as the REFERER code of this record.
Take the second record:
192.168.1.1, 2015-01-01 12:01:30, www.made-in-china.com/special/vacuum-pump/, www.made-in-china.com;
The REQUEST is www.made-in-china.com/special/vacuum-pump/, which matches both the special-topic detail page rule and the special-topic home page rule in the URL rule storage table from S13. The rule ranked first in the matching order is taken, so the URL unique code "1005" of the special-topic detail page becomes the REQUEST code of this record. The REFERER is www.made-in-china.com, which matches the MIC home page rule, so the REFERER code of this record is "1003".
The remaining records are processed in the same way until all log records have been matched. The final matching results for all records are as follows (IP, access time, REQUEST, REFERER, REQUEST code, REFERER code):
192.168.1.1, 2015-01-01 12:01:00, www.made-in-china.com, www.google.com, 1003, -1;
192.168.1.1, 2015-01-01 12:01:30, www.made-in-china.com/special/vacuum-pump/, www.made-in-china.com, 1005, 1003;
192.168.1.1, 2015-01-01 12:01:30, sourcing.made-in-china.com/suppliers.html, www.made-in-china.com/special/vacuum-pump/, -1, 1005;
192.168.2.1, 2015-01-01 12:02:10, www.made-in-china.com, www.google.com, 1003, -1;
192.168.2.1, 2015-01-01 12:03:10, http://www.made-in-china.com/productdirectory.do?word=led&subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1, www.made-in-china.com, 1002, 1003;
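The first-match-wins procedure of S15 can be sketched as follows. The rule list is hard-coded in the S13 order, rules are applied with `re.search`, and the names `ORDERED_RULES`, `UNMATCHED` and `encode` are illustrative assumptions rather than anything defined in the patent.

```python
import re

# (URL unique code, recognition rule), already sorted as in S13.
ORDERED_RULES = [
    (1003, r"www.made-in-china.com$"),
    (1002, r"www.made-in-china.com/productdirectory.do"),
    (1005, r"www.made-in-china.com/special/"),
    (1004, r"www.made-in-china.com/special"),
]
UNMATCHED = -1  # special code for a URL that matches no rule


def encode(url, rules=ORDERED_RULES):
    """Return the code of the first rule found in the URL, else UNMATCHED."""
    for code, pattern in rules:
        if re.search(pattern, url):
            return code
    return UNMATCHED


# (REQUEST, REFERER) pairs from the five example records.
PAIRS = [
    ("www.made-in-china.com", "www.google.com"),
    ("www.made-in-china.com/special/vacuum-pump/", "www.made-in-china.com"),
    ("sourcing.made-in-china.com/suppliers.html",
     "www.made-in-china.com/special/vacuum-pump/"),
    ("www.made-in-china.com", "www.google.com"),
    ("http://www.made-in-china.com/productdirectory.do?word=led&subaction=hunt",
     "www.made-in-china.com"),
]

codes = [(encode(request), encode(referer)) for request, referer in PAIRS]
print(codes)
```

Because the special-topic detail rule precedes the home page rule, a URL such as www.made-in-china.com/special/vacuum-pump/ is coded 1005 rather than 1004.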
S16: Obtain the REFERER and REQUEST values that were not matched successfully in S15, i.e. the records whose REFERER code or REQUEST code is -1 or a null value, and merge all unmatched REFERER and REQUEST values to obtain the unmatched URL set.
Specifically, the REFERER values not matched successfully in S15 are:
www.google.com;
www.google.com.
The REQUEST value not matched successfully is:
sourcing.made-in-china.com/suppliers.html.
Merging them gives the unmatched URL set:
www.google.com;
www.google.com;
sourcing.made-in-china.com/suppliers.html.
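Collecting the unmatched values from the matching results might look like the sketch below. The tuple layout of `RESULTS` and the function names are assumptions for illustration, and the query string of the last REQUEST is shortened for readability.

```python
UNMATCHED = -1

# (REQUEST, REFERER, REQUEST code, REFERER code) from the S15 results.
RESULTS = [
    ("www.made-in-china.com", "www.google.com", 1003, -1),
    ("www.made-in-china.com/special/vacuum-pump/", "www.made-in-china.com", 1005, 1003),
    ("sourcing.made-in-china.com/suppliers.html",
     "www.made-in-china.com/special/vacuum-pump/", -1, 1005),
    ("www.made-in-china.com", "www.google.com", 1003, -1),
    ("http://www.made-in-china.com/productdirectory.do?word=led",
     "www.made-in-china.com", 1002, 1003),
]


def is_unmatched(code):
    """A code of -1 or None (null) marks a URL that matched no rule."""
    return code is None or code == UNMATCHED


def unmatched_url_set(results):
    """Merge unmatched REFERER values, then unmatched REQUEST values."""
    referers = [ref for _, ref, _, ref_code in results if is_unmatched(ref_code)]
    requests = [req for req, _, req_code, _ in results if is_unmatched(req_code)]
    return referers + requests


print(unmatched_url_set(RESULTS))
```

Duplicates are kept on purpose, since the next step counts how often each unmatched URL occurs.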
S17: Perform statistical analysis on the unmatched URL set from S16 to find the URLs that occur most often in it. In combination with manual judgment and monitoring, these unmatched URLs can be added to the URL rule configuration table, so that the URL recognition rules in the table are continually improved.
Specifically, the unmatched URLs in the unmatched URL set from S16 are www.google.com and sourcing.made-in-china.com/suppliers.html. Of these, www.google.com is a major search engine and the entry point for most external visitors, so it is a traffic source that most websites should pay attention to. Therefore, www.google.com can also be collected and configured in the URL rule storage table, after which S11 to S13 are repeated.
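The frequency analysis in S17 can be sketched with `collections.Counter` from the Python standard library; the variable names are illustrative, and deciding which of the ranked URLs deserve a new rule remains a manual or business decision as described above.

```python
from collections import Counter

# Unmatched URL set produced in S16 (duplicates are kept for counting).
unmatched = [
    "www.google.com",
    "www.google.com",
    "sourcing.made-in-china.com/suppliers.html",
]

# Rank unmatched URLs by number of records, most frequent first.
ranking = Counter(unmatched).most_common()
for url, count in ranking:
    print(count, url)
```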
It sets each functional module to the URL pretreatment units also provided in a kind of access log of the invention in the above method, URL collector units:For collecting the URL of website, and determine URL recognition rules, URL name etc..Wherein, URL recognition rules Refer to the constitutive characteristic of the URL for analyzing and concluding a certain class page for drawing according to the URL of original web page, such as made in China net The URL addresses of product search list page be all with " www.made-in-china.com/productdirectory.do" open Head, then the recognition rule of product search list page is exactly " www.made-in-china.com/ productdirectory.do”.And, URL recognition rules can be described using regular expression.
URL rule storage units:Recognition rule and relevant information for storing URL, including:URL unique encodings, URL recognition rules, URL name, URL matching order.Wherein, " URL unique encodings " are used to mark each URL recognition rule Unique identities;" URL recognition rules " and " URL name " derives from URL collector units;For controlling URL matching order.
Log recording acquiring unit:For obtaining each record in access log, including the IP of visitor, access time, The information such as REFERER (page that the last time accesses), REQUEST (page of access).
URL matching units:For the REFERER and REQUEST of each record in access log to be recognized into rule with URL Then matched.A log recording is taken out, and the order that REFERER or REQUEST is obtained according to URL recognition rules is one by one Matched with URL recognition rules, if REFERER or REQUEST are female URL of a certain URL recognition rules, the match is successful And the URL unique encodings of the URL recognition rules are taken out as REFERER codings or REQUEST codings;If REFERER or REQUEST and any one URL recognition rule all do not have character string inclusion relation, then compile REFERER codings or REQUEST Code does special marking, such as labeled as " -1 " or null value so far completes the matching of this log recording and jumps out this time matching. Then, a log recording is removed, is matched according to the method described above, until all matching completions of all of log recording.
Matching result collection memory cell:Matching result for storing access log and URL recognition rules, including:Access Raw information such as IP, access time, REFERER, REQUEST in daily record etc., and above-mentioned REFERER codings, REQUEST are compiled Code.
Unmatched URL monitoring unit: used to analyze the REFERER and REQUEST values in the access log that were not successfully matched, so that the URL recognition rules can be improved to cover all pages of the website, achieving gradual refinement and optimization.
The unmatched URL monitoring unit includes:
Unmatched data acquisition unit: used to obtain the REFERER and REQUEST values in the matching result set that were not successfully matched, and to merge them into an unmatched URL set.
Unmatched data statistics module: counts the number of records for each URL in the unmatched URL set.
Unmatched data monitoring module: sorts the URLs in the unmatched URL set in descending order of record count, yielding the collection of unmatched URLs. In combination with actual business needs, it can then be decided whether to configure these URLs in the URL rule storage table. If configuration is needed, the process returns to the URL collection unit and proceeds as described above, until all URLs that need to be analyzed have been added to the URL rule storage table.
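The three sub-units above amount to collecting the unmatched URLs, counting them, and sorting the counts in descending order. A minimal sketch follows, assuming each matching result is a dictionary; the key names and the function name are hypothetical, and both "-1" and a null value are treated as the unmatched mark.

```python
from collections import Counter
from typing import Iterable, List, Tuple

UNMATCHED_MARKS = (-1, None)  # "-1" or a null value


def unmatched_url_counts(results: Iterable[dict]) -> List[Tuple[str, int]]:
    """Count unmatched REFERER and REQUEST values, most frequent first."""
    counts: Counter = Counter()
    for row in results:
        if row["referer_code"] in UNMATCHED_MARKS:
            counts[row["referer"]] += 1
        if row["request_code"] in UNMATCHED_MARKS:
            counts[row["request"]] += 1
    # most_common() returns (url, count) pairs in descending order of count
    return counts.most_common()
```

The URLs at the head of the returned list are the natural candidates for review and for new recognition rules.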
The method and system provided by the present invention have been described in detail above, but these descriptions shall not be construed as limiting the scope of the invention. The scope of protection of the present invention is defined by the appended claims, and any modification made on the basis of the claims falls within the scope of protection of the present invention.

Claims (3)

1. A method for preprocessing URLs in a website log, characterized in that it comprises the following steps:
S11: Collect website URLs, i.e., sort out and summarize the URL address system of the website; collect the URLs of the main or important pages of the website and confirm the basic information of these URLs, including the URL recognition rule and URL name; wherein a URL recognition rule is the structural feature of the URLs of a class of pages, derived by analyzing and generalizing the URLs of the original web pages; a URL recognition rule can be described using a regular expression;
S12: Configure and store URLs, i.e., configure the website URLs collected in S11 and store them in a URL rule storage table; the URL rule storage table includes the following fields: URL unique code, URL recognition rule, URL name, URL matching order; wherein the "URL unique code" marks the unique identity of each URL recognition rule and is generated automatically by the database; the "URL recognition rule" and "URL name" come from step S11; the "URL matching order" controls the order in which URLs are matched;
The "URL matching order" is determined as follows: if B_URL is a substring of A_URL, then A_URL and B_URL are said to have a string containment relationship, in which A_URL is the parent URL and B_URL is the child URL; the matching order of A_URL and B_URL is then A_URL first and B_URL second, i.e., the parent URL is placed before the child URL;
S13: Retrieve the information in the URL rule storage table obtained in S12 and sort it by "URL matching order", ensuring that parent URLs come before child URLs;
S14: Obtain the records of the access log, including the visitor's IP, the access time, the previously visited page REFERER, and the visited page REQUEST;
S15: Match the REFERER and REQUEST of each access log record from S14 against the URL recognition rules in the URL rule storage table obtained in S13, in the order obtained in S13; if a match succeeds, record the URL unique code corresponding to the URL recognition rule as the REFERER code or the REQUEST code; if REFERER or REQUEST cannot be matched with any URL recognition rule, take -1 or a null value as the REFERER code or the REQUEST code;
S16: Obtain the REFERER and REQUEST values from S15 that were not successfully matched, i.e., the records whose REFERER code or REQUEST code is -1 or a null value, and merge all unmatched REFERER and REQUEST values together to obtain an unmatched URL set;
S17: Perform statistical analysis on the unmatched URL set from S16 to obtain the URLs with the highest counts in the unmatched URL set, and add the URLs that were not successfully matched to the URL rule configuration table, thereby continuously improving the URL recognition rules in the URL rule configuration table.
2. The method for preprocessing URLs in a website log according to claim 1, characterized in that, in step S15, matching the REFERER and REQUEST of each access log record from step S14 against the URL recognition rules in the URL rule storage table obtained in S13, in the order of S13, comprises:
when REFERER or REQUEST and a URL recognition rule in the URL rule storage table have a string containment relationship, i.e., REFERER or REQUEST is the parent URL of the URL recognition rule, REFERER or REQUEST is deemed to match that URL recognition rule successfully;
if REFERER or REQUEST can be successfully matched with multiple URL recognition rules in the URL rule storage table, the URL recognition rule ranked first in the order of S13 is taken.
3. A system for preprocessing URLs in a website log using the method according to claim 1, characterized in that it comprises:
URL collection unit: used to collect the URLs of the website and to determine the URL recognition rules and URL names; wherein a URL recognition rule is the structural feature of the URLs of a class of pages, derived by analyzing and generalizing the URLs of the original web pages, and a URL recognition rule can be described using a regular expression;
URL rule storage unit: used to store the URL recognition rules and related information, including the URL unique code, URL recognition rule, URL name, and URL matching order; wherein the "URL unique code" marks the unique identity of each URL recognition rule; the "URL recognition rule" and "URL name" come from the URL collection unit; the "URL matching order" controls the order in which URLs are matched;
Log record acquisition unit: used to obtain each record in the access log, including the visitor's IP, the access time, REFERER, and REQUEST;
URL matching unit: used to match the REFERER and REQUEST of each record in the access log against the URL recognition rules; a log record is taken, and its REFERER or REQUEST is matched against the URL recognition rules one by one, in the order in which the rules were obtained; if REFERER or REQUEST is the parent URL of a URL recognition rule, the match succeeds and the URL unique code of that rule is taken as the REFERER code or REQUEST code; if REFERER or REQUEST has no string containment relationship with any URL recognition rule, the REFERER code or REQUEST code is given a special mark, such as "-1" or a null value; this completes matching for the log record, and the matching loop for the record is exited; the next log record is then taken and matched in the same way, until all log records have been matched;
Matching result set storage unit: used to store the results of matching the access log against the URL recognition rules, including the original information in the access log, such as the IP, access time, REFERER, and REQUEST, together with the REFERER code and REQUEST code described above;
Unmatched URL monitoring unit, which includes: an unmatched data acquisition unit, used to obtain the REFERER and REQUEST values in the matching result set that were not successfully matched and to merge them into an unmatched URL set; an unmatched data statistics module, which counts the number of records for each URL in the unmatched URL set; and an unmatched data monitoring module, which sorts the URLs in the unmatched URL set in descending order of record count, yielding the collection of unmatched URLs; in combination with actual business needs, it can then be decided whether to configure these URLs in the URL rule storage table; if configuration is needed, the process returns to the URL collection unit and proceeds as described above, until all URLs that need to be analyzed have been added to the URL rule storage table;
The URL rule storage unit includes:
URL rule configuration module: used to determine the "URL unique code" and the "URL matching order"; wherein the "URL unique code" is generated automatically by the database or generated manually, provided that URL unique codes and URL recognition rules are in a one-to-one relationship; the "URL matching order" is determined as follows: if B_URL is a substring of A_URL, then A_URL and B_URL are said to have a string containment relationship, in which A_URL is the parent URL and B_URL is the child URL; the matching order of A_URL and B_URL is then A_URL first and B_URL second, i.e., the parent URL is placed before the child URL;
URL rule storage module: used to store the URL rule storage table, including the URL recognition rules and related information, namely the URL unique code, URL recognition rule, URL name, and URL matching order;
URL rule unit: sorts the URL recognition rules by URL matching order and obtains the URL recognition rules and URL unique codes in that order.
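The claims state only the constraint that a parent URL must precede its child URL; they do not say how the order numbers are produced. One simple way to satisfy the constraint, offered here purely as an illustrative assumption, is to sort the recognition rules by length in descending order, since a string that strictly contains another is always longer. The function names and the second and third sample patterns are hypothetical.

```python
from typing import Dict, List


def assign_match_order(patterns: List[str]) -> Dict[str, int]:
    """Assign order numbers (1 = tried first) so that parent URLs precede child URLs."""
    ordered = sorted(patterns, key=len, reverse=True)  # longer patterns first
    return {pattern: position + 1 for position, pattern in enumerate(ordered)}


def order_is_valid(order: Dict[str, int]) -> bool:
    """Check the constraint: if b is a substring of a, then a comes before b."""
    for a in order:
        for b in order:
            if a != b and b in a and order[a] > order[b]:
                return False
    return True


EXAMPLE_ORDER = assign_match_order([
    "www.made-in-china.com/productdirectory.do",
    "www.made-in-china.com/productdirectory.do?word=",
    "www.made-in-china.com/",
])
```

Here `order_is_valid(EXAMPLE_ORDER)` holds, with the `?word=` rule tried first and the bare host rule tried last.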
CN201510383588.8A 2015-07-02 2015-07-02 Preprocessing method of URL (Uniform Resource Locator) in access log Active CN105005600B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510383588.8A CN105005600B (en) 2015-07-02 2015-07-02 Preprocessing method of URL (Uniform Resource Locator) in access log

Publications (2)

Publication Number Publication Date
CN105005600A CN105005600A (en) 2015-10-28
CN105005600B CN105005600B (en) 2017-05-24

Family

ID=54378276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510383588.8A Active CN105005600B (en) 2015-07-02 2015-07-02 Preprocessing method of URL (Uniform Resource Locator) in access log

Country Status (1)

Country Link
CN (1) CN105005600B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445700A (en) * 2016-09-20 2017-02-22 杭州华三通信技术有限公司 Method and device for uniform resource locator (URL) matching

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105763633B (en) * 2016-04-14 2019-05-21 上海牙木通讯技术有限公司 A kind of correlating method of domain name and website visiting behavior
CN107404392A (en) * 2016-05-20 2017-11-28 中兴通讯股份有限公司 The processing method and processing device of the scheduling rule of uniform resource position mark URL
CN106330563B (en) * 2016-08-30 2019-09-17 北京神州绿盟信息安全科技股份有限公司 A kind of method and device of determining Intranet http communication stream service type
CN106445815B (en) * 2016-09-06 2019-04-23 优酷网络技术(北京)有限公司 A kind of automated testing method and device
CN107317892B (en) * 2017-06-30 2020-08-07 北京知道创宇信息技术股份有限公司 Network address processing method, computing device and readable storage medium
CN107330090A (en) * 2017-07-04 2017-11-07 北京锐安科技有限公司 A kind of information processing method and device
CN109995889B (en) * 2018-01-02 2022-02-25 中国移动通信有限公司研究院 Method and device for updating mapping relation table, gateway equipment and storage medium
CN109242528A (en) * 2018-07-26 2019-01-18 焦点科技股份有限公司 A kind of the funnel analysis method and device in the customized path of electric business platform
CN111162956B (en) * 2018-11-08 2021-07-30 优信数享(北京)信息技术有限公司 Log recording method and device
CN111368227B (en) * 2018-12-25 2023-06-27 阿里巴巴集团控股有限公司 URL processing method and device
CN115577197B (en) * 2022-12-07 2023-10-27 杭州城市大数据运营有限公司 Component discovery method, system and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103297435A (en) * 2013-06-06 2013-09-11 中国科学院信息工程研究所 Abnormal access behavior detection method and system on basis of WEB logs
CN103377260A (en) * 2012-04-28 2013-10-30 阿里巴巴集团控股有限公司 Analysis method and device of URLs (Uniform Resource Locator) of weblog

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080209030A1 (en) * 2007-02-28 2008-08-28 Microsoft Corporation Mining Web Logs to Debug Wide-Area Connectivity Problems

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445700A (en) * 2016-09-20 2017-02-22 杭州华三通信技术有限公司 Method and device for uniform resource locator (URL) matching
CN106445700B (en) * 2016-09-20 2019-11-12 新华三技术有限公司 A kind of URL matching process and device

Also Published As

Publication number Publication date
CN105005600A (en) 2015-10-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant