CN105005600B - Preprocessing method of URL (Uniform Resource Locator) in access log - Google Patents
- Publication number
- CN105005600B, CN201510383588.8A
- Authority
- CN
- China
- Prior art keywords
- url
- referer
- request
- rule
- matching
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a method for preprocessing URLs (Uniform Resource Locators) in a website access log. The method comprises the following steps: Step 11, website URL collection: organizing and summarizing the website's URL address scheme; Step 12, URL configuration and storage: configuring the website URLs collected in Step 11 and storing them in a URL rule storage table, wherein the table comprises the following fields: a URL unique code, a URL identification rule, a URL name and a URL matching order; Step 13, retrieving the information in the URL rule storage table obtained in Step 12 and sorting it by URL matching order so that a parent URL is always placed before its child URL; Step 14, obtaining access log records, each of which includes the visitor IP, the access time, REFERER information and REQUEST information; Step 15, matching the REFERER and the REQUEST of each access log record from Step 14 against the URL identification rules in the URL rule storage table, in the order obtained in Step 13; and Step 16, obtaining the records whose REFERER or REQUEST was not successfully matched in Step 15, that is, whose code is -1 or a null value.
Description
Technical field
The present invention relates to the field of website analytics, and in particular to a method and device for preprocessing URLs in website access logs.
Background
Analysis of website access paths provides important data support and guidance for optimizing a website's structure and page layout and for understanding visitors' behavioral preferences. The basic data for such path analysis comes from the website's access log, which records each visitor's IP address, the access time, the REFERER (the previously visited page), the REQUEST (the currently visited page) and other information. Of these, REFERER and REQUEST are the main information used to build the set of visited pages and the access path.
The REFERER and REQUEST recorded in the access log take the form of URL addresses. For example, the URL (uniform resource locator, i.e. the address of a WWW page) of the homepage of Made-in-China.com (hereinafter "MIC") is "www.made-in-china.com". A problem arises when path analysis is performed directly on the original REFERER and REQUEST values recorded in the access log: they are too detailed, which hinders subsequent statistical analysis and the extraction of access paths. For example, MIC visitors mainly arrive at the MIC search list page through GOOGLE, and the URL of the search list page differs for each search term or search condition. Searching for "led" gives the search list page URL
"www.made-in-china.com/productdirectory.do?word=led&subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1".
Searching for "led light" gives the search list page URL
"www.made-in-china.com/productdirectory.do?subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1&word=led+light".
In practice, access path analysis requires that specific URLs like the two above be generalized and classified, for example by identifying both as the "MIC search list page". Only then can the access paths of visitors across the whole website be analyzed.
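As a minimal illustration of this kind of classification, the sketch below labels both search URLs with the same page type by comparing them against a shared prefix. The helper name `is_search_list_page` is hypothetical, and the `?` before the query string is an assumption about the original URLs.

```python
# Hypothetical illustration: two detailed search URLs share one page-type prefix.
SEARCH_PREFIX = "www.made-in-china.com/productdirectory.do"

urls = [
    "www.made-in-china.com/productdirectory.do?word=led&subaction=hunt&style=b"
    "&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1",
    "www.made-in-china.com/productdirectory.do?subaction=hunt&style=b&mode=and"
    "&code=0&comProvince=nolimit&order=0&isOpenCorrection=1&word=led+light",
]


def is_search_list_page(url: str) -> bool:
    """Return True when the URL begins with the search-list prefix."""
    return url.startswith(SEARCH_PREFIX)


# Both detailed URLs collapse into the same label.
labels = ["MIC search list page" if is_search_list_page(u) else "other" for u in urls]
```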
At present, research on access paths concentrates mainly on collecting the set of pages visited by each visitor and on constructing access paths. How to preprocess the REFERER and REQUEST in the access log is seldom addressed, even though this step is an important prerequisite for constructing access paths.
The HTTP Referer is part of the request header. When a browser sends a request to a web server, it generally includes the Referer, which tells the server which page the current request was linked from (that is, the previously visited page); the server can use this information for processing. The REQUEST is the request line (the URL of the requested page) that a client (usually a browser) sends when it issues a request to a web server.
Summary of the invention
Object of the invention: the present invention provides a method and device for preprocessing the original URLs recorded in a website access log, solving the problem of preprocessing the REFERER and REQUEST data recorded in the access log for path analysis.
A method for preprocessing URLs in a website access log comprises the following steps:
S11: Website URL collection, i.e. organizing and summarizing the website's URL address scheme. The main or important page URLs of the website are collected and their basic information is confirmed, including the URL recognition rule and the URL name. A URL recognition rule is the structural feature shared by the URLs of a given class of pages, obtained by analyzing and generalizing the original page URLs. URL recognition rules can be described using regular expressions;
S12: URL configuration and storage: the website URLs collected in S11 are configured and stored in a URL rule storage table. The table includes the following fields: URL unique code, URL recognition rule, URL name and URL matching order. The "URL unique code" marks the unique identity of each URL recognition rule and is generated automatically by the database; the "URL recognition rule" and "URL name" come from step S11; the "URL matching order" controls the order in which URL rules are matched;
The "URL matching order" is determined as follows: if B_URL is a substring of A_URL, then A_URL and B_URL are said to have a string inclusion relationship, where A_URL is the parent URL and B_URL is the child URL. The matching order then places A_URL before B_URL, i.e. the parent URL comes before the child URL;
S13: The information in the URL rule storage table obtained in S12 is retrieved and sorted by "URL matching order", ensuring that each parent URL comes before its child URL;
S14: Access log records are obtained, each including the visitor's IP, the access time, the REFERER (the previously visited page) and the REQUEST (the page being accessed);
S15: The REFERER and REQUEST of each access log record from S14 are each matched against the URL recognition rules in the URL rule storage table, in the order obtained in S13. If a match succeeds, the URL unique code of the matching recognition rule is recorded as the REFERER code or the REQUEST code. If a REFERER or REQUEST cannot be matched with any URL recognition rule, -1 or a null value is taken as the REFERER code or REQUEST code;
Preferably, the matching in step S15 of the REFERER and REQUEST of each access log record against the URL recognition rules, in the order obtained in S13, includes:
When a REFERER or REQUEST and a URL recognition rule in the URL rule storage table have a string inclusion relationship, i.e. the REFERER or REQUEST is a parent URL of the URL recognition rule, the REFERER or REQUEST is considered to match that rule successfully;
If a REFERER or REQUEST matches several URL recognition rules in the URL rule storage table, the rule ranked first in the S13 order is taken;
S16: The REFERER and REQUEST values that were not successfully matched in S15 are obtained, i.e. those from records whose REFERER code or REQUEST code is -1 or a null value. All unmatched REFERER and REQUEST values are merged to obtain the unmatched URL set;
S17: Statistical analysis is performed on the unmatched URL set from S16 to find the URLs that occur most often in it. These unmatched URLs are then added to the URL rule configuration list (optionally combined with manual judgment and monitoring), so that the URL recognition rules in the list are continuously improved.
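The parent-before-child constraint of S12 and S13 can be illustrated with a short sketch. The patent leaves the matching order to a configured field; sorting by descending length, as below, is only one assumed way to obtain an order that satisfies the constraint, because a proper substring is always shorter than the string that contains it. The function names are hypothetical.

```python
def order_parent_first(rules):
    """Sort rules so that a rule containing another rule as a substring comes first.

    Assumption (not from the source): descending length is used as the sort key.
    A proper substring is always shorter than its containing string, so every
    parent rule lands before its child rule.
    """
    return sorted(rules, key=len, reverse=True)


def violates_order(ordered):
    """Return True if some child rule appears before one of its parent rules."""
    return any(
        child != parent and child in parent
        for i, child in enumerate(ordered)
        for parent in ordered[i + 1:]
    )


# The "abcd" / "abc" pair: "abcd" is the parent URL, "abc" the child URL.
example = order_parent_first(["abc", "abcd"])
```

Rules that stand in no inclusion relationship to each other may end up in any relative order under this scheme, which the constraint permits.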
The present invention also provides a device for preprocessing URLs in an access log, comprising:
URL collection unit: collects the URLs of the website and determines the URL recognition rules and URL names. A URL recognition rule is the structural feature shared by the URLs of a given class of pages, obtained by analyzing and generalizing the original page URLs; URL recognition rules can be described using regular expressions;
URL rule storage unit: stores the URL recognition rules and related information, including the URL unique code, URL recognition rule, URL name and URL matching order. The "URL unique code" marks the unique identity of each URL recognition rule; the "URL recognition rule" and "URL name" come from the URL collection unit; the "URL matching order" controls the order in which URL rules are matched.
Preferably, the URL rule storage unit includes:
URL rule configuration module: determines the "URL unique code" and the "URL matching order". The "URL unique code" can be generated automatically by the database or assigned manually, as long as URL unique codes and URL recognition rules stay in a one-to-one relationship. The "URL matching order" is determined as follows: if B_URL is a substring of A_URL (for example, A_URL is "abcd" and B_URL is "abc"), then A_URL and B_URL are said to have a string inclusion relationship, where A_URL is the parent URL and B_URL is the child URL. The matching order then places A_URL before B_URL, i.e. the parent URL comes before the child URL.
URL rule storage module: stores the URL rule storage table, which holds the URL recognition rules and related information, including the URL unique code, URL recognition rule, URL name and URL matching order;
URL rule unit: sorts the URL recognition rules by URL matching order and retrieves the URL recognition rules and URL unique codes in that order;
Log record acquisition unit: obtains each record in the access log, including the visitor's IP, the access time, the REFERER (the previously visited page), the REQUEST (the page being accessed) and other information;
URL matching unit: matches the REFERER and REQUEST of each access log record against the URL recognition rules. A log record is taken, and its REFERER or REQUEST is matched against the URL recognition rules one by one, in the order in which the rules were retrieved. If the REFERER or REQUEST is a parent URL of some URL recognition rule, the match succeeds and the URL unique code of that rule is taken as the REFERER code or REQUEST code. If the REFERER or REQUEST has no string inclusion relationship with any URL recognition rule, the REFERER code or REQUEST code is given a special mark, such as "-1" or a null value. This completes the matching of the log record, and matching for this record ends. The next log record is then taken and matched in the same way, until all log records have been matched;
Matching result set storage unit: stores the results of matching the access log against the URL recognition rules, including the original information from the access log, such as IP, access time, REFERER and REQUEST, together with the REFERER code and REQUEST code described above;
The unmatched URL monitoring unit includes the following. Unmatched data acquisition unit: obtains from the matching result set the REFERER and REQUEST values that were not successfully matched and merges them into the unmatched URL set. Unmatched data statistics module: counts the number of records for each URL in the unmatched URL set. Unmatched data monitoring module: using these counts, sorts the URLs of the unmatched URL set in descending order of record count to obtain the list of unmatched URLs. Combined with actual business needs, it can then be decided whether to configure these URLs into the URL rule storage table. If configuration is needed, the process returns to the URL collection unit and runs through the flow described above, until every URL that needs to be analyzed has been added to the URL rule storage table.
The beneficial effects of the invention are as follows. The present invention provides a method for preprocessing the original URLs recorded in a website access log, which solves the problem of preprocessing the REFERER and REQUEST data recorded in the access log for path analysis:
1) The page URLs of the website are collected to form a website URL rule storage table, and the REFERER and REQUEST values recorded in the original access log are matched against the URL recognition rules in that table. Each REFERER and REQUEST is thereby coded and named, converting the original URL address format into codes and business names that are convenient for subsequent statistical analysis and application.
2) By monitoring and analyzing the unmatched URL set, the URL rule storage table can be continuously improved so that it gradually covers all pages of the website. This ensures that as many access log records as possible obtain a REFERER code and a REQUEST code, providing well-prepared preprocessed data for subsequent analysis based on the access log.
Brief description of the drawings
Fig. 1 is a flow chart of a method for preprocessing URLs in a website access log according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a device for preprocessing URLs in a website access log according to an embodiment of the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The described embodiments are only some of the embodiments of the invention, not all of them. Changes or equivalent variations made on the basis of the embodiments herein and the technical spirit of the claims still fall within the scope of protection of this application.
Referring to Fig. 1, the implementation steps of the application are as follows:
S11: Website URL collection, i.e. organizing and summarizing the website's URL address scheme. In the initial stage, website URLs can be collected manually: the main or important page URLs of the website are gathered by hand and their basic information is confirmed, including the URL recognition rule, the URL name and so on. A URL recognition rule is the structural feature shared by the URLs of a given class of pages, obtained by analyzing and generalizing the original page URLs.
For example, the URLs of the product search list page of Made-in-China.com all begin with "www.made-in-china.com/productdirectory.do", so the recognition rule of the product search list page is "www.made-in-china.com/productdirectory.do". URL recognition rules can also be described using regular expressions.
A product search list page URL has the following form:
www.made-in-china.com/productdirectory.do?word=led&subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1
Its characteristic is that it begins with "www.made-in-china.com/productdirectory.do", and the parameters that follow, such as "word", record information such as the search term used. Therefore, if a URL begins with "www.made-in-china.com/productdirectory.do", that URL is a product search list page.
The homepage URL of Made-in-China.com is: www.made-in-china.com.
The special-activities homepage URL of Made-in-China.com is: www.made-in-china.com/special.
The URL of a special-activities detail page of Made-in-China.com (for example, the magic-show special topic) is: www.made-in-china.com/special/magic-show/.
The URL recognition rules and URL names of the four pages above can then be, respectively:
"www.made-in-china.com/productdirectory.do", "product search list page";
"www.made-in-china.com$", "MIC homepage";
"www.made-in-china.com/special", "thematic homepage";
"www.made-in-china.com/special/", "thematic detail page".
Here, the "$" in the recognition rule of the MIC homepage is regular expression notation meaning that the string ends with the characters before the "$"; in this case it denotes all strings that end with "www.made-in-china.com";
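The effect of the trailing "$" can be checked with Python's `re` module. This is a sketch under two assumptions that the source does not state: the rule is applied with `re.search` (a match anywhere in the URL), and the dots are escaped so that "." matches only a literal dot.

```python
import re

# MIC homepage rule from S11; escaping the dots is an assumption.
HOME_RULE = r"www\.made-in-china\.com$"


def matches(rule: str, url: str) -> bool:
    """A URL matches when the rule is found somewhere inside it."""
    return re.search(rule, url) is not None


# The "$" anchors the rule to the end of the URL, so longer paths do not match.
home_hits = {
    url: matches(HOME_RULE, url)
    for url in (
        "www.made-in-china.com",
        "http://www.made-in-china.com",
        "www.made-in-china.com/special/magic-show/",
    )
}
```

Without the "$", the homepage rule would also be found inside every other page URL of the site, so the homepage could not be told apart from its subpages.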
S12: URL configuration and storage: the website URLs collected in S11 are configured and stored in a URL rule storage table. The table includes the following fields: URL unique code, URL recognition rule, URL name and URL matching order. The "URL unique code" marks the unique identity of each URL recognition rule and can be generated automatically by the database; the "URL recognition rule" and "URL name" come from step S11; the "URL matching order" controls the order in which URL rules are matched.
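A URL rule storage table with these four fields might be sketched with SQLite as follows. The table name `url_rules` and the column names are hypothetical, and the database-generated codes here start at 1, not at the code values used in the example of this description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE url_rules (
        url_code    INTEGER PRIMARY KEY AUTOINCREMENT,  -- URL unique code
        rule        TEXT NOT NULL UNIQUE,               -- URL recognition rule
        url_name    TEXT NOT NULL,                      -- URL name
        match_order TEXT NOT NULL                       -- URL matching order
    )
    """
)

# The four example pages; url_code is left to the database.
conn.executemany(
    "INSERT INTO url_rules (rule, url_name, match_order) VALUES (?, ?, ?)",
    [
        ("www.made-in-china.com/productdirectory.do", "product search list page", "product search list page"),
        ("www.made-in-china.com$", "MIC homepage", "MIC homepage"),
        ("www.made-in-china.com/special", "thematic homepage", "thematic page 2"),
        ("www.made-in-china.com/special/", "thematic detail page", "thematic page 1"),
    ],
)

# Read the rules back in ascending matching order, as S13 does.
rows = conn.execute(
    "SELECT url_code, rule, url_name, match_order FROM url_rules ORDER BY match_order"
).fetchall()
```

The `UNIQUE` constraint on `rule` is one way to keep codes and recognition rules in a one-to-one relationship.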
The "URL matching order" is determined as follows: if B_URL is a substring of A_URL (for example, A_URL is "abcd" and B_URL is "abc"), then A_URL and B_URL are said to have a string inclusion relationship, where A_URL is the parent URL and B_URL is the child URL. The matching order then places A_URL before B_URL, i.e. the parent URL comes before the child URL.
Specifically, suppose the first page URL of the website is being configured into the URL rule storage table, taking the product search list page of Made-in-China.com as an example. Its URL unique code, URL recognition rule, URL name and URL matching order are then, respectively:
"1001", "www.made-in-china.com/productdirectory.do", "product search list page", "product search list page". Note that at this point no other URL recognition rule has been configured in the URL rule storage table, so the value of "URL matching order" can be arbitrary; here the URL name is used as the value of the URL matching order.
The URL unique codes, URL recognition rules, URL names and URL matching orders of the four pages above are, respectively:
"1002", "www.made-in-china.com/productdirectory.do", "product search list page", "product search list page";
"1003", "www.made-in-china.com$", "MIC homepage", "MIC homepage";
"1004", "www.made-in-china.com/special", "thematic homepage", "thematic page 2";
"1005", "www.made-in-china.com/special/", "thematic detail page", "thematic page 1".
Here, the recognition rule of the thematic detail page (www.made-in-china.com/special/) is the parent URL of the recognition rule of the thematic homepage (www.made-in-china.com/special). The URL matching orders of the thematic detail page and the thematic homepage are therefore "thematic page 1" and "thematic page 2" respectively, which ensures that the thematic detail page comes before the thematic homepage when the rules are sorted by URL matching order in ascending order.
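The ascending sort on the matching-order field can be sketched as follows, using the four example rules. Plain string comparison of the English matching-order labels is an assumption made for illustration; the source does not specify how the field values are compared.

```python
# (URL unique code, URL recognition rule, URL name, URL matching order)
rules = [
    (1002, "www.made-in-china.com/productdirectory.do", "product search list page", "product search list page"),
    (1003, "www.made-in-china.com$", "MIC homepage", "MIC homepage"),
    (1004, "www.made-in-china.com/special", "thematic homepage", "thematic page 2"),
    (1005, "www.made-in-china.com/special/", "thematic detail page", "thematic page 1"),
]

# Sort ascending by the matching-order field (index 3).
ordered = sorted(rules, key=lambda rule: rule[3])
ordered_codes = [rule[0] for rule in ordered]
```

With these labels, the thematic detail page (1005) sorts ahead of the thematic homepage (1004), so the parent rule is tried first.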
S13: The information in the URL rule storage table obtained in S12 is retrieved and sorted by "URL matching order", ensuring that each parent URL comes before its child URL.
Specifically, the information in the URL rule storage table above is retrieved and sorted by "URL matching order" in ascending order, giving:
"1003", "www.made-in-china.com$", "MIC homepage", "MIC homepage";
"1002", "www.made-in-china.com/productdirectory.do", "product search list page", "product search list page";
"1005", "www.made-in-china.com/special/", "thematic detail page", "thematic page 1";
"1004", "www.made-in-china.com/special", "thematic homepage", "thematic page 2".
S14: Access log records are obtained, each including the visitor's IP, the access time, the REFERER (the previously visited page), the REQUEST (the page being accessed) and other information. Specifically, the records in the access log can take the following form:
192.168.1.1, 2015-01-01 12:01:00, www.made-in-china.com, www.google.com;
192.168.1.1, 2015-01-01 12:01:30, www.made-in-china.com/special/vacuum-pump/, www.made-in-china.com;
192.168.1.1, 2015-01-01 12:01:30, sourcing.made-in-china.com/suppliers.html, www.made-in-china.com/special/vacuum-pump/;
192.168.2.1, 2015-01-01 12:02:10, www.made-in-china.com, www.google.com;
192.168.2.1, 2015-01-01 12:03:10, http://www.made-in-china.com/productdirectory.do?word=led&subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1, www.made-in-china.com;
Here, "192.168.1.1" and "192.168.2.1" are the visitors' IP addresses. The time next to the IP address is the time at which the visitor accessed the page. The URL next to the access time is the URL of the page the visitor is currently accessing, i.e. the REQUEST, such as www.made-in-china.com in the first record. The URL after the current page URL is the URL of the page the visitor accessed previously, i.e. the REFERER, such as www.google.com in the first record. In other words, the visitor jumped from the previous page (REFERER) to the current page (REQUEST); in the first record, the visitor jumped from www.google.com to www.made-in-china.com.
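Reading one such record into named fields could look like the sketch below. The comma-separated layout with a trailing semicolon follows the sample records above; the `LogRecord` type and `parse_record` function are hypothetical, and the parser assumes that no field contains a comma.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    ip: str
    access_time: str
    request: str   # page currently being accessed
    referer: str   # previously visited page


def parse_record(line: str) -> LogRecord:
    """Split 'IP, time, REQUEST, REFERER;' into its four fields.

    Assumes no field itself contains a comma.
    """
    ip, access_time, request, referer = (
        part.strip() for part in line.strip().rstrip(";").split(",")
    )
    return LogRecord(ip, access_time, request, referer)


record = parse_record(
    "192.168.1.1, 2015-01-01 12:01:00, www.made-in-china.com, www.google.com;"
)
```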
S15: The REFERER and REQUEST of each access log record from S14 are each matched against the URL recognition rules in the URL rule storage table, in the order obtained in S13. If a match succeeds, the URL unique code of the matching recognition rule is recorded as the REFERER code or the REQUEST code. If a REFERER or REQUEST cannot be matched with any URL recognition rule, -1 or a null value is taken as the REFERER code or REQUEST code.
Preferably, in this application, the matching in S15 of the REFERER and REQUEST of each access log record against the URL recognition rules, in the order obtained in S13, includes:
When a REFERER or REQUEST and a URL recognition rule in the URL rule storage table have a string inclusion relationship, i.e. the REFERER or REQUEST is a parent URL of the URL recognition rule, the REFERER or REQUEST is considered to match that rule successfully.
If a REFERER or REQUEST matches several URL recognition rules in the URL rule storage table, the rule ranked first in the S13 order is taken.
Specifically, the log records listed in S14 are matched against the URL rule storage table from S13.
Take the first record:
192.168.1.1, 2015-01-01 12:01:00, www.made-in-china.com, www.google.com;
The REQUEST is www.made-in-china.com, which matches the MIC homepage in the URL rule storage table of S13, so the URL unique code "1003" of the MIC homepage is taken as the REQUEST code of this record. The REFERER is www.google.com, which matches none of the URL recognition rules in the URL rule storage table of S13, so "-1" is set as the REFERER code of this record.
Take the second record:
192.168.1.1, 2015-01-01 12:01:30, www.made-in-china.com/special/vacuum-pump/, www.made-in-china.com;
The REQUEST is www.made-in-china.com/special/vacuum-pump/, which matches both the thematic detail page and the thematic homepage in the URL rule storage table of S13. The recognition rule ranked first in the matching order is taken, so the URL unique code "1005" of the thematic detail page becomes the REQUEST code of this record. The REFERER is www.made-in-china.com, which matches the MIC homepage in the URL rule storage table of S13, so the REFERER code of this record is "1003".
The remaining records are processed in the same way until all log records have been matched. The final matching results of all records are as follows (IP, access time, REQUEST, REFERER, REQUEST code, REFERER code):
192.168.1.1, 2015-01-01 12:01:00, www.made-in-china.com, www.google.com, 1003, -1;
192.168.1.1, 2015-01-01 12:01:30, www.made-in-china.com/special/vacuum-pump/, www.made-in-china.com, 1005, 1003;
192.168.1.1, 2015-01-01 12:01:30, sourcing.made-in-china.com/suppliers.html, www.made-in-china.com/special/vacuum-pump/, -1, 1005;
192.168.2.1, 2015-01-01 12:02:10, www.made-in-china.com, www.google.com, 1003, -1;
192.168.2.1, 2015-01-01 12:03:10, http://www.made-in-china.com/productdirectory.do?word=led&subaction=hunt&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1, www.made-in-china.com, 1002, 1003;
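The first-match-wins matching of S15 can be sketched as below. This is an illustration under stated assumptions, not the patent's implementation: the rules are treated as regular expressions applied with `re.search`, their dots are escaped, the `?` in the search URL is assumed, and the names `ORDERED_RULES` and `encode_url` are hypothetical.

```python
import re

# (URL unique code, recognition rule as a regex), already in the S13 order.
ORDERED_RULES = [
    (1003, r"www\.made-in-china\.com$"),
    (1002, r"www\.made-in-china\.com/productdirectory\.do"),
    (1005, r"www\.made-in-china\.com/special/"),
    (1004, r"www\.made-in-china\.com/special"),
]
UNMATCHED = -1


def encode_url(url: str) -> int:
    """Return the code of the first rule found in the URL, or -1 if none match."""
    for code, pattern in ORDERED_RULES:
        if re.search(pattern, url):
            return code
    return UNMATCHED


SEARCH_URL = (
    "http://www.made-in-china.com/productdirectory.do?word=led&subaction=hunt"
    "&style=b&mode=and&code=0&comProvince=nolimit&order=0&isOpenCorrection=1"
)

# (REQUEST, REFERER) pairs of the five sample records from S14.
pairs = [
    ("www.made-in-china.com", "www.google.com"),
    ("www.made-in-china.com/special/vacuum-pump/", "www.made-in-china.com"),
    ("sourcing.made-in-china.com/suppliers.html", "www.made-in-china.com/special/vacuum-pump/"),
    ("www.made-in-china.com", "www.google.com"),
    (SEARCH_URL, "www.made-in-china.com"),
]

# (REQUEST code, REFERER code) for each record.
encoded = [(encode_url(request), encode_url(referer)) for request, referer in pairs]
```

Because the loop returns on the first hit, a URL that matches both the thematic detail page rule and the thematic homepage rule receives the code of whichever rule is ordered first.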
S16: The REFERER and REQUEST values that were not successfully matched in S15 are obtained, i.e. those from records whose REFERER code or REQUEST code is -1 or a null value. All unmatched REFERER and REQUEST values are merged to obtain the unmatched URL set.
Specifically, the REFERER values that were not successfully matched in S15 are:
www.google.com;
www.google.com.
The REQUEST value that was not successfully matched is:
sourcing.made-in-china.com/suppliers.html.
Merging them gives the unmatched URL set:
www.google.com;
www.google.com;
sourcing.made-in-china.com/suppliers.html.
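Collecting the unmatched URL set from the matching results might be sketched as follows. The tuple layout and the function name `collect_unmatched` are hypothetical; both -1 and `None` (standing in for a null value) are treated as "unmatched".

```python
UNMATCHED_CODES = (-1, None)

# (REQUEST, REFERER, REQUEST code, REFERER code); query string of the last REQUEST shortened.
results = [
    ("www.made-in-china.com", "www.google.com", 1003, -1),
    ("www.made-in-china.com/special/vacuum-pump/", "www.made-in-china.com", 1005, 1003),
    ("sourcing.made-in-china.com/suppliers.html", "www.made-in-china.com/special/vacuum-pump/", -1, 1005),
    ("www.made-in-china.com", "www.google.com", 1003, -1),
    ("http://www.made-in-china.com/productdirectory.do?word=led", "www.made-in-china.com", 1002, 1003),
]


def collect_unmatched(rows):
    """Merge unmatched REFERER values and unmatched REQUEST values into one list."""
    referers = [referer for _, referer, _, code in rows if code in UNMATCHED_CODES]
    requests = [request for request, _, code, _ in rows if code in UNMATCHED_CODES]
    return referers + requests


unmatched_urls = collect_unmatched(results)
```

Duplicates are kept on purpose, since the later statistics step counts how often each URL occurs.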
S17: Statistical analysis is performed on the unmatched URL set from S16 to find the URLs that occur most often in it. Combined with manual judgment and monitoring, these unmatched URLs can be added to the URL rule configuration list, so that the URL recognition rules in the list are continuously improved.
Specifically, the unmatched URLs in the unmatched URL set of S16 are www.google.com and sourcing.made-in-china.com/suppliers.html. Of these, www.google.com is a major search engine and the entry point for most external visitors, so it is an access source that most websites should pay attention to. Therefore, www.google.com can also be collected and configured into the URL rule storage table, after which S11 to S13 are repeated.
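The statistics step can be sketched with `collections.Counter`, which counts each unmatched URL and lists the URLs in descending order of frequency. The variable names are hypothetical; the decision about which URLs to add to the rule table remains a manual or business judgment.

```python
from collections import Counter

# Unmatched URL set from S16.
unmatched_urls = [
    "www.google.com",
    "www.google.com",
    "sourcing.made-in-china.com/suppliers.html",
]

# Record count per URL, most frequent first.
ranked = Counter(unmatched_urls).most_common()
top_url, top_count = ranked[0]
```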
The present invention also provides a device for preprocessing URLs in an access log, whose functional modules correspond to the method above:
URL collection unit: collects the URLs of the website and determines the URL recognition rules, URL names and so on. A URL recognition rule is the structural feature shared by the URLs of a given class of pages, obtained by analyzing and generalizing the original page URLs. For example, the URLs of the product search list page of Made-in-China.com all begin with "www.made-in-china.com/productdirectory.do", so the recognition rule of the product search list page is "www.made-in-china.com/productdirectory.do". URL recognition rules can also be described using regular expressions.
URL rule storage unit: stores the URL recognition rules and related information, including the URL unique code, URL recognition rule, URL name and URL matching order. The "URL unique code" marks the unique identity of each URL recognition rule; the "URL recognition rule" and "URL name" come from the URL collection unit; the "URL matching order" controls the order in which URL rules are matched.
Log record acquisition unit: obtains each record in the access log, including the visitor's IP, the access time, the REFERER (the previously visited page), the REQUEST (the page being accessed) and other information.
URL matching unit: matches the REFERER and REQUEST of each access log record against the URL recognition rules. A log record is taken, and its REFERER or REQUEST is matched against the URL recognition rules one by one, in the order in which the rules were retrieved. If the REFERER or REQUEST is a parent URL of some URL recognition rule, the match succeeds and the URL unique code of that rule is taken as the REFERER code or REQUEST code. If the REFERER or REQUEST has no string inclusion relationship with any URL recognition rule, the REFERER code or REQUEST code is given a special mark, such as "-1" or a null value. This completes the matching of the log record, and matching for this record ends. The next log record is then taken and matched in the same way, until all log records have been matched.
Matching result set storage unit: stores the results of matching the access log against the URL recognition rules, including the original information from the access log, such as IP, access time, REFERER and REQUEST, together with the REFERER code and REQUEST code described above.
URL monitoring unit is not matched:For being divided the REFERER and REQUEST that the match is successful in access log
Analysis, so as to the whole pages for improving URL recognition rules to cover website, reaches the purpose of gradual perfection and optimization.
The unmatched URL monitoring unit includes:
Unmatched data acquisition unit: obtains the REFERER and REQUEST values in the matching result set that were not successfully matched, and merges them into an unmatched URL set.
Unmatched data statistics module: counts the number of records for each URL in the unmatched URL set.
Unmatched data monitoring module: sorts the URLs in the unmatched URL set in descending order of record count, yielding the collection of unmatched URLs. In light of actual business needs, it can then be decided whether to configure these URLs in the URL rule storage table; if so, the process returns to the URL collection unit and is executed again as described above, until every URL that needs to be analyzed has been added to the URL rule storage table.
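The three parts of the unmatched URL monitoring unit can be illustrated with a short Python sketch. The function names, the dictionary keys, and the choice of -1 and None as the unmatched marks are assumptions for illustration only; the ranking uses the standard library's collections.Counter.

```python
from collections import Counter

UNMATCHED_CODES = (-1, None)  # assumed marks for "no rule matched"


def collect_unmatched(results: list) -> list:
    """Merge every unmatched REFERER and REQUEST into one list of URLs."""
    urls = []
    for record in results:
        if record["referer_code"] in UNMATCHED_CODES and record.get("referer"):
            urls.append(record["referer"])
        if record["request_code"] in UNMATCHED_CODES and record.get("request"):
            urls.append(record["request"])
    return urls


def rank_unmatched(results: list) -> list:
    """Count records per unmatched URL and sort by count, largest first."""
    counts = Counter(collect_unmatched(results))
    return counts.most_common()
```

The URLs at the top of the ranking are the natural candidates to review against business needs and, where appropriate, to add to the URL rule storage table.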
The method and system provided by the present invention have been described in detail above, but these descriptions shall not be construed as limiting the scope of the present invention. The scope of protection of the present invention is defined by the appended claims, and any modification made on the basis of the claims of the present invention falls within the scope of protection of the present invention.
Claims (3)
1. A preprocessing method for URLs in a website access log, characterized in that it comprises the following steps:
S11: collecting website URLs, i.e., sorting out and summarizing the URL address system of the website; collecting the URLs of the main or important pages of the website and confirming the basic information of these URLs, including the URL recognition rule and the URL name; wherein a URL recognition rule is the compositional feature of the URLs of a class of pages, derived by analyzing and summarizing the URLs of the original web pages; a URL recognition rule may be described using a regular expression;
S12: configuring and storing URLs, i.e., configuring the website URLs collected in S11 and storing them in a URL rule storage table; the URL rule storage table includes the following fields: URL unique code, URL recognition rule, URL name, and URL matching order; wherein the "URL unique code" marks the unique identity of each URL recognition rule and is generated automatically by the database; the "URL recognition rule" and "URL name" come from step S11; the "URL matching order" controls the order in which URLs are matched;
the "URL matching order" is determined as follows: assuming that B_URL is a substring of A_URL, A_URL and B_URL are said to have a string-inclusion relationship, where A_URL is the parent URL (mother URL) and B_URL is the child URL (sub-URL); the matching order of A_URL and B_URL is then A_URL first and B_URL second, i.e., the parent URL is placed before the child URL;
S13: taking out the information in the URL rule storage table obtained in S12 and sorting it according to the "URL matching order", ensuring that a parent URL is placed before its child URL;
S14: obtaining the records of the access log, each including the visitor's IP, the access time, the previously visited page (REFERER), and the visited page (REQUEST);
S15: matching the REFERER and the REQUEST in each access log record from S14 against the URL recognition rules in the URL rule storage table obtained in S13, in the order obtained in S13; if a match succeeds, recording the URL unique code corresponding to the matched URL recognition rule as the REFERER code or the REQUEST code; if the REFERER or REQUEST cannot be matched with any URL recognition rule, taking -1 or a null value as the REFERER code or the REQUEST code;
S16: obtaining the REFERER and REQUEST values that were not successfully matched in S15, i.e., the records whose REFERER code or REQUEST code is -1 or a null value, and merging all unmatched REFERER and REQUEST values to obtain an unmatched URL set;
S17: performing statistical analysis on the unmatched URL set from S16 to obtain the URLs with the largest record counts in the unmatched URL set, and adding the unmatched URLs to the URL rule configuration list, thereby continuously improving the URL recognition rules in the URL rule configuration list.
2. The preprocessing method for URLs in a website access log according to claim 1, characterized in that step S15, in which the REFERER and the REQUEST in each access log record from step S14 are matched against the URL recognition rules in the URL rule storage table obtained in S13 in the order of S13, comprises:
when the REFERER or REQUEST and a URL recognition rule in the URL rule storage table have a string-inclusion relationship, i.e., the REFERER or REQUEST is a parent URL of that URL recognition rule, the REFERER or REQUEST is deemed successfully matched with that URL recognition rule;
if the REFERER or REQUEST can be successfully matched with multiple URL recognition rules in the URL rule storage table, the URL recognition rule ranked first according to the order of S13 is taken.
3. A preprocessing system for URLs in a website access log using the method according to claim 1, characterized in that it comprises:
a URL collection unit: for collecting the URLs of the website and determining the URL recognition rule and URL name; wherein a URL recognition rule is the compositional feature of the URLs of a class of pages, derived by analyzing and summarizing the URLs of the original web pages, and may be described using a regular expression;
a URL rule storage unit: for storing the recognition rules of URLs and related information, including the URL unique code, URL recognition rule, URL name, and URL matching order; wherein the "URL unique code" marks the unique identity of each URL recognition rule; the "URL recognition rule" and "URL name" come from the URL collection unit; the "URL matching order" controls the order in which URLs are matched;
a log record acquisition unit: for obtaining each record in the access log, including the visitor's IP, the access time, and the REFERER and REQUEST information;
a URL matching unit: for matching the REFERER and REQUEST of each record in the access log against the URL recognition rules; one log record is taken out, and its REFERER or REQUEST is matched against the URL recognition rules one by one in the order in which the rules were obtained; if the REFERER or REQUEST is a parent URL of a URL recognition rule, the match succeeds and the URL unique code of that rule is taken as the REFERER code or REQUEST code; if the REFERER or REQUEST has no string-inclusion relationship with any URL recognition rule, the REFERER code or REQUEST code is given a special mark, for example "-1" or a null value; this completes the matching of the current log record, and the matching loop for that record exits; the next log record is then taken and matched in the same way, until all log records have been matched;
a matching result set storage unit: for storing the results of matching the access log against the URL recognition rules, including the raw information from the access log (IP, access time, REFERER, REQUEST) together with the REFERER code and REQUEST code described above;
an unmatched URL monitoring unit, including: an unmatched data acquisition unit, for obtaining the REFERER and REQUEST values in the matching result set that were not successfully matched and merging them into an unmatched URL set; an unmatched data statistics module, for counting the number of records for each URL in the unmatched URL set; and an unmatched data monitoring module, for sorting the URLs in the unmatched URL set in descending order of record count, yielding the collection of unmatched URLs; in light of actual business needs, it can then be decided whether to configure these URLs in the URL rule storage table; if so, the process returns to the URL collection unit and is executed again as described above, until every URL that needs to be analyzed has been added to the URL rule storage table;
the URL rule storage unit includes:
a URL rule configuration module: for determining the "URL unique code" and the "URL matching order"; wherein the "URL unique code" is generated automatically by the database or generated manually, provided that URL unique codes and URL recognition rules are in a one-to-one relationship; the "URL matching order" is determined as follows: assuming that B_URL is a substring of A_URL, A_URL and B_URL are said to have a string-inclusion relationship, where A_URL is the parent URL (mother URL) and B_URL is the child URL (sub-URL); the matching order of A_URL and B_URL is then A_URL first and B_URL second, i.e., the parent URL is placed before the child URL;
a URL rule storage module: for storing the URL rule storage table, which holds the URL recognition rules and related information, including the URL unique code, URL recognition rule, URL name, and URL matching order;
a URL rule unit: for sorting the URL recognition rules according to the URL matching order and obtaining the URL recognition rules and URL unique codes in that order.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510383588.8A CN105005600B (en) | 2015-07-02 | 2015-07-02 | Preprocessing method of URL (Uniform Resource Locator) in access log |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105005600A (en) | 2015-10-28 |
CN105005600B (en) | 2017-05-24 |
Family
ID=54378276
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510383588.8A (Active, CN105005600B (en)) | Preprocessing method of URL (Uniform Resource Locator) in access log | 2015-07-02 | 2015-07-02 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105005600B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106445700A (en) * | 2016-09-20 | 2017-02-22 | 杭州华三通信技术有限公司 | Method and device for uniform resource locator (URL) matching |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105763633B (en) * | 2016-04-14 | 2019-05-21 | 上海牙木通讯技术有限公司 | A kind of correlating method of domain name and website visiting behavior |
CN107404392A (en) * | 2016-05-20 | 2017-11-28 | 中兴通讯股份有限公司 | The processing method and processing device of the scheduling rule of uniform resource position mark URL |
CN106330563B (en) * | 2016-08-30 | 2019-09-17 | 北京神州绿盟信息安全科技股份有限公司 | A kind of method and device of determining Intranet http communication stream service type |
CN106445815B (en) * | 2016-09-06 | 2019-04-23 | 优酷网络技术(北京)有限公司 | A kind of automated testing method and device |
CN107317892B (en) * | 2017-06-30 | 2020-08-07 | 北京知道创宇信息技术股份有限公司 | Network address processing method, computing device and readable storage medium |
CN107330090A (en) * | 2017-07-04 | 2017-11-07 | 北京锐安科技有限公司 | A kind of information processing method and device |
CN109995889B (en) * | 2018-01-02 | 2022-02-25 | 中国移动通信有限公司研究院 | Method and device for updating mapping relation table, gateway equipment and storage medium |
CN109242528A (en) * | 2018-07-26 | 2019-01-18 | 焦点科技股份有限公司 | A kind of the funnel analysis method and device in the customized path of electric business platform |
CN111162956B (en) * | 2018-11-08 | 2021-07-30 | 优信数享(北京)信息技术有限公司 | Log recording method and device |
CN111368227B (en) * | 2018-12-25 | 2023-06-27 | 阿里巴巴集团控股有限公司 | URL processing method and device |
CN115577197B (en) * | 2022-12-07 | 2023-10-27 | 杭州城市大数据运营有限公司 | Component discovery method, system and device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103297435A (en) * | 2013-06-06 | 2013-09-11 | 中国科学院信息工程研究所 | Abnormal access behavior detection method and system on basis of WEB logs |
CN103377260A (en) * | 2012-04-28 | 2013-10-30 | 阿里巴巴集团控股有限公司 | Analysis method and device of URLs (Uniform Resource Locator) of weblog |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080209030A1 (en) * | 2007-02-28 | 2008-08-28 | Microsoft Corporation | Mining Web Logs to Debug Wide-Area Connectivity Problems |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106445700A (en) * | 2016-09-20 | 2017-02-22 | 杭州华三通信技术有限公司 | Method and device for uniform resource locator (URL) matching |
CN106445700B (en) * | 2016-09-20 | 2019-11-12 | 新华三技术有限公司 | A kind of URL matching process and device |
Also Published As
Publication number | Publication date |
---|---|
CN105005600A (en) | 2015-10-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||