CN1791022A - Log analyzing method and system - Google Patents

Publication number
CN1791022A
Authority
CN
China
Prior art keywords: url, word, database, antistop list, cutting
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200510132486
Other languages
Chinese (zh)
Other versions
CN100394727C (en
Inventor
李江华
姜兴
李昂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZHEJIANG INTIME E-COMMERCE Co.,Ltd.
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CNB2005101324865A
Publication of CN1791022A
Application granted
Publication of CN100394727C
Legal status: Active

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a log analysis method comprising: presetting a keyword table; cutting each URL at its separators; and judging whether the URL contains a word that is not in the keyword table. If it does not, the URL is stored in a URL table together with the user access information; otherwise the word is replaced with a unified symbol before the URL is stored in the URL table, and the word itself is stored with the user access information. Data are then obtained according to the statistical conditions. The invention allows channels or pages to be added to the statistics flexibly, with little effect on analysis time, and improves log analysis speed.

Description

Log analysis method and system
Technical field
The present invention relates to a method and system for analyzing Internet information data, and in particular to a method and system for analyzing the log files of a web server.
Background art
With the development of Internet information services, many government departments, companies, universities and research institutions have built their own websites. Behind each website runs a web server, a piece of software that manages web pages and serves them to client browsers over a local network or the Internet. Commonly used web servers include Apache, IIS and the iPlanet Enterprise server. Managing a website requires attention not only to daily server throughput but also to how each page is visited: improving the content and quality of pages according to how often each is clicked, making the content more readable, tracing the steps of business transactions, managing the "back-end" data of the website, and so on.
This applies especially to network companies whose business is e-commerce or search engines. They need a detailed and thorough analysis of the operation and access patterns of their web servers in order to understand how the website is running, find its shortcomings and promote its development, and these requirements can be met by statistics and analysis of the web server's log files. Common log analysis tools include WebTrends, Wusage, wwwstat, http-analyze, pwebstats, WebStat Explorer, webalizer and AWStats. Analyzing log files means mining unknown, valuable patterns or rules from massive data; it is a complex process that serves decision-making.
To report page views for a given channel, existing log analysis tools generally compute statistics according to preconfigured rules. In practice, statistics on a channel or page often need to be added or changed, and the corresponding URL rule must then be added to the log analysis system before the desired result can be obtained. That is, the URLs to be counted are added to a URL list according to the rules, and every log record is then matched against the URLs to be counted. This approach requires frequent additions and changes to the URL list, and each added URL rule must be matched against every log record, so the flexibility and scalability of the log analysis system are poor. A URL (Uniform Resource Locator) is the notation used by Internet World Wide Web services to specify the location of information. A channel is a category of content on a website.
If a newly added URL rule also needs information from historical log data, the historical logs must be analyzed again; otherwise the historical access data for that URL cannot be obtained. Because the volume of historical log data is very large and matching must be repeated for every added rule, log analysis is slow.
When the prior art computes statistics for several URL rules, every log record must be matched against all of the rules. For example, with 10 URL rules and 10 million log records (a normal volume for an e-commerce or search company; the Alibaba China website, for example, produces more than 40 million log records per day), the computer must perform 10 million × 10 = 100 million matching operations. Each additional URL rule adds another 10 million matching operations, and the analysis speed of existing tools cannot keep up with the growing demand for statistics.
In practice, more and more enterprises want to analyze the sequence of pages that users visit on a website, so as to understand user behavior better and use it as basic data for business decisions. General log analysis systems, however, do not support path analysis. Those that do can only perform a rough analysis, such as one-step path analysis, and cannot follow a given visitor's behavior over the long term. For example, a homepage path analysis typically yields only the share of all homepage clicks that lead from the homepage to a given page.
Moreover, for a website with on the order of a million distinct URLs per day, storing the visited URLs directly in the log analysis system for path analysis would require an enormous amount of storage. If instead every distinct URL were numbered and only the ID stored, the URL dimension would be too large to be practical, since distinct URLs number in the millions every day. In other words, a general log analysis system cannot perform path analysis over all URLs for such a website.
Summary of the invention
In view of the above problems, the object of the present invention is to provide a log analysis method and system in which channels or pages to be counted can be added flexibly with very little effect on processing time; historical log data can be analyzed conveniently; the paths of client visits to a website can be analyzed completely and over the long term; and analysis speed is improved.
To solve the above technical problems, the object of the invention is achieved through the following technical solutions:
The invention discloses a log analysis method comprising the following steps: presetting a keyword table; cutting the uniform resource locator (URL) of a log record at its separators; judging whether the URL contains a word that does not exist in the keyword table; if it does not, storing the URL in a URL dimension table and saving the storage address of the URL in the URL dimension table to a user access database; if it does, replacing the word with a unified symbol before storing the URL in the URL dimension table, and saving both the word and the storage address of the URL in the URL dimension table to the user access database; and obtaining the relevant data according to the statistical conditions.
Preferably, the log analysis method may further comprise saving the path parameter of the user's visit to the website to the user access database.
Preferably, the log analysis method may further comprise: replacing the dynamic identifiers of the URLs in the website with a unified symbol; cutting the normalized URLs at their separators; and storing the cutting results to form the keyword table. Preferably, a cutting result may comprise a word and its level.
Preferably, the log analysis method may further comprise: cutting a URL to be counted at its separators and, if a word so obtained does not exist at the corresponding level of the keyword table, storing that word in the keyword table.
Preferably, the log analysis method may further comprise: if an identical URL already exists in the URL dimension table, keeping the existing URL; otherwise storing the URL in the URL dimension table.
Preferably, the step of obtaining the relevant data according to the statistical conditions may further comprise: cutting the URL to be counted at its separators; matching it against the URLs stored in the URL dimension table to obtain the corresponding storage address; if the URL to be counted contains a word that does not exist at the corresponding level of the keyword table, replacing that word with the unified symbol before matching; and obtaining the relevant data according to the storage address and the words of the URL that do not exist at the corresponding level of the keyword table.
Preferably, the step of obtaining the relevant data according to the statistical conditions may further comprise: storing user identity information in the user access database; and obtaining the relevant data according to the identity information of the users to be counted.
The invention also provides a log analysis system comprising: a keyword database for storing keywords and their levels; a URL database for storing the URLs in log records, in which any word of a URL that does not exist at the corresponding level of the keyword table is replaced with a unified symbol; a user access database for storing the address of each URL in the URL database, the words of the URL that do not exist at the corresponding level of the keyword table, and the identity information of the visiting user; a cutting module for cutting URLs according to fixed rules; and a statistical analysis module for obtaining the relevant data according to the statistical conditions. Preferably, the user access database also holds the path parameter of the pages a user visits.
As the above technical solutions show, the present invention has the following advantages:
During statistical analysis, the invention cuts the URL to be counted, generates a character string according to the keyword table, and matches it against the strings stored in the URL dimension table to obtain the corresponding storage address (URL ID). The relevant data are then retrieved from the database according to the URL ID and the actual value of the dynamic part of the URL to be counted. The dynamic part of a URL is the identifier generated dynamically in the URL when the website displays a dynamic page. A website based on dynamic pages may have on the order of a million distinct URLs per day, but many of them contain dynamic parts; once these parts are removed or replaced with a unified symbol, fewer than 5,000 distinct URLs remain. The strings stored in the URL dimension table are exactly these URLs without their dynamic parts, so the table has few records and matching is fast.
Because the invention provides a keyword table and cuts URLs into keywords and levels according to fixed rules, the dimension that must be queried during URL analysis is reduced further, and a URL to be analyzed can be matched exactly after cutting, which improves analysis speed. For example, the Alibaba China website has about a million distinct URLs per day; ignoring the dynamic parts leaves fewer than 5,000 distinct URLs, and cutting the URLs at their separators yields fewer than 2,000 keywords. The keyword table also enables the computer to identify the dynamic part of a URL automatically.
Because the invention cuts every log record, saves the result in the URL dimension table and the database, and then simply retrieves the relevant data from the database, it replaces the prior-art approach of matching every log record against an expression for each URL to be counted, and so improves analysis speed. For example, with 10 million log records and 10 URLs to be counted, the prior art needs 10 million × 10 = 100 million matches. The invention only cuts each record once, 10 million cuts in total, saves the results, and then retrieves the visit count for each of the 10 URLs to be counted: 10 million cut-and-match operations plus 10 retrieval operations. Adding one more URL to be counted adds only one retrieval operation, whereas the prior art would add 10 million × 1 = 10 million matches. The invention therefore performs fewer operations and analyzes faster, and adding a URL to be counted has very little effect on processing time.
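The operation counts in this comparison can be checked with a few lines of arithmetic. This is only a sketch of the figures quoted in the example (10 million records, 10 URLs to be counted); the variable names are illustrative and not part of the patent.

```python
# Figures taken from the example: 10 million log records, 10 URLs to be counted.
records = 10_000_000
rules = 10

# Prior art: every log record is matched against every URL rule.
prior_art_matches = records * rules

# Described method: one cutting pass per record, then one retrieval per URL.
method_operations = records + rules

# Extra work caused by adding one more URL to be counted.
prior_art_extra = records * 1
method_extra = 1

print(prior_art_matches, method_operations)  # 100000000 10000010
print(prior_art_extra, method_extra)         # 10000000 1
```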
Because the database stores the dynamic parts of the URLs of all log records, the value of the dynamic part need not be considered when a URL to be counted is added. It is only necessary to check whether the keywords in the URL exist in the keyword table; if they do, no further operation is required. For example, if the list of URLs to be counted already contains http://www.kk.com/category/1.html and http://www.kk.com/category/2.html is now to be added, no operation is needed and the analysis can proceed directly. Adding or changing the URLs to be counted is therefore more convenient and flexible.
Because the database holds all URLs that users have visited on the website, the relevant historical data can be obtained directly from the dynamic information stored in the database, without re-analyzing historical data for each newly added URL to be counted. This improves the speed and efficiency of analyzing historical data.
The invention stores in the database only the storage address of the URL in the URL dimension table and the value of the URL's dynamic part, instead of storing the whole URL for path analysis as in the prior art, which reduces the storage the database requires. The database can also hold the path parameter of the pages a user visits, so path analysis can be performed over all URLs and used as reference data for business decisions.
Brief description of the drawings
The present invention is described in further detail below with reference to the drawing and specific embodiments.
Fig. 1 is a detailed flowchart of the log analysis method of the present invention.
Detailed description
The core idea of the invention is as follows: only the URL ID and the data of the dynamic part are stored in the user access database, which reduces storage, makes path analysis possible and makes it easy to add statistical conditions; and a URL to be analyzed is cut according to the keywords and then matched exactly, which improves analysis speed.
Fig. 1 shows the detailed flowchart of the log analysis method of the present invention.
Step s1: replace the dynamic identifiers of the URLs in the website with a unified symbol.
Some websites offer content that is updated online and therefore use dynamic pages. A dynamic page is built with a page scripting language such as PHP, ASP or ASP.NET: scripts store the site content in a database, and pages are generated dynamically from the database when a user visits the site. The website itself consists mainly of page frameworks, while most of the page content is stored in the database. To tell different items of information apart, URLs must be generated dynamically. For example, the purchase requests on the Alibaba China website must show members' newly posted requests in real time. Each request is distinguished by an identifier generated when the member posts it, and when the requests are displayed that identifier appears in the URL, as in http://detail.china.alibaba.com/buyer/offerdetail/31056206.html, where "31056206" is the dynamically generated identifier of the purchase request.
Analysis shows that for a website based on dynamic pages, where URLs are generated dynamically, at about 35 million page views (Page View) the number of distinct URLs per day reaches 0.8 to 1.2 million, but fewer than 5,000 remain once the dynamic identifiers are removed. For example:
http://www.kk.com/category/12345.html
http://www.kk.com/category/12342.html
http://www.kk.com/category/12341.html
With the dynamic identifiers removed, these 3 different URLs correspond to a single URL:
http://www.kk.com/category/*.html
The normalization can proceed as follows: the dynamic identifier in a URL is replaced with *. For example,
http://www.kk.com/category/12345.html is normalized according to this rule into:
http://www.kk.com/category/*.html
Other symbols may also be used to replace the dynamic identifiers in URLs.
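The normalization in step s1 can be sketched as a regular-expression substitution. The text does not say how dynamic identifiers are recognized at this stage, so the sketch assumes, purely for illustration, that a dynamic identifier is a run of digits that fills a whole path segment or file stem; `normalize_url` is a hypothetical helper name.

```python
import re

# Assumption: a dynamic identifier is a digit-only path segment or file stem,
# i.e. digits that follow "/" and are followed by ".", "/" or the end of the URL.
_DYNAMIC = re.compile(r"(?<=/)\d+(?=[./]|$)")

def normalize_url(url: str, symbol: str = "*") -> str:
    """Replace each assumed dynamic identifier with the unified symbol."""
    return _DYNAMIC.sub(symbol, url)

urls = [
    "http://www.kk.com/category/12345.html",
    "http://www.kk.com/category/12342.html",
    "http://www.kk.com/category/12341.html",
]
distinct = {normalize_url(u) for u in urls}
print(distinct)  # {'http://www.kk.com/category/*.html'}
```

The three example URLs collapse to one normalized URL, which is the reduction in distinct URLs that the step relies on.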
Step s2: cut the normalized URLs at their separators to obtain keywords and their levels. A separator is a symbol used in a URL to separate variables or parameters, for example "/" and "."; the characters between two separators form a keyword. The level describes the position of the keyword in the URL and can be defined freely. Typically, the part between "http://" and the first "/" is level 1, and the level increases by 1 at each subsequent "/". Alternatively, each span between two separators can be treated as its own level.
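A minimal sketch of this cutting rule follows, using the level convention described above (the host is level 1, and each further "/" starts the next level). The function name `cut_url` is hypothetical, and the sketch stores bare words without the surrounding separators, which is a simplification relative to the keyword examples shown in the patent's tables.

```python
import re

def cut_url(url: str) -> list[tuple[int, str]]:
    """Cut a URL into (level, word) pairs using "/" and "." as separators."""
    # Drop the scheme (if any) so that level 1 is the host part.
    rest = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url).lstrip("/")
    pairs = []
    for level, segment in enumerate(rest.split("/"), start=1):
        for word in segment.split("."):
            if word:  # skip empty pieces such as the one after a trailing "/"
                pairs.append((level, word))
    return pairs

print(cut_url("http://www.kk.com/category/*.html"))
# [(1, 'www'), (1, 'kk'), (1, 'com'), (2, 'category'), (3, '*'), (3, 'html')]
```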
Alternatively, keywords and levels can be obtained from the website's URLs by first cutting the URLs at their separators to obtain words and levels (including the words produced by cutting the dynamic identifiers), and then screening the words by how often they occur in URLs. Dynamic identifiers are filtered out because their frequency of occurrence is of a different order of magnitude, which leaves the keywords and their levels.
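The frequency-based screening can be sketched with a counter over (level, word) pairs. The threshold `min_count` is an assumption made for this illustration; the text only says that dynamic identifiers occur at a different order of magnitude and does not give a cut-off.

```python
from collections import Counter

def screen_keywords(cut_urls, min_count):
    """Keep the (level, word) pairs that occur in at least min_count URLs."""
    # set(pairs) makes each URL contribute at most once per pair.
    counts = Counter(pair for pairs in cut_urls for pair in set(pairs))
    return {pair for pair, n in counts.items() if n >= min_count}

# Three URLs already cut into (level, word) pairs, dynamic identifiers included.
common = [(1, "www"), (1, "kk"), (1, "com"), (2, "category"), (3, "html")]
cut_urls = [common + [(3, ident)] for ident in ("12345", "12342", "12341")]

keywords = screen_keywords(cut_urls, min_count=2)
print(sorted(keywords))
# [(1, 'com'), (1, 'kk'), (1, 'www'), (2, 'category'), (3, 'html')]
```

Each identifier appears in only one URL and is dropped, while the words shared by all three URLs survive as keywords.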
Keywords and levels can also be derived directly by manual analysis of the website's URLs. When the website has few URLs, manual analysis is more practical.
Step s3: store the keywords and their levels in the keyword table. The keyword table only needs to be generated when the system is initialized and is adjusted afterwards as required; it does not have to be regenerated for each day's log analysis. For example, http://www.kk.com/category/*.html can be cut into:
Level    Keyword
1        /www.
1        .kk.
1        .com/
2        /category/
3        .html
Table 1
Each keyword and level obtained by cutting is checked: if the same keyword already exists at the corresponding level of the keyword table, the existing entry is kept; otherwise the keyword is stored at that level. This guarantees that each keyword and level combination is unique in the keyword table.
The keywords and levels can also be stored as follows: store all keywords first, then compare the keywords within each level and, where a keyword is repeated, delete the later occurrence until no duplicates remain.
The result of steps s1, s2 and s3 is a preset keyword database. Of course, the specific keywords and levels may differ for different websites or different rules.
Step s4: cut the URL of each log record to be analyzed at its separators.
If a word obtained by cutting does not exist at the corresponding level of the keyword table, that word is a dynamic part of the log record's URL. In other words, the keyword table lets the computer identify the dynamic part of a log record's URL automatically. To ensure that the dynamic part identified by the computer agrees with the dynamic identifier recognized by the technician, the cutting rules and level definitions must be the same as those used to build the keyword table. For example, the part between "http://" and the first "/" is level 1 and the level increases by 1 at each subsequent "/"; the URL is cut by level and by the separators "/" and ".".
Open the log file to be analyzed, read one record, and cut the URL in that record. The dynamic part of the URL is replaced with the unified symbol to generate the URL string (url_string). For example, given the URL /www.kk.com/category/12345.html and the keyword table shown in Table 1, the cutting rules above automatically yield the URL string /www.kk.com/category/*.html and the dynamic part "12345", where "/www.kk.com/category/*.html" is the static part of the URL.
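Step s4 can be sketched as a lookup of each cut word in the keyword table. In this illustration the keyword table is a Python set of (level, word) pairs holding bare words, which is an assumption about representation rather than the patent's storage format; `split_static_dynamic` is a hypothetical helper name.

```python
import re

# Assumed in-memory keyword table; a set keeps each (level, word) entry unique.
KEYWORDS = {(1, "www"), (1, "kk"), (1, "com"), (2, "category"), (3, "html")}

def split_static_dynamic(url, keywords, symbol="*"):
    """Return (url_string, dynamic_values) for the URL of one log record."""
    rest = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url).lstrip("/")
    dynamic, segments = [], []
    for level, segment in enumerate(rest.split("/"), start=1):
        words = []
        for word in segment.split("."):
            if word and (level, word) not in keywords:
                dynamic.append(word)  # unknown at this level: a dynamic part
                word = symbol         # replaced by the unified symbol
            words.append(word)
        segments.append(".".join(words))
    return "/" + "/".join(segments), dynamic

print(split_static_dynamic("/www.kk.com/category/12345.html", KEYWORDS))
# ('/www.kk.com/category/*.html', ['12345'])
```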
Step s5: store the URL string in the URL dimension table.
If a word obtained by cutting does not exist at the corresponding level of the keyword table, that is, the word is a dynamic part of the log record's URL, the word is replaced with the unified symbol before the URL is stored in the URL dimension table.
The URL string is checked against the table: if an identical string already exists in the URL dimension table, the existing string is kept; otherwise the URL string is stored in the table. This guarantees that the URL strings in the URL dimension table are not duplicated. The table must also record the storage address (URL ID) of each URL string.
The URL dimension table can also classify the stored URLs (url_type). If the whole URL string consists of keywords, it belongs to the first type, static URLs; if the URL string contains a word that is not in the keyword table, that is, the URL has a dynamic part, it belongs to the second type, dynamic URLs.
For example, URL dimension table:
url_id   url_type   url_string
1        1          /www.kk.com/
2        2          /www.kk.com/category/*.html
Table 2
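The URL dimension table behaves like a de-duplicating lookup that hands out IDs. The class below is an in-memory stand-in written for illustration; the class and method names are hypothetical, and deciding url_type by checking for the unified symbol in the string is an assumption of this sketch.

```python
class UrlDimension:
    """In-memory stand-in for the URL dimension table."""

    def __init__(self):
        self._ids = {}   # url_string -> url_id
        self.rows = []   # (url_id, url_type, url_string)

    def get_or_add(self, url_string, symbol="*"):
        """Return the url_id of url_string, storing it only if it is new."""
        if url_string in self._ids:
            return self._ids[url_string]  # keep the existing entry
        url_id = len(self.rows) + 1
        url_type = 2 if symbol in url_string else 1  # 1 = static, 2 = dynamic
        self._ids[url_string] = url_id
        self.rows.append((url_id, url_type, url_string))
        return url_id

dim = UrlDimension()
dim.get_or_add("/www.kk.com/")
dim.get_or_add("/www.kk.com/category/*.html")
dim.get_or_add("/www.kk.com/category/*.html")  # duplicate, not stored again
print(dim.rows)
# [(1, 1, '/www.kk.com/'), (2, 2, '/www.kk.com/category/*.html')]
```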
Step s6: save the storage address of the URL in the URL dimension table to the user access database. For example, for the URL string /www.kk.com/category/*.html with dynamic part 12345, the storage address (url_id) "2" from Table 2 is stored in the user access database.
Step s7: if the URL contains a word that does not exist in the keyword table, save that word to the user access database. Such a word is what the computer identifies, by means of the keyword table, as the dynamic part of the URL, and its actual value is stored in the user access database. In the example above, the dynamic part is "12345".
The database holds only the storage address of the URL's static part in the URL dimension table, which reduces the amount of data stored. It also holds the actual value of the dynamic part of each log record's URL, so the exact URL to be counted can still be located. A dynamic part may comprise several values, which are stored separately in the database.
The log file of a web server records raw information such as the requests the server receives and processes and the errors that occur at run time. A log record in the combined format used with Apache Server serves as an example below:
218.242.102.121 - - [06/Dec/2002:00:00:00 -0400] "GET /2/face/shnew/ad/via20020915logo.gif HTTP/1.1" 304 0 "http://www.mpsoft.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
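A record in this format can be split into named fields with a regular expression. The pattern below is a sketch written for this example rather than an official Apache definition, the group names are illustrative, and the sample line is written with the field-separating spaces that the combined format uses.

```python
import re

# Sketch of a pattern for one combined-format record; group names are illustrative.
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = (
    '218.242.102.121 - - [06/Dec/2002:00:00:00 -0400] '
    '"GET /2/face/shnew/ad/via20020915logo.gif HTTP/1.1" 304 0 '
    '"http://www.mpsoft.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"'
)

fields = COMBINED.match(line).groupdict()
print(fields["ip"], fields["method"], fields["url"], fields["status"])
# 218.242.102.121 GET /2/face/shnew/ad/via20020915logo.gif 304
```

The `url` field is the value that the cutting step would then process.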
As this record shows, a log record can hold the client's IP address (218.242.102.121), the time of the visit (06/Dec/2002:00:00:00 -0400), the URL of the requested page (/2/face/shnew/ad/via20020915logo.gif), the status code the web server returned for the request (304), the size in bytes of the content returned to the client (0), the referring address of the request (http://www.mpsoft.net/), the client browser type (Mozilla/4.0), and other information.
The "- -" in the log record holds the visitor's identity information and is normally blank; if some content on the website requires users to authenticate, however, this position holds the name and other details the visitor supplied. The trailing "-0400" in the access time means that the server's time zone is 4 hours behind UTC, where UTC denotes Coordinated Universal Time, commonly equated with Greenwich Mean Time. "GET" in the log record is the request method; other valid methods include "POST" and "HEAD". "HTTP/1.1" is the protocol used and its version number.
"304" in the log record is the status code the web server returned for the request. In general, status codes beginning with 2 indicate success, codes beginning with 3 indicate that the request was redirected elsewhere for some reason, codes beginning with 4 indicate a client-side error, and codes beginning with 5 indicate that the server encountered an error. The complete list of status codes and their meanings is well known to those skilled in the art and is not repeated here.
The example shows that a single log record contains a great deal of information. When log records are analyzed, besides saving the address information (url_id) and the value of the URL's dynamic part (Value) to the user access database, much other information about the visit can also be saved to support statistics on access patterns, for example the visitor's region, the time period, and the order in which pages were visited. A concrete user access database is described below:
Session_id   Parent_id   url_order   url_id   Value1
1            -1          1           1
1            1           2           2        12345
1            1           3           2        23456
Table 3
Here, Session_id is the identity information of the visiting user, assigned by the system to distinguish users; Parent_id identifies the URL of the page the user visited before the current page; url_order is the order in which the user visited the pages, that is, the path parameter of the visit; url_id is the storage address in the URL dimension table of the static part of the current page's URL; and Value1 is the actual value of the dynamic part of the current page's URL. The database can hold several dynamic parts of a URL by defining several value fields.
Table 3 thus represents the user whose identity information is "1" visiting the following 3 URLs in order:
(1)/www.kk.com/
(2)/www.kk.com/category/12345.html
(3)/www.kk.com/category/23456.html
Step s8: obtain the relevant data according to the statistical conditions.
The conditions for the analysis differ according to the kind of data wanted. Typical analyses include the total number of visits to the website, the most visited pages, visitors' regions, the time periods in which visitors access the website, average page processing time and average time spent on a page, and page views broken down by time period and region. A concrete example follows, in which a URL is the statistical condition and the page views of that page are counted. Cutting the URLs in the log records yields the user access database shown in the following table:
Session_id   Parent_id   url_order   url_id   Value1
1            -1          1           1
1            1           2           2        1
1            1           3           2        2
2            -1          1           2        2
2            1           2           2        4
Table 4
The URLs to be counted are:
http://www.kk.com/category/1.html
http://www.kk.com/category/2.html
It is then only necessary to extract the data from the database.
To count http://www.kk.com/category/1.html, extract the records with url_id=2 and Value1=1; this page has one page view.
To count http://www.kk.com/category/2.html, extract the records with url_id=2 and Value1=2; this page has two page views.
The remaining cases follow by analogy. That is, once the invention has cut the URLs of the log records and stored the relevant information in the database, statistical analysis only requires extracting the relevant data from the database for each URL to be counted.
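The two counts above can be reproduced with an in-memory SQLite table holding the rows of Table 4. The table name `access` and the lower-case column names are chosen for this sketch; the patent does not prescribe a particular database or schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE access (session_id INTEGER, parent_id INTEGER, "
    "url_order INTEGER, url_id INTEGER, value1 TEXT)"
)
# Rows of Table 4; value1 is NULL where the URL has no dynamic part.
conn.executemany(
    "INSERT INTO access VALUES (?, ?, ?, ?, ?)",
    [
        (1, -1, 1, 1, None),
        (1, 1, 2, 2, "1"),
        (1, 1, 3, 2, "2"),
        (2, -1, 1, 2, "2"),
        (2, 1, 2, 2, "4"),
    ],
)

def page_views(url_id, value1):
    """Count the records that match a static part and a dynamic value."""
    row = conn.execute(
        "SELECT COUNT(*) FROM access WHERE url_id = ? AND value1 = ?",
        (url_id, value1),
    ).fetchone()
    return row[0]

print(page_views(2, "1"), page_views(2, "2"))  # 1 2
```

Counting another URL is one more query against the same table; no log record is read again.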
The extraction can proceed as follows: cut the URL to be counted by delimiter; match it against the URLs stored in the URL dimension table to obtain the corresponding storage address; if the URL to be counted contains a word that does not exist at the corresponding level of the keyword table, replace that word with the unified symbol before matching, and then obtain the corresponding storage address; finally, obtain the relevant data from the storage address together with the words of the URL that do not exist at the corresponding level of the keyword table. In the example above, the URL to be counted is http://www.kk.com/category/1.html. Cutting yields the static part http://www.kk.com/category/*.html and the dynamic part 1 (Value1=1). Matching the static part against the URLs stored in the URL dimension table gives the storage address (url_id=2), so the records with url_id=2 and Value1=1 can be extracted to complete the statistical analysis.
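A minimal Python sketch of this extraction step follows. Several details are assumptions made for illustration and are not specified in the text: the delimiter is taken to be "/", the text after the last "." in a path segment is treated as a file extension that stays in the static part, levels are numbered from 0 at the host, and the scheme is dropped so that templates take the form /www.kk.com/... . The names `cut_url`, `lookup`, `keywords` and `url_dim` are hypothetical.

```python
WILDCARD = "*"  # stands in for the "unified symbol"


def cut_url(url, keyword_table):
    """Cut a URL into a static template and its list of dynamic words."""
    # Drop an optional scheme such as "http://" and any outer slashes.
    body = url.split("://", 1)[-1].strip("/")
    static, dynamic = [], []
    for level, segment in enumerate(body.split("/")):
        stem, suffix = segment, ""
        # Assumption: below the host, text after the last "." is a file
        # extension and stays in the static part.
        if level > 0 and "." in segment:
            stem, ext = segment.rsplit(".", 1)
            suffix = "." + ext
        if stem in keyword_table.get(level, set()):
            static.append(stem + suffix)
        else:
            # Word unknown at this level: replace it and remember it.
            static.append(WILDCARD + suffix)
            dynamic.append(stem)
    return "/" + "/".join(static), dynamic


def lookup(url, keyword_table, url_dim):
    """Return (url_id, dynamic words) for a URL to be counted."""
    template, dynamic = cut_url(url, keyword_table)
    return url_dim.get(template), dynamic


keywords = {0: {"www.kk.com"}, 1: {"category"}}  # level -> known words
url_dim = {"/www.kk.com/category/*.html": 2}     # template -> url_id
print(lookup("http://www.kk.com/category/1.html", keywords, url_dim))
# (2, ['1'])
```

The returned pair corresponds to url_id=2 and Value1=1 in the example, which is all that is needed to select the matching records.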
To determine the order in which user "1" visited pages, that is, to perform path analysis on user "1", extract the records with Session_id=1 from the database and analyze their url_order, url_id and Value1 fields. The data in Table 4 show that the user whose identity information is "1" visited the following three URL pages in order:
(1)/www.kk.com/
(2)/www.kk.com/category/1.html
(3)/www.kk.com/category/2.html
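The path reconstruction above can be sketched as follows. The rows are taken from Table 4 (without the Parent_id column); the mapping from url_id to stored URL is inferred from the example rather than quoted from a table, and the name `visit_path` is hypothetical.

```python
# (session_id, url_order, url_id, value1) for each row of Table 4.
ROWS = [
    (1, 1, 1, None),
    (1, 2, 2, "1"),
    (1, 3, 2, "2"),
    (2, 1, 2, "2"),
    (2, 2, 2, "4"),
]

# URL dimension table, inferred from the example: url_id -> stored URL.
URL_DIM = {1: "/www.kk.com/", 2: "/www.kk.com/category/*.html"}


def visit_path(rows, url_dim, session_id):
    """Rebuild the pages one session visited, ordered by url_order."""
    hits = sorted((r for r in rows if r[0] == session_id), key=lambda r: r[1])
    path = []
    for _, _, url_id, value1 in hits:
        template = url_dim[url_id]
        if value1 is not None:
            # Put the stored dynamic word back in place of the unified symbol.
            template = template.replace("*", value1, 1)
        path.append(template)
    return path


for page in visit_path(ROWS, URL_DIM, 1):
    print(page)
```

For session 1 this prints the three pages listed above, in the same order.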
Path analysis of a user's visits gives a good picture of user behavior and thus provides reference data for business decisions. For example, it can show how many page views an advertisement placed on Google brought to the website, how many of the visitors it brought became members of the website, how many visitors sent purchase information, and so on. With this information, the website can better evaluate the effect of advertising on Google.
To add a new URL as a statistical condition, cut the URL to be added by delimiter. If a word obtained by cutting does not exist at the corresponding level of the keyword table, store that word in the keyword table, which completes the addition. If the words obtained by cutting already exist at the corresponding level of the keyword table, no operation is needed. For example, if the list of URLs to be counted originally contains http://www.kk.com/category/1.html and http://www.kk.com/category/12345.html now needs to be added, no operation is needed and the analysis can proceed directly.
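The following Python sketch illustrates adding a URL as a statistical condition. It rests on assumptions that the text does not spell out: the delimiter is "/", file extensions are ignored, levels are numbered from 0 at the host, and an all-digit word is treated as a dynamic mark that is replaced by the unified symbol before the keyword table is consulted. That last assumption is what makes the 12345.html example above require no operation. The names `words_by_level` and `add_statistic_url`, and the "news" URL, are hypothetical.

```python
WILDCARD = "*"  # stands in for the "unified symbol"


def words_by_level(url):
    """Cut a URL into (level, word) pairs."""
    body = url.split("://", 1)[-1].strip("/")
    words = []
    for level, segment in enumerate(body.split("/")):
        if level > 0 and "." in segment:
            segment = segment.rsplit(".", 1)[0]  # drop the file extension
        # Assumption: an all-digit word is a dynamic mark.
        words.append((level, WILDCARD if segment.isdigit() else segment))
    return words


def add_statistic_url(url, keyword_table):
    """Store each word missing at its level; return the words added."""
    added = []
    for level, word in words_by_level(url):
        known = keyword_table.setdefault(level, set())
        if word not in known:
            known.add(word)
            added.append((level, word))
    return added


# Keyword table formed from http://www.kk.com/category/*.html.
keyword_table = {0: {"www.kk.com"}, 1: {"category"}, 2: {WILDCARD}}

print(add_statistic_url("http://www.kk.com/category/12345.html", keyword_table))
# []
print(add_statistic_url("http://www.kk.com/news/7.html", keyword_table))
# [(1, 'news')]
```

Only the keyword table changes when a statistical URL is added; the stored log data is left as it is.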
Analyzing historical data likewise does not require cutting and matching the historical log records again. Because the present invention has stored every URL that visitors accessed in the user access database, the historical page views of a URL can be obtained directly from the dynamic information stored in the database.
For example, take the URL /www.kk.com/category/12345.html. If category did not exist in the keyword table before the analysis, then under the cutting rules this URL is cut into the static part /www.kk.com/*/*.html, dynamic part 1: category, and dynamic part 2: 12345. Matching the static part against the URLs stored in the URL dimension table gives the corresponding storage address; suppose it is URL_ID=3. The historical page views of this URL are then obtained by extracting the records with URL_ID=3, Value1=category and Value2=12345 from the database.
The present invention also provides a log analysis system, comprising: a keyword database, for storing keywords and their corresponding levels; a URL database, for storing the URLs of the log records, in which any word of a URL that does not exist at the corresponding level of the keyword table is replaced by the unified symbol; a user access database, for storing the address of each URL in the URL database, the words of the URL that do not exist at the corresponding level of the keyword table, and the identity information of the visiting user; a cutting module, for cutting URLs according to a given rule; and a statistical analysis module, for obtaining the relevant data according to the statistical condition. The keyword database is preset; the cutting module cuts each URL according to the keywords and stores the resulting information in the keyword database, the URL database or the user access database; the statistical analysis module then obtains the relevant data from the user access database according to the statistical condition. The user access database may also contain the path parameters of the pages (URLs) the user visited, as well as information such as access time and visitor region.
The log file analysis method and system provided by the present invention have been described in detail above. Specific examples have been used to explain the principle and embodiments of the invention, and the description of the embodiments is only intended to help in understanding the method and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, this description should not be construed as limiting the present invention.

Claims (10)

1. A log analysis method, characterized in that it comprises:
A. presetting a keyword table;
B. cutting the uniform resource locator (URL) of a log record by delimiter;
C. judging whether the URL contains a word that does not exist in the keyword table;
D. if it does not, storing the URL in a URL dimension table, and saving the storage address of the URL in the URL dimension table to a user access database;
E. if the URL contains a word that does not exist in the keyword table, replacing that word with a unified symbol and then storing the URL in the URL dimension table, and saving the word and the storage address of the URL in the URL dimension table to the user access database;
F. obtaining the relevant data according to a statistical condition.
2. The log analysis method as claimed in claim 1, characterized in that it further comprises:
saving the path parameters of the user's visit to the website to the user access database.
3. The log analysis method as claimed in claim 1 or 2, characterized in that it further comprises:
replacing the dynamic marks of the URLs in the website with the unified symbol;
cutting the tidied URLs by delimiter;
storing the cutting results to form the keyword table.
4. The log analysis method as claimed in claim 3, characterized in that the cutting results comprise words and their corresponding levels.
5. The log analysis method as claimed in claim 4, characterized in that it further comprises:
cutting a URL to be counted by delimiter and, if a word obtained by the cutting does not exist at the corresponding level of the keyword table,
storing the word obtained by the cutting in the keyword table.
6. The log analysis method as claimed in claim 1 or 2, characterized in that it further comprises:
if an identical URL already exists in the URL dimension table, keeping the original URL;
otherwise, storing the URL in the URL dimension table.
7. The log analysis method as claimed in claim 4, characterized in that step F comprises:
cutting the URL to be counted by delimiter;
matching it against the URLs stored in the URL dimension table to obtain the corresponding storage address; if the URL to be counted contains a word that does not exist at the corresponding level of the keyword table, replacing that word with the unified symbol before matching, and obtaining the corresponding storage address;
obtaining the relevant data according to the storage address and the words of the URL that do not exist at the corresponding level of the keyword table.
8. The log analysis method as claimed in claim 1 or 2, characterized in that step F comprises:
storing user identity information in the user access database;
obtaining the relevant data according to the identity information of the user to be analyzed.
9. A log analysis system, characterized in that it comprises:
a keyword database, for storing keywords and their corresponding levels;
a URL database, for storing the URLs of log records, in which any word of a URL that does not exist at the corresponding level of the keyword table is replaced by a unified symbol;
a user access database, for storing the address of each URL in the URL database, the words of the URL that do not exist at the corresponding level of the keyword table, and the identity information of the visiting user;
a cutting module, for cutting URLs according to a given rule;
a statistical analysis module, for obtaining the relevant data according to a statistical condition.
10. The log analysis system as claimed in claim 9, characterized in that the user access database further comprises the path parameters of the pages (URLs) visited by the user.
CNB2005101324865A 2005-12-26 2005-12-26 Log analyzing method and system Active CN100394727C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005101324865A CN100394727C (en) 2005-12-26 2005-12-26 Log analyzing method and system

Publications (2)

Publication Number Publication Date
CN1791022A true CN1791022A (en) 2006-06-21
CN100394727C CN100394727C (en) 2008-06-11

Family

ID=36788545

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005101324865A Active CN100394727C (en) 2005-12-26 2005-12-26 Log analyzing method and system

Country Status (1)

Country Link
CN (1) CN100394727C (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1150717C (en) * 2001-06-21 2004-05-19 华为技术有限公司 Journal management system of integrated network manager
JP2005190065A (en) * 2003-12-25 2005-07-14 Nippon Telegr & Teleph Corp <Ntt> User terminal for information retrieval and collection, information retrieval and collection system, and information retrieval and collection method
CN100518076C (en) * 2004-01-02 2009-07-22 联想(北京)有限公司 Journal accounting method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C56 Change in the name or address of the patentee

Owner name: ALIBABA GROUP HOLDINGS LIMITED

Free format text: FORMER NAME OR ADDRESS: ALIBABA CO.

CP03 "change of name, title or address"

Address after: Cayman Islands Grand Cayman capital building, a four storey box No. 847

Patentee after: Alibaba Group Holding Co., Ltd.

Address before: Grand Cayman, Georgetown, Cayman Islands

Patentee before: Alibaba Co.

TR01 Transfer of patent right

Effective date of registration: 20201224

Address after: Room 701-2, 528 Yan'an Road, Xiacheng District, Hangzhou City, Zhejiang Province

Patentee after: ZHEJIANG INTIME E-COMMERCE Co.,Ltd.

Address before: Cayman Islands Grand Cayman capital building, a four storey box No. 847

Patentee before: Alibaba Group Holding Ltd.