Summary of the invention
In view of this, the object of the invention is to provide a log information management method and system that effectively resolve the latency a server incurs when analyzing logs, so that operators can learn the operating status of a website at the earliest opportunity.
To achieve this object, the invention provides the following technical solution:
A website log information management system, comprising: a log information management unit, a log information extraction and formatting unit, a unified data centralization unit, and a website log information providing unit, wherein:
The log information management unit is configured to: set the log format of servers of the same type to a preset format, periodically intercept and save log information, select website log information from it, and store that information in a log recovery server set up in advance;
The log information extraction and formatting unit is configured to: format the website log information stored in the log recovery server;
The unified data centralization unit is configured to: classify the formatted website log information, form multiple data marts, and store them;
The website log information providing unit is configured to: provide the corresponding website log information upon receiving a request to view website log information.
Preferably, the log recovery server comprises a first-level log recovery point and a second-level log recovery point;
The first-level log recovery point is used to store the website log information that is selected, from among all website log information, for its good bandwidth conditions;
The second-level log recovery point is used to store the website log information other than that stored in the first-level log recovery point.
Preferably, the first-level log recovery point and/or the second-level log recovery point stores data using RAID 6 with partitioned virtual volumes.
Preferably, the log information extraction and formatting unit comprises:
An extraction unit, for extracting website log information;
A conversion unit, for converting the website log information extracted by the extraction unit into website log information in a predetermined format;
A loading unit, for storing the website log information in the predetermined format.
Preferably, the log information extraction and formatting unit further comprises a trigger, for generating the trigger signals that control the operation of the extraction unit, the conversion unit, and the loading unit.
Preferably, the trigger comprises a row trigger and a table trigger.
Preferably, the website log information processed by the log information extraction and formatting unit comprises basic data layer data, granularity amplification layer data, and data mart layer data.
A website log information management method, comprising:
Configuring the log information format of servers of the same type;
Periodically intercepting and saving log information;
Selecting website log information from the intercepted log information and storing it in a log recovery server set up in advance;
Formatting the website log information stored in the log recovery server, converting it into website log information that conforms to a predetermined format;
Classifying the website log information in the predetermined format into multiple data marts and storing them, so that the corresponding website log information can be provided to the operator when a request to view website log information is received.
Preferably, storing the website log information in the log recovery server set up in advance comprises:
Presetting a first-level log recovery point and a second-level log recovery point, so that log information is recovered by level;
Storing, in the first-level log recovery point, the website log information selected from all the data for its good bandwidth conditions;
Storing, in the second-level log recovery point, the data other than the data stored in the first-level log recovery point.
Preferably, before the website log information is formatted, the method further comprises:
Dividing the website log information into basic data layer data, granularity amplification layer data, and data mart layer data; subdividing each of these layers into further levels; and linking all the data within each layer into a sequence, to facilitate processing of the data.
As can be seen from the above technical solution, the invention configures the logs on the servers uniformly and sets up a log recovery server in advance. The log recovery server periodically stores website log information, the log information in the recovery server is formatted, and the processed log information is stored centrally in one place, where website maintainers (also called website operators) can obtain and analyze it at any time. That is, when a website maintainer needs to understand the current operating status of the website, the formatted website log information can be obtained and analyzed directly, which reduces processing delay and lets the maintainer understand the current operating status of the website in time.
Embodiment
The technical solution in the embodiments of the invention is described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the scope of protection of the invention.
The invention discloses a log information management system which, as shown in Figure 1, comprises: a log information management unit 1, a log information extraction and formatting unit 2, a unified data centralization unit 3, and a website log information providing unit 4.
The log information management unit 1 configures a unified log format for each type of Internet server (an Apache web server, a Squid cache server, an FTP file management server, a streaming media origin server, or a streaming media playback server), which ensures that logs are generated in a uniform format across servers of the same type. The log configuration follows the standard W3C format of the Internet as far as possible, so that later processing needs as little extraction and conversion as possible, and the logs are stored on a volume with ample space. Because the logs are updated continuously while the website is running, the website logs are intercepted periodically, which ensures that operators can be given the operating status of the website for each period.
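As an illustration of why a uniform, W3C-style format reduces later extraction and conversion work, the following minimal sketch reads entries in the W3C Extended Log File Format, in which a `#Fields:` directive names the columns of the entries that follow. The function name `parse_w3c_extended` and the sample data are hypothetical and are not part of the disclosed system; the sketch also assumes that no field value contains a space.

```python
from typing import Dict, Iterable, Iterator, List


def parse_w3c_extended(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per entry of a W3C Extended Log File Format stream.

    Simplification (assumption): fields are split on whitespace, so quoted
    values that contain spaces are not supported.
    """
    fields: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line; only "#Fields:" changes how entries are read.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed entries rather than guess at them
        yield dict(zip(fields, values))


# Hypothetical sample: one well-formed entry and one truncated entry.
SAMPLE = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status sc-bytes",
    "2024-01-01 00:00:05 192.0.2.10 GET /index.html 200 5120",
    "2024-01-01 00:00:07 192.0.2.11 GET /missing 404",
]
records = list(parse_w3c_extended(SAMPLE))
```

Because every server of a given type writes the same `#Fields:` layout, one reader of this kind can serve the whole group of servers.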
When a server is accessed, its logs are inevitably generated at a fixed and unique location, which complicates log recovery: a mechanism is needed under which the continuously running log recovery system and the servers that provide uninterrupted external service do not interfere with each other. Therefore, provided the servers are not affected, an independent log recovery system and channel can be created. A location that balances the two dimensions of physical server distribution and network quality is chosen for placing a log recovery server, and multiple such log recovery servers across the whole network make up a complete log recovery server group. Because log storage should also be redundant, the whole network can be divided into several large regions, with log servers in a master-slave relationship placed both between and within the regions; these form the first-level log recovery points. Backbone nodes among all the data centers (ample bandwidth and relatively the lowest latency to each data center) are chosen to create the second-level log recovery points. Each log recovery point is provisioned with storage of 50 times the maximum daily output of the logs it collects, and uses RAID 6 with partitioned virtual volumes to keep the stored data safe. Each log recovery point also needs an off-site hot standby that is synchronized in real time using rsync, so that a single recovery point becoming inaccessible has no effect on log preservation.
The log information extraction and formatting unit 2 further comprises an extraction unit, a conversion unit, and a loading unit, which extract, transform, and load the collected logs, that is, apply ETL technology. The extraction unit extracts website log information; the conversion unit converts the website log information extracted by the extraction unit into website log information in a predetermined format; the loading unit stores the website log information in the predetermined format.
We divide the whole export process into several levels and, following the principles of inheritance, layer-skipping, and irreversibility, stipulate that every Job may only call downward: upward calls are never allowed, but downward calls that skip layers are permitted. In the ETL tool DataStage, each Job can therefore be placed in the level that matches the stage of the ETL process it belongs to, and the relationships between levels constrain the relationships between Jobs, which keeps the call relationships among Jobs clear.
ETL data processing can therefore be divided into three levels: the basic data layer, the granularity amplification layer, and the data mart layer (that is, Jobs in the granularity amplification layer can run only after the Jobs in the basic data layer have completed, and Jobs in the data mart layer can run only after granularity amplification). Each layer is further subdivided into several smaller levels. Within each layer all Jobs are linked together by a sequence, so that when the Jobs distributed across the layers are run each day, it is only necessary to run the sequence of each layer in order from low to high. This ensures that the data produced by each Job is consistent and avoids data inconsistency caused by a confused Job calling order.
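The layer-by-layer ordering described above can be sketched as follows. This is a minimal illustration in plain Python, not DataStage; the layer identifiers, the `run_daily` function, and the three demo jobs are hypothetical names introduced only for this example.

```python
from typing import Callable, Dict, List

# Fixed layer order: basic data first, data mart last (names are assumptions).
LAYERS = ["basic_data", "granularity_amplification", "data_mart"]


def run_daily(sequences: Dict[str, List[Callable[[], None]]]) -> List[str]:
    """Run each layer's job sequence in layer order and return the run order."""
    executed: List[str] = []
    for layer in LAYERS:
        for job in sequences.get(layer, []):
            job()
            executed.append(f"{layer}:{job.__name__}")
    return executed


store: Dict[str, object] = {}


def load_raw() -> None:
    # Basic data layer: (domain, bytes served) pairs taken from formatted logs.
    store["raw"] = [("a.example", 100), ("a.example", 50), ("b.example", 70)]


def sum_by_domain() -> None:
    # Granularity amplification layer: aggregate the basic data per domain.
    totals: Dict[str, int] = {}
    for domain, size in store["raw"]:
        totals[domain] = totals.get(domain, 0) + size
    store["by_domain"] = totals


def traffic_mart() -> None:
    # Data mart layer: a report-ready view built from the aggregated data.
    store["mart"] = sorted(store["by_domain"].items())


# The dict order below is deliberately scrambled; LAYERS decides the run order.
order = run_daily({
    "data_mart": [traffic_mart],
    "basic_data": [load_raw],
    "granularity_amplification": [sum_by_domain],
})
```

Since each job reads only what an earlier layer has already written, running the sequences in this fixed order is enough to keep the results consistent.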
Because the ETL process handles massive volumes of log information and involves multiple systems, which are often core systems, the techniques used must minimize the impact on the performance and reliability of the source systems. The implementation therefore uses techniques in the following areas:
A trigger is a special type of stored procedure that is executed when an event fires it. During log extraction, different trigger mechanisms are created for the different types of log format, and they convert the log information into a unified format.
Triggers come in two kinds, row triggers and table triggers:
Row trigger: when fired, this kind of trigger locks only the row that fired it, and the other rows in the table remain available for operations; however, it cannot modify the table itself while firing.
Table trigger: when fired, this kind of trigger locks the whole table, so every operation on the table other than retrieval is blocked; however, it cannot obtain the data as it was before or after the update.
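A row-level trigger that converts incoming log rows to one format can be sketched with Python's built-in `sqlite3` module. The source does not name a database, so SQLite is an assumption made purely for illustration, as are the table names `raw_log` and `unified_log` and the trigger name `unify_on_insert`. SQLite supports only `FOR EACH ROW` triggers, so the table-level kind is not shown.

```python
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_log (server_type TEXT, line TEXT);
    CREATE TABLE unified_log (server_type TEXT, line TEXT);

    -- Fires once for each inserted row and writes a normalized copy.
    CREATE TRIGGER unify_on_insert
    AFTER INSERT ON raw_log
    FOR EACH ROW
    BEGIN
        INSERT INTO unified_log (server_type, line)
        VALUES (lower(NEW.server_type), trim(NEW.line));
    END;
    """
)

# Two hypothetical raw rows with inconsistent casing and stray whitespace.
conn.executemany(
    "INSERT INTO raw_log (server_type, line) VALUES (?, ?)",
    [("Apache", "  GET /index.html 200 "), ("SQUID", "GET /img.png 304")],
)
conn.commit()

rows = conn.execute(
    "SELECT server_type, line FROM unified_log ORDER BY rowid"
).fetchall()
```

Each inserted row is handled on its own, which is the property that lets several loaders work side by side without waiting on a lock that covers the whole table.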
We can therefore segment and intercept the logs at the early stage of log generation, which ensures that the later ETL process never encounters an update to the log information; it is only necessary to ensure that the log files reach the ETL stage in time.
At the same time, to ensure that ETL can process concurrently, we chose row triggers, which makes it possible for multiple ETL processes to each handle one file and thus raises system efficiency as far as possible. It also reduces the impact that the failure of a single ETL process may have on the overall ETL run.
To improve export efficiency, an incremental export scheme is adopted. Incremental export requires a starting time for the increment; with this starting time the system has a starting point from which to find newly changed records. Each later run only needs to export the records changed after this time point and, once the export is confirmed successful, update the time point.
To ensure data quality, automatic handling is required. Its principle is: delete the records later than the time point of the last successful increment, then incrementally export the new records again.
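The incremental export and its automatic clean-up can be sketched as below. The record layout (a change time in epoch seconds plus a payload), the `state` dictionary that holds the time point, and the function name `incremental_export` are assumptions made for the example, not details from the source.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str]  # (change time in epoch seconds, payload)


def incremental_export(
    source: List[Record], target: List[Record], state: Dict[str, int]
) -> int:
    """Export records changed after the last successful time point.

    Returns the number of records exported in this run.
    """
    watermark = state.get("last_success", 0)
    # Automatic clean-up: drop anything later than the last successful time
    # point, i.e. whatever a failed earlier run may have left behind.
    target[:] = [r for r in target if r[0] <= watermark]
    # Export only the records that changed after the time point.
    new_records = sorted(r for r in source if r[0] > watermark)
    target.extend(new_records)
    # Advance the time point only after the export has completed.
    if new_records:
        state["last_success"] = new_records[-1][0]
    return len(new_records)


source = [(100, "a"), (200, "b"), (300, "c")]
target = [(100, "a"), (200, "partial")]  # residue of a failed earlier run
state = {"last_success": 100}

first = incremental_export(source, target, state)
second = incremental_export(source, target, state)
```

Running the function a second time exports nothing and leaves the target unchanged, which is what makes retrying after a failure safe.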
The unified data centralization unit 3 classifies the formatted website log information, forms multiple data marts, and stores them.
For the construction of data marts, ETL creates a platform that is complete, reflects historical change, and has a consistent structure. Establishing such a data warehouse platform lays the foundation for developing subject-oriented data marts according to user requirements.
The value of every log analysis report is realized through the users' design. The users are the business experts, and the experts should take the standpoint of system implementation and help the users of each query report work out the role of each data item in the report, then design a suitable data structure; this is how a data mart is formed.
The requirements for log analysis reports usually come from two sources: industry experts, and the people who use and query the analysis reports. Wherever a requirement comes from, it is a definition of the roles that different data play in different dimensions. These role definitions make up a set of data combinations, and combining the data sensibly yields a complete set of data marts. As long as any given data item can be obtained from the original logs, the ETL process can concentrate it into a data mart; and because the ETL process is relatively independent, the overall analysis system can be extended quickly.
The website log information providing unit 4 provides the corresponding website log information upon receiving a request to view website log information.
When an operator needs to maintain or manage the website, the operator retrieves the website's log information and adjusts the website on the basis of an analysis of that information. The operator submits a command to view the logs; on receiving the command, the website log information providing unit retrieves the log information stored in the unified data centralization unit and provides it to the operator.
The invention further discloses a method corresponding to the above system, comprising the following steps:
Step 21: configure the log information format of servers of the same type.
The data from which operators analyze website operation comes from the access logs of the various Internet servers, so the design and configuration of the logs are crucial to later centralized extraction and conversion. Internet servers are designed with the industry's log standards in mind, which makes it possible to configure the log sources of the analysis system uniformly.
A unified log format is configured for each type of Internet server, which ensures that logs are generated in a uniform format across servers of the same type.
Step 22: periodically intercept log information and store the intercepted log information in storage.
Because the logs are updated continuously, intercepting them is very important: intercepting the logs in real time yields the operating status of the website at that moment.
Step 23: select website log information from the intercepted log information and store it in the log recovery server.
When a server is accessed, its logs are inevitably generated at a fixed and unique location, which complicates log recovery: a mechanism is needed under which the continuously running log recovery system and the servers that provide uninterrupted external service do not interfere with each other. Therefore, provided the servers are not affected, an independent log recovery system and channel can be created. A location that balances the two dimensions of physical server distribution and network quality is chosen for placing a log recovery server, and multiple such log recovery servers across the whole network make up a complete log recovery server group. Because log storage should also be redundant, the whole network can be divided into several large regions, with log servers in a master-slave relationship placed both between and within the regions, so that a single point of failure does not interrupt log recovery. To keep the storage of each individual log recovery server safe, the storage architecture is built on latest-generation SATA hard disks and RAID 6 with partitioned virtual volumes.
The log recovery system consists mainly of the log recovery servers distributed across the regions and a central log storage server. So that the original logs can be recomputed, the logs are kept for 7 days in each region, while the central log storage server stores them permanently.
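The two retention policies can be sketched as follows: a regional index that is pruned after 7 days, and a central index that is never pruned. Representing each store as a dictionary from file name to creation time, and the names `REGIONAL_RETENTION` and `prune_regional`, are assumptions made for this example only.

```python
from datetime import datetime, timedelta
from typing import Dict, List

REGIONAL_RETENTION = timedelta(days=7)  # regional copies are kept for 7 days


def prune_regional(files: Dict[str, datetime], now: datetime) -> List[str]:
    """Remove regional log files older than the retention window.

    Returns the names of the files that were removed.
    """
    expired = sorted(
        name for name, created in files.items()
        if now - created > REGIONAL_RETENTION
    )
    for name in expired:
        del files[name]
    return expired


# Hypothetical file index for one region.
regional = {
    "a.log": datetime(2024, 1, 1),
    "b.log": datetime(2024, 1, 5),
    "c.log": datetime(2024, 1, 3),
}
central = dict(regional)  # the central store keeps every file permanently

removed = prune_regional(regional, now=datetime(2024, 1, 10))
```

A file that is exactly 7 days old is kept in this sketch; whether the boundary is inclusive is a choice the source does not specify.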
Step 24: convert the website log information stored in the log recovery server into website log information that conforms to the predetermined format.
Once logs have been delivered to the log recovery server, the question is how to split, extract, and convert them and load them into the data warehouse. Because the log analysis system requires some of its analysis content, such as traffic bandwidth and visitor counts, to be timely, the efficiency of this step is particularly important. At this stage the logs are therefore split in multiple stages. All logs are first deduplicated, cleared of erroneous entries, and formatted; this stage runs once every 5 minutes, and the centrally collected logs are stamped with a timestamp before they are processed. Different extraction and formatting standards are applied to the different types of log. After the first-stage splitting and extraction, the logs are classified by type, by region, and by service domain name. A second extraction is then performed on the classified logs, in which the log information is classified by granularity and formatted, so that the data mart computation is supplied with data that is as accurate and well formatted as possible.
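The two-stage splitting described above can be sketched as follows. The five-field line layout (server type, region, domain, path, bytes), the function names, and the sample data are all hypothetical; only the 5-minute interval, the deduplication and error removal, the timestamping, and the classification by type, region, and service domain come from the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

BATCH_SECONDS = 300  # the first stage runs once every 5 minutes
Key = Tuple[str, str, str]  # (server type, region, service domain)


def batch_start(epoch_seconds: int) -> int:
    """Round a collection time down to the start of its 5-minute batch."""
    return epoch_seconds - epoch_seconds % BATCH_SECONDS


def parse(line: str) -> Optional[dict]:
    """Parse 'type region domain path bytes'; return None for bad lines."""
    parts = line.split()
    if len(parts) != 5 or not parts[4].isdigit():
        return None
    server_type, region, domain, path, size = parts
    return {"server_type": server_type, "region": region,
            "domain": domain, "path": path, "bytes": int(size)}


def first_stage(lines: Iterable[str], collected_at: int) -> Dict[Key, List[dict]]:
    """Deduplicate, drop bad lines, stamp the batch time, and classify."""
    groups: Dict[Key, List[dict]] = defaultdict(list)
    seen = set()
    for line in lines:
        if line in seen:
            continue  # duplicate removal
        seen.add(line)
        record = parse(line)
        if record is None:
            continue  # error removal
        record["batch"] = batch_start(collected_at)
        key = (record["server_type"], record["region"], record["domain"])
        groups[key].append(record)
    return dict(groups)


def second_stage(groups: Dict[Key, List[dict]]) -> Dict[Key, int]:
    """Second extraction: total bytes per class, at a coarser granularity."""
    return {key: sum(r["bytes"] for r in recs) for key, recs in groups.items()}


LINES = [
    "apache north a.example /index.html 5120",
    "apache north a.example /index.html 5120",  # exact duplicate
    "squid south b.example /img.png 2048",
    "apache north a.example /about.html 1024",
    "garbage line",                              # malformed entry
]
groups = first_stage(LINES, collected_at=1700000123)
totals = second_stage(groups)
```

Keeping the first stage to cheap per-line work is one way to fit a 5-minute cycle, with the heavier aggregation left to the second stage.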
Step 25: classify the website log information in the predetermined format into multiple data marts and store them, so that the corresponding website log information can be provided to the operator when a request to view website log information is received.
It should be noted that, as those skilled in the art will readily understand, the terms website maintainer, website operator, and administrator used above all denote the same concept, namely a person who needs to process website logs; this is not elaborated further here.
As can be seen from the above, the embodiments of the invention acquire website log information periodically, format it, and store it centrally. When a website maintainer needs to understand the current operating status of the website, the formatted website log information can be obtained and analyzed directly, with no formatting step required at that point, which reduces processing delay. The website maintainer can thus understand the current operating status of the website in time and make more effective operating decisions.
The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.