CN103595571B - Preprocessing method, apparatus and system for web logs - Google Patents

Info

Publication number: CN103595571B
Authority: CN (China)
Prior art keywords: log, data, stream, server
Legal status: Active
Application number: CN201310591082.7A
Other languages: Chinese (zh)
Other versions: CN103595571A
Inventors: 何恺铎, 饶峰云
Current Assignee: Beijing Gridsum Technology Co Ltd
Original Assignee: Beijing Gridsum Technology Co Ltd
Application filed by: Beijing Gridsum Technology Co Ltd
Priority: CN201310591082.7A
Publications: CN103595571A (application), CN103595571B (grant)

Abstract

The invention discloses a preprocessing method, apparatus and system for web logs. The method includes: reading original logs from cluster servers; merging and sorting the original logs to obtain an intermediate log stream; and splitting the intermediate log stream to obtain preprocessed logs. The invention addresses the problem in the prior art that repeated read and write operations make web log preprocessing time-consuming, so that log processing is slow and inefficient. It completes the preprocessing of log data with a single read and a single write, reducing processing time and the number of intermediate files, and thereby improves log processing efficiency.

Description

Preprocessing method, apparatus and system for web logs
Technical field
The present invention relates to the field of data processing, and in particular to a preprocessing method, apparatus and system for web logs.
Background
With the development of the Internet, the number of Internet users keeps growing and website traffic keeps rising. A single server cannot handle a large volume of website traffic, so the common practice is to use a load-balanced cluster: one or more front-end load balancers distribute the workload across a group of back-end servers, and the back-end servers receive the requests and record logs. As traffic rises, log files keep growing, yet the required processing time for those log files is not relaxed. How to improve the efficiency of log file processing has therefore become a problem the field must face.
The earliest log processing method reads the original log files directly and then analyzes the data in them. This is very inefficient, because every different analysis has to re-read all the original logs. Commonly used log processing methods today consist of two parts: preprocessing and subsequent statistical analysis. The preprocessing part is shared by all subsequent statistical analyses and usually includes three main processes: splitting, merging and sorting. Splitting is needed because subsequent statistical analysis may target only logs with a particular identifier. Merging is needed because the original logs are distributed across multiple cluster servers and must be analyzed together. Sorting is needed because the analysis must consider the order and causality of events. These three requirements are very common. The existing practice is as follows: first, on the computer cluster, the original logs are split into multiple identifier files according to some shared identifier (for example, a user identifier); then, on the log processing server, the identifier files on the cluster servers are read and the identifier files with the same identifier are merged into one file, the target file; finally, the log information in the target file is sorted by time to generate the preprocessed log file.
In the prior art, the splitting, merging and sorting of log files are isolated from one another, and intermediate files such as identifier files and target files are generated along the way. This leads to multiple file read and write operations. Because the volume of log data is large and the intermediate files are numerous, reading and writing are very time-consuming, which reduces the efficiency of log processing as a whole.
No effective solution has yet been proposed for the problem in the prior art that repeated read and write operations make web log preprocessing time-consuming, so that log processing is slow and inefficient.
Summary of the invention
In view of the problem in the related art that repeated read and write operations make web log preprocessing time-consuming, so that log processing is slow and inefficient, and for which no effective solution has yet been proposed, the main object of the present invention is to provide a preprocessing method, apparatus and system for web logs that solve this problem.
To achieve the above object, according to one aspect of the present invention, a preprocessing method for web logs is provided. The method includes: reading original logs from cluster servers; merging and sorting the original logs to obtain an intermediate log stream; and splitting the intermediate log stream to obtain preprocessed logs.
Further, the step of reading original logs from the cluster servers includes: reading the original logs from the cluster servers in parallel, line by line, as data streams; and saving all the log records read in parallel into a log set.
Further, the step of merging and sorting the original logs to obtain the intermediate log stream includes: sorting the log records in the log set to obtain a data sequence; outputting the log record with the earliest time in the data sequence; reading from the source server the log record that follows the earliest log record and adding it to the data sequence; and returning to the step of outputting the log record with the earliest time in the data sequence, until all log records have been output, to obtain the intermediate log stream. Here, the server from which the earliest log record originates is taken as the source server.
Further, the step of reading from the source server the log record that follows the earliest log record and adding it to the data sequence also includes: after all the log records on the source server have been read, closing the log stream of the source server.
Further, the step of splitting the intermediate log stream to obtain preprocessed logs includes: obtaining the user identifiers in the intermediate log stream; and splitting the intermediate log stream according to the user identifiers to obtain the preprocessed logs.
To achieve the above object, according to another aspect of the present invention, a preprocessing apparatus for web logs is provided. The apparatus includes: a first reading module, for reading original logs from cluster servers; a merge-and-sort module, for merging and sorting the original logs to obtain an intermediate log stream; and a splitting module, for splitting the intermediate log stream to obtain preprocessed logs.
Further, the first reading module includes: a parallel reading module, for reading the original logs from the cluster servers in parallel, line by line, as data streams; and a saving module, for saving all the log records read in parallel into a log set.
Further, the merge-and-sort module includes: a sorting module, for sorting the log records in the log set to obtain a data sequence; an output module, for outputting the log record with the earliest time in the data sequence; a supplementing module, for reading from the source server the log record that follows the earliest log record and adding it to the data sequence; and a return-and-execute module, for returning to the step of outputting the log record with the earliest time in the data sequence, until all log records have been output, to obtain the intermediate log stream. Here, the server from which the earliest log record originates is taken as the source server.
Further, the supplementing module includes: a closing module, for closing the log stream of the source server after all the log records on the source server have been read.
Further, the splitting module includes: an obtaining module, for obtaining the user identifiers in the intermediate log stream; and a splitting submodule, for splitting the intermediate log stream according to the user identifiers to obtain the preprocessed logs.
To achieve the above object, according to another aspect of the present invention, a preprocessing system for web logs is provided. The system includes: multiple cluster servers; and a log processing server, connected to the multiple cluster servers, for reading original logs from the cluster servers and, after merging and sorting the original logs to obtain an intermediate log stream, splitting the intermediate log stream to obtain preprocessed logs.
Further, the log processing server includes: a reading device, connected to the multiple cluster servers, for reading the original logs from the cluster servers in parallel, line by line, as data streams.
Further, the log processing server includes: a processor, connected to the reading device, for sorting the log records in the log set to obtain a data sequence, outputting the log record with the earliest time in the data sequence, then reading from the source server the log record that follows the earliest log record and adding it to the data sequence, and returning to the step of outputting the log record with the earliest time in the data sequence, until all log records have been output, to obtain the intermediate log stream. Here, the server from which the earliest log record originates is taken as the source server.
With the present invention, after the original logs are read from the cluster servers, the two processes of merging and sorting the logs are combined, the intermediate log stream obtained by merging and sorting is split to obtain the preprocessed logs, and the preprocessed logs are finally written to disk. This eliminates the generation of intermediate files and requires only a single read and a single write, which improves log processing efficiency. The invention thus solves the problem in the prior art that repeated read and write operations make web log preprocessing time-consuming, so that log processing is slow and inefficient. It completes the preprocessing of log data with a single read and a single write, reducing processing time and intermediate files, and thereby improves log processing efficiency.
Brief description of the drawings
The accompanying drawings described here are provided for further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic structural diagram of a web log preprocessing system according to an embodiment of the present invention;
Fig. 2 is a structural diagram of an optional system in which the preprocessing system is applied, according to an embodiment of the present invention;
Fig. 3 is a structural diagram of an optional preprocessing system according to an embodiment of the present invention;
Fig. 4 is a block diagram of the web log preprocessing system processing log data according to an embodiment of the present invention;
Fig. 5 is a flowchart of the web log preprocessing method according to an embodiment of the present invention; and
Fig. 6 is a schematic structural diagram of the web log preprocessing apparatus according to an embodiment of the present invention.
Detailed description of the embodiments
It should be noted that, provided they do not conflict, the embodiments in this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 is a schematic structural diagram of a web log preprocessing system according to an embodiment of the present invention. As shown in Fig. 1, the system may include multiple cluster servers 2 and one log processing server 1.
The multiple cluster servers 2 record the original logs.
The log processing server 1 is connected to the multiple cluster servers 2. It reads the original logs from the cluster servers and, after merging and sorting the original logs to obtain an intermediate log stream, splits the intermediate log stream to obtain preprocessed logs.
With the above system, after the log processing server reads the original logs from the cluster servers, the two processes of merging and sorting the logs are combined, and the intermediate log stream obtained by merging and sorting is split to obtain the preprocessed logs. This eliminates the generation of intermediate files and requires only a single read and a single write, which improves log processing efficiency. The system thus solves the problem in the prior art that repeated read and write operations make web log preprocessing time-consuming, so that log processing is slow and inefficient. It completes the preprocessing of log data with a single read and a single write, reducing processing time and intermediate files, and thereby improves log processing efficiency.
In this embodiment, the original logs may be log files stored on disk. The intermediate log stream may take the form of in-memory data and may hold one or more records. The preprocessed logs are likewise the result of data processing; in this embodiment, once the preprocessed logs are obtained in data form, they may be written to disk as files. Because data-stream reading and processing is used, the log processing server can split the intermediate log stream and write the resulting preprocessed logs to disk while it is still merging and sorting the original logs into the intermediate log stream. The whole process runs synchronously, intermediate data does not accumulate, and there is no need to write files and read them back.
In the embodiments shown in Fig. 2 and Fig. 3, a user (client) sends an access request to the load balancer, and the load balancer distributes the access request to a back-end cluster server (the cluster servers shown in the figures are illustrative only; their number is not limited to the three shown). The cluster server receives the access request and records the web log, generating the original log.
Then the log processing server reads the original logs in parallel from each cluster server, merges, sorts and splits the original logs read from the cluster servers to obtain the preprocessed logs, and writes the preprocessed logs to disk.
According to the above embodiment of the invention, the log processing server 1 may include: a reading device, connected to the multiple cluster servers, for reading the original logs from the cluster servers in parallel, line by line, as data streams, and saving all the log records read in parallel into a log set.
Specifically, as shown in Fig. 3, the log processing server reads the log files on the cluster servers over the network in parallel, line by line, as data streams, forming the log streams of the cluster servers (that is, the original logs). Because the log processing server and the cluster servers are on the same local area network, this reading is efficient and stable.
In this embodiment, the original logs of the cluster servers are read in parallel, line by line, as data streams. Every cluster server is read, that is, one log record is read from each cluster server at the same time, so only a small amount of memory is needed. Even if, for the sake of speed, for example 10 log records are read and buffered at a time, the memory footprint is still small compared with the size of the log files handled in the prior art. The space complexity is constant (namely the buffer size) and does not grow as the original log files grow, which greatly reduces the amount of data held, lowers resource usage and improves processing efficiency.
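The line-by-line streaming read described above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `read_log_lines`, the record layout (timestamp, user identifier, URL path) and the sample records are assumptions made for illustration, and in-memory `io.StringIO` objects stand in for the network streams of the cluster servers.

```python
import io
from typing import Iterator, List, TextIO


def read_log_lines(stream: TextIO) -> Iterator[str]:
    """Lazily yield one log record (one line of text) at a time."""
    for line in stream:
        record = line.rstrip("\n")
        if record:  # skip blank lines
            yield record


# Hypothetical stand-ins for the per-server log streams; real input would be
# files or network connections opened by the log processing server.
server_streams: List[TextIO] = [
    io.StringIO("2013-11-20T10:00:01 u1 /index\n2013-11-20T10:00:04 u2 /cart\n"),
    io.StringIO("2013-11-20T10:00:02 u1 /search\n"),
]

# One lazy reader per cluster server; nothing is read until a record is requested.
readers = [read_log_lines(s) for s in server_streams]

# The "log set": the current head record of every server, n records in total.
log_set = [next(r) for r in readers]
print(log_set)
```

Because each reader is a generator, memory use depends on the number of servers and the buffer size, not on the size of the log files, which matches the constant space complexity claimed in the text.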
In the above embodiment of the invention, the log processing server 1 may also include: a processor, connected to the reading device, for sorting the log records in the log set to obtain a data sequence, outputting the log record with the earliest time in the data sequence, then reading from the source server the log record that follows the earliest log record and adding it to the data sequence, and returning to the step of outputting the log record with the earliest time in the data sequence, until all log records have been output, to obtain the intermediate log stream. Here, the server from which the earliest log record originates is taken as the source server.
In this embodiment, the preprocessed logs are written to a disk file as they are obtained.
More specifically, the web log preprocessing system may be connected to an analysis processor. The analysis processor is connected to the log processing server and analyzes the preprocessed logs to obtain analysis results.
Specifically, during the step of returning to output the earliest original log record in the data sequence until all the original logs have been output and the intermediate log stream is obtained, the preprocessing method further includes: after all the data on the source server has been read, closing the log stream of the source server.
In the above embodiment, the original logs that are read form log streams. The original logs are merged and, at the same time, every log record in the original logs is sorted by time to obtain the data sequence. In this embodiment, because the logs on each cluster server are generated and recorded in chronological order, the logs recorded by each cluster server are themselves already sorted by time.
In a system using the preprocessing method of this embodiment, assume there are n cluster servers. The log processing server can then preprocess the logs as follows:
(1) Read one log record (that is, one line of text in the original log file) from the log file of each cluster server, obtaining an original log made up of n log records, where n is a natural number.
(2) Sort these n log records by time to obtain the data sequence.
(3) Output the log record with the earliest time in the data sequence to the intermediate log stream. Assuming the log record from the i-th cluster server is the earliest, that record is sent to the intermediate log stream, and the n-1 log records remaining in the data sequence become the remaining sequence.
(4) Read the next log record, iNext, from the i-th server and add it to the remaining sequence made up of the n-1 remaining log records, so that the remaining sequence again becomes a data sequence of n log records. Sort it in the same way and output the earliest log record.
Specifically, because the n-1 log records remaining after step (3) are already sorted, the sort in step (4) does not need to re-sort all n log records.
More specifically, a binary search algorithm can be used to insert the next log record iNext into the remaining sequence. With binary search, the time complexity of locating the insertion position is reduced to O(log n).
(5) Repeat step (4) until the original log on that cluster server has been read and output completely.
Specifically, while performing step (5), if the original log of a cluster server has been read completely, the log stream in the sorting module that corresponds to that cluster server is closed and removed, and step (4) continues to be repeated for the original logs on the remaining servers until the original logs of all servers have been read completely. The intermediate log stream that is finally output thus achieves both purposes: merging and sorting.
Specifically, a log stream channel is formed when the original log on each cluster server is read. After the original log on a cluster server has been read completely, the corresponding log stream channel is closed.
Closing and removing a log stream in the sorting module does not delete the original log file; for security reasons, original logs are usually archived and backed up. In this embodiment, because the original logs recorded on the cluster servers are themselves in time order, and the log processing server reads the original logs from the cluster servers one record at a time, the time order of the log records on each cluster server is not disturbed. This not only saves the time of repeated reading and repeated sorting, but also exploits the time order inherent in the original logs on the cluster servers, saving the time of sorting on every read. In addition, merging and sorting the log data are combined into one process, and the time complexity of each insertion search is reduced to O(log n), which greatly improves computational efficiency.
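Steps (1) to (5) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the names `merge_sorted_logs` and `timestamp_of` are hypothetical, and each record is assumed to start with an ISO 8601 timestamp followed by a space, so that comparing timestamp strings gives chronological order. The binary-search insertion uses the standard library function `bisect.insort`.

```python
import bisect
from typing import Iterable, Iterator, List, Tuple


def timestamp_of(record: str) -> str:
    """Assumed record layout: '<ISO timestamp> <user id> <rest>'."""
    return record.split(" ", 1)[0]


def merge_sorted_logs(streams: List[Iterable[str]]) -> Iterator[str]:
    """Merge per-server streams, each already in time order, into one ordered stream."""
    iters = [iter(s) for s in streams]
    window: List[Tuple[str, int, str]] = []  # (timestamp, server index, record)

    # Steps (1)-(2): take one record from every server and sort the n records.
    for i, it in enumerate(iters):
        rec = next(it, None)
        if rec is not None:
            window.append((timestamp_of(rec), i, rec))
    window.sort()

    while window:
        # Step (3): output the earliest record; server i is its source server.
        _, i, rec = window.pop(0)
        yield rec
        # Step (4): refill from the same server, using binary search to find
        # the insertion point in the already sorted remaining sequence.
        nxt = next(iters[i], None)
        if nxt is not None:
            bisect.insort(window, (timestamp_of(nxt), i, nxt))
        # Otherwise server i is exhausted and simply drops out of the window,
        # which corresponds to closing and removing its log stream.


server_a = ["2013-11-20T10:00:01 u1 /a", "2013-11-20T10:00:04 u2 /b"]
server_b = ["2013-11-20T10:00:02 u1 /c", "2013-11-20T10:00:03 u3 /d"]
merged = list(merge_sorted_logs([server_a, server_b]))
```

The window never holds more than n entries, one per server. Note that `bisect.insort` finds the position in O(log n) comparisons, while the list insertion and `pop(0)` still shift up to n elements; with n equal to the number of servers, this is usually negligible.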
In the above embodiment of the invention, after obtaining the user identifiers in the intermediate log stream, the log processing server 1 may split the intermediate log stream according to the user identifiers to obtain the preprocessed logs.
Specifically, the intermediate log stream may be split according to user identifier, and the preprocessed logs after splitting written to multiple preprocessing result files, each result file containing the records that share the same user identifier. These result files can be written to a database for use in subsequent log statistics and analysis.
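The splitting step can be sketched as a single pass over the intermediate log stream that appends each record to the result file of its user. The sketch below is an assumption-laden illustration: `split_by_user` and `user_id_of` are hypothetical names, the user identifier is assumed to be the second space-separated field, and user identifiers are assumed to be safe to use as file names.

```python
import os
import tempfile
from typing import Dict, Iterable, TextIO


def user_id_of(record: str) -> str:
    """Assumed record layout: '<timestamp> <user id> <rest>'."""
    return record.split(" ", 2)[1]


def split_by_user(records: Iterable[str], out_dir: str) -> Dict[str, str]:
    """Write each record to the result file of its user; return user id -> path."""
    handles: Dict[str, TextIO] = {}
    paths: Dict[str, str] = {}
    try:
        for rec in records:
            uid = user_id_of(rec)
            if uid not in handles:
                # First record for this user: open its result file once.
                paths[uid] = os.path.join(out_dir, f"{uid}.log")
                handles[uid] = open(paths[uid], "w", encoding="utf-8")
            handles[uid].write(rec + "\n")
    finally:
        for fh in handles.values():
            fh.close()
    return paths


intermediate_stream = [
    "2013-11-20T10:00:01 u1 /a",
    "2013-11-20T10:00:02 u2 /b",
    "2013-11-20T10:00:03 u1 /c",
]
result_paths = split_by_user(intermediate_stream, tempfile.mkdtemp())
```

Because the input is already time-ordered, every result file is time-ordered as well, with no further sorting. A production version would need to bound the number of simultaneously open files when there are many users; the sketch ignores that.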
In the above embodiment of the invention, the whole preprocessing process requires only a single file read (reading the original log files) and a single file write (writing the preprocessed log files). Because the volume of original log data is very large and the preprocessed log files are numerous, reading and writing take up most of the total preprocessing time. Reducing the number of file reads and writes saves preprocessing time and improves preprocessing efficiency. In this embodiment, the original log and the preprocessed log may each be a single file.
Fig. 5 is a flowchart of the web log preprocessing method according to an embodiment of the present invention. As shown in Fig. 5, the method may include the following steps:
Step S102: read the original logs from the cluster servers.
Step S104: merge and sort the original logs to obtain an intermediate log stream.
Step S106: split the intermediate log stream to obtain the preprocessed logs.
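How steps S102 to S106 chain together without intermediate files can be sketched with Python generators. Everything here is illustrative: the function names are hypothetical, records are assumed to begin with a sortable timestamp followed by a user identifier, in-memory lists stand in for the servers and the output files, and the merge uses a plain linear scan for the earliest head record rather than the binary-search insertion described in the embodiment.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def read_original_logs(sources: List[Iterable[str]]) -> List[Iterator[str]]:
    """S102: open one lazy record stream per cluster server."""
    return [iter(src) for src in sources]


def merge_and_sort(streams: List[Iterator[str]]) -> Iterator[str]:
    """S104: repeatedly emit the earliest head record (simplified linear scan)."""
    heads: List[Optional[str]] = [next(s, None) for s in streams]
    while any(h is not None for h in heads):
        live = [k for k, h in enumerate(heads) if h is not None]
        # Records start with the timestamp, so string order is time order.
        i = min(live, key=lambda k: heads[k])
        yield heads[i]
        heads[i] = next(streams[i], None)  # refill from the source server


def split_stream(merged: Iterable[str]) -> Dict[str, List[str]]:
    """S106: group records by user id (second field, an assumed layout)."""
    result: Dict[str, List[str]] = {}
    for rec in merged:
        result.setdefault(rec.split(" ", 2)[1], []).append(rec)
    return result


sources = [
    ["2013-11-20T10:00:01 u1 /a", "2013-11-20T10:00:04 u2 /b"],
    ["2013-11-20T10:00:02 u1 /c"],
]
preprocessed = split_stream(merge_and_sort(read_original_logs(sources)))
```

Each record flows from its source through the merge and into its user's bucket before the next record is pulled, so no merged or sorted intermediate result is ever materialized as a whole.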
With the present invention, after the original logs are read from the cluster servers, the two processes of merging and sorting the logs are combined, the intermediate log stream obtained by merging and sorting is split to obtain the preprocessed logs, and the preprocessed logs are finally written to disk. This eliminates the generation of intermediate files and requires only a single read and a single write, which improves log processing efficiency. The invention thus solves the problem in the prior art that repeated read and write operations make web log preprocessing time-consuming, so that log processing is slow and inefficient. It completes the preprocessing of log data with a single read and a single write, reducing processing time and intermediate files, and thereby improves log processing efficiency.
In the embodiments shown in Fig. 2 and Fig. 3, a user sends an access request through the load balancer, and the load balancer distributes the request to a back-end cluster server (the cluster servers shown in the figures are illustrative only; their number is not limited to the three shown). The cluster server receives the access request and records the web log, generating the original log.
Then the log processing server reads the original logs on each cluster server, merges, sorts and splits the original logs read from the cluster servers, outputs the resulting preprocessed logs, and writes the preprocessed logs to disk.
In the above embodiment of the invention, the step of reading original logs from the cluster servers may include: reading the original logs from the cluster servers in parallel, line by line, as data streams; and saving all the log records read in parallel into a log set.
Specifically, as shown in Fig. 3, the log processing server reads the log files on the cluster servers over the network in parallel, line by line, as data streams, forming the log streams of the cluster servers (that is, the original logs). Because the log processing server and the cluster servers are on the same local area network, this reading is efficient and stable.
In this embodiment, the original logs may come from log files stored on disk. The intermediate log stream may take the form of in-memory data and may hold one or more records. The preprocessed logs are likewise the result of data processing; in this embodiment, once the preprocessed logs are obtained in data form, they may be written to disk as files. Because data-stream reading and processing is used, the log processing server can split the intermediate log stream and write the resulting preprocessed logs to disk while it is still merging and sorting the original logs into the intermediate log stream. The whole process runs synchronously, intermediate data does not accumulate, and there is no need to write files and read them back.
According to the above embodiment of the invention, the step of merging and sorting the original logs to obtain the intermediate log stream includes: sorting the log records in the log set to obtain a data sequence; outputting the log record with the earliest time in the data sequence; reading from the source server the log record that follows the earliest log record and adding it to the data sequence; and returning to the step of outputting the log record with the earliest time in the data sequence, until all log records have been output, to obtain the intermediate log stream. Here, the server from which the earliest log record originates is taken as the source server.
Specifically, the step of reading from the source server the log record that follows the earliest log record and adding it to the data sequence may include: after all the log records on the source server have been read, closing the log stream of the source server.
In the above embodiment, all the original logs that are read form log streams. All the original logs are merged and, at the same time, every log record in the log streams is sorted by time to obtain the data sequence. In this embodiment, because the original logs on each cluster server are generated and recorded in chronological order, the logs recorded by each cluster server are themselves already sorted by time.
In a system using the preprocessing method of this embodiment, assume there are n cluster servers. The above embodiment can then be implemented as follows:
(1) Read one log record (that is, one line of text in the original log file) from the log file of each cluster server, obtaining an original log made up of n log records, where n is a natural number.
(2) Sort these n log records by time to obtain the data sequence.
(3) Output the log record with the earliest time in the data sequence to the intermediate log stream. Assuming the log record from the i-th cluster server is the earliest, that record is sent to the intermediate log stream, and the n-1 log records remaining in the data sequence become the remaining sequence.
(4) Read the next log record, iNext, from the i-th server and add it to the remaining sequence made up of the n-1 remaining log records, so that the remaining sequence again becomes a data sequence of n log records. Sort it in the same way and output the earliest log record.
Specifically, because the n-1 log records remaining after step (3) are already sorted, the sort in step (4) does not need to re-sort all n log records.
More specifically, a binary search algorithm can be used to insert the next log record iNext into the remaining sequence. With binary search, the time complexity of locating the insertion position is reduced to O(log n).
(5) Repeat step (4) until the original log on that cluster server has been read and output completely.
Specifically, while performing step (5), if the original log of a cluster server has been read completely, the log stream in the sorting module that corresponds to that cluster server is closed and removed, and step (4) continues to be repeated for the original logs on the remaining servers until the original logs of all servers have been read completely. The intermediate log stream that is finally output thus achieves both purposes: merging and sorting.
Closing and removing a log stream in the sorting module does not delete the original log file; for security reasons, original logs are usually archived and backed up. In this embodiment, because the original logs recorded on the cluster servers are themselves in time order, and the log processing server reads the original logs from the cluster servers one record at a time, the time order of the log records on each cluster server is not disturbed. This not only saves the time of repeated reading and repeated sorting, but also exploits the time order inherent in the original logs on the cluster servers, saving the time of sorting on every read. In addition, merging and sorting the log data are combined into one process, and the time complexity of each insertion search is reduced to O(log n), which greatly improves computational efficiency.
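For comparison, the same merge of pre-sorted streams can be expressed with the Python standard library function `heapq.merge`, which keeps the head record of each input in a binary heap instead of a sorted list with binary-search insertion. This is a swapped-in technique, not the method described in the patent, and it is shown only because it yields the same time-ordered output with O(log n) work per record. The wrapper name `merge_with_heap` and the record layout (leading ISO 8601 timestamp) are assumptions.

```python
import heapq
from typing import Iterable, Iterator, List


def merge_with_heap(streams: List[Iterable[str]]) -> Iterator[str]:
    """Lazily merge time-ordered streams; the key is the leading timestamp."""
    return heapq.merge(*streams, key=lambda rec: rec.split(" ", 1)[0])


log_a = ["2013-11-20T10:00:01 u1 /a", "2013-11-20T10:00:04 u2 /b"]
log_b = ["2013-11-20T10:00:02 u1 /c", "2013-11-20T10:00:03 u3 /d"]
ordered = list(merge_with_heap([log_a, log_b]))
```

Like the steps above, `heapq.merge` returns a lazy iterator, assumes each input is already sorted, and drops an input once it is exhausted.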
In the above embodiment of the invention, the step of splitting the intermediate log stream to obtain preprocessed logs may include: obtaining the user identifiers in the intermediate log stream; and splitting the intermediate log stream according to the user identifiers to obtain the preprocessed logs.
Specifically, the intermediate log stream may be split according to user identifier, and the preprocessed logs after splitting written to multiple preprocessing result files, each result file containing the records that share the same user identifier. These result files can be written to a database for use in subsequent log statistics and analysis.
In the above embodiment of the invention, the whole preprocessing process requires only a single file read (reading the original logs) and a single file write (writing the preprocessed logs). Because the volume of original log data is very large and the preprocessed target log files are numerous, reading and writing take up most of the total preprocessing time. Reducing the number of file reads and writes saves preprocessing time and improves preprocessing efficiency.
It should be noted that the steps illustrated in the flowcharts of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions; and although a logical order is shown in the flowcharts, in some cases the steps shown or described can be performed in an order different from the one given here.
Fig. 6 is a structural diagram of the web log preprocessing apparatus according to an embodiment of the present invention. As shown in Fig. 6, the apparatus may include: a first read module 10 for reading original logs from the cluster servers; a merge-sort module 30 for merging and sorting the original logs to obtain an intermediate log stream; and a splitting module 50 for splitting the intermediate log stream to obtain preprocessed logs.
With the present invention, after the original logs are read from the cluster servers, log merging and sorting are combined into one process; the intermediate log stream produced by merging and sorting is split to obtain the preprocessed logs, which are finally written to disk. No intermediate files are generated and only a single read and a single write are needed, which makes consolidating the logs more efficient. This solves the prior-art problem that multiple read and write operations make web log preprocessing time-consuming, so that log processing is slow and inefficient. Preprocessing of the log data is completed with a single read and write, reducing processing time and the number of intermediate files and thereby improving log processing efficiency.
As in the embodiments shown in Fig. 2 and Fig. 3, the apparatus of the above embodiment can be deployed on the log integrity server. A user (that is, a client) sends an access request through a load balancer, which distributes the request to the back-end cluster servers (the cluster servers shown in the figures are only illustrative, and their number is not limited to the three shown). A cluster server receives the access request and records the web access, generating an original log.
Then the first read module of the log integrity server reads the original logs from each cluster server; the logs read from the cluster servers are merged, sorted and split to output the preprocessed logs, and the preprocessed logs are then written to disk.
According to the above embodiment of the present invention, the first read module 10 may include: a parallel read module for reading the original logs from the cluster servers in parallel, line by line, in data-stream mode; and a saving module for saving all log data read in parallel into a log set.
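One way to picture the parallel, line-by-line read is the following Python sketch. It rests on assumptions that are not in the patent text: each cluster server's log is represented by a line iterator (here an in-memory io.StringIO), the function name read_heads is hypothetical, and a thread pool stands in for whatever parallel mechanism an actual system would use.

```python
import io
from concurrent.futures import ThreadPoolExecutor

def read_heads(streams):
    """Read one line from every stream concurrently and collect a log set.

    streams -- list of line iterators, one per cluster server (assumed)
    Returns a list of (line, server_index) pairs; exhausted streams are skipped.
    """
    def head(indexed):
        index, stream = indexed
        line = next(stream, None)          # read a single line, nothing more
        return None if line is None else (line.rstrip("\n"), index)

    with ThreadPoolExecutor(max_workers=max(1, len(streams))) as pool:
        heads = list(pool.map(head, enumerate(streams)))
    return [h for h in heads if h is not None]

# Usage: two in-memory "servers"; only the first line of each is consumed.
servers = [io.StringIO("10:00:02 /a\n10:00:05 /b\n"), io.StringIO("10:00:01 /c\n")]
log_set = read_heads([iter(s) for s in servers])
print(log_set)
```

Only one line per server is held in memory at a time, which is the point of reading in data-stream mode rather than loading whole files.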
In the above embodiment of the present invention, the merge-sort module 30 may include: a sorting module for sorting the log data in the log set to obtain a data sequence; an output module for outputting the log data with the earliest time in the data sequence; a supplement module for adding to the data sequence the next log data, read from the source server, that follows the log data with the earliest time; and a return-execution module for returning to the step of outputting the log data with the earliest time in the data sequence until all log data have been output, yielding the intermediate log stream; wherein the server from which the log data with the earliest time originated is taken as the source server.
Specifically, the supplement module may include: a closing module for closing the log stream of the source server after all log data on the source server have been read.
In the above embodiment, all the original logs that are read form log streams; the log data are merged and, at the same time, the log data in the log streams are sorted by time to obtain the data sequence. In this embodiment, because the logs on every cluster server are generated and recorded in chronological order, the log of each cluster server is already sorted by time.
In a system using the preprocessing method of this embodiment, assume there are n cluster servers; the above embodiment is then implemented as follows:
(1) Read one log entry (that is, one line of text in the original log file) from the log file of each cluster server, obtaining original logs made up of n log entries, where n is a natural number.
(2) Sort these n log entries by time to obtain a data sequence.
(3) Output the log entry with the earliest time in the data sequence to the intermediate log stream. Assuming the entry from the i-th cluster server has the earliest time, that entry is sent to the intermediate log stream, and the n-1 entries remaining in the data sequence form a residual sequence.
(4) Read the next log entry, iNext, from the i-th server and add it to the residual sequence of n-1 entries, so that the sequence again contains n log entries; then sort in the same way and output the entry with the earliest time.
Specifically, because the n-1 entries remaining after step (3) are already sorted, the n entries do not need to be fully re-sorted when sorting in step (4).
More specifically, a binary search can be used to insert the next log entry iNext into the residual sequence; with binary search, the time complexity of finding the insertion position is O(log n).
(5) Repeat step (4) until the original logs on the cluster servers have been completely read and output.
Specifically, while step (5) is being performed, if the original log of some cluster server has been read to the end, the log stream for that server is closed and removed from the sorting module, and step (4) continues to be repeated for the original logs on the remaining servers until the original logs of all servers have been read. The intermediate log stream that is finally output thus achieves both merging and sorting.
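Steps (1) to (5) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name merge_logs is hypothetical, each stream is assumed to be an iterable of log lines already in time order, and each line is assumed to begin with a fixed-width timestamp so that comparing lines as strings matches comparing their times.

```python
import bisect

def merge_logs(streams):
    """Yield entries from several time-ordered streams in global time order."""
    iters = [iter(s) for s in streams]

    # Steps (1)-(2): take one entry from each server and sort the n entries.
    seq = sorted(
        (entry, i)
        for i, it in enumerate(iters)
        for entry in [next(it, None)]
        if entry is not None
    )

    while seq:
        # Step (3): output the earliest entry; remember which server it came from.
        entry, i = seq.pop(0)
        yield entry

        # Step (4): read the next entry (iNext) from that same server and place
        # it into the already sorted residual sequence by binary search.
        i_next = next(iters[i], None)
        if i_next is not None:
            bisect.insort(seq, (i_next, i))
        # Otherwise server i is exhausted: its stream simply drops out of the
        # sequence, and the loop continues with the remaining servers (step (5)).

# Usage with three in-memory "servers", one of them empty.
a = ["10:00:01 s1", "10:00:04 s1"]
b = ["10:00:02 s2", "10:00:03 s2", "10:00:06 s2"]
for line in merge_logs([a, b, []]):
    print(line)
```

The sorted list keeps the sketch close to the description. `bisect.insort` locates the insertion point with O(log n) comparisons, although a Python list still shifts elements on insertion and on `pop(0)`; a heap would avoid that shifting at the cost of departing from the wording of the steps.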
Closing and removing a log stream in the sorting module does not delete the original log file; for security reasons, original logs are usually archived and backed up. In this embodiment, the original logs recorded on each cluster server are already in chronological order, and the log integrity server reads them from the cluster servers entry by entry, so the time order of the log data on each cluster server is preserved. This not only avoids repeated reads and repeated sorting, but also exploits the inherent time order of the original logs on the cluster servers, saving the sorting time for each read. In addition, merging and sorting the log data are combined into a single process, and the time complexity of placing each newly read entry is reduced to O(log n), which greatly improves computational efficiency.
In the above embodiment of the present invention, the splitting module 50 may include: an acquisition module for obtaining the user identifiers in the intermediate log stream; and a splitting submodule for splitting the intermediate log stream according to the user identifiers to obtain the preprocessed logs.
The closing module, acquisition module and splitting submodule described above correspond respectively to the steps of deleting the original log and obtaining the preprocessed logs in the method embodiment above. The examples and application scenarios realized by these three modules are the same as those of the corresponding steps, but are not limited to what the method embodiment discloses. The closing module, acquisition module and splitting submodule run on a terminal and can be implemented in software or hardware.
Using the above embodiment of the present invention, preferably a single log entry can be read from a cluster server, merged and sorted to obtain the intermediate log stream (which in this embodiment contains only one entry), and then split and written to disk; no "intermediate file" is produced in this process.
As can be seen from the above description, the present invention achieves the following technical effects: log merging and sorting are combined into a single process, and the time complexity of placing each newly read entry is reduced to O(log n), improving computational efficiency; logs are read line by line in data-stream mode, so space complexity is of constant order, reducing resource usage; and the logs are first merged and sorted and then split, with data passed between the servers as data streams and no intermediate files generated, which saves the time cost of reading and writing files.
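The single-pass character of the method (merge and sort first, then split, with no intermediate file) can be sketched end to end in Python. The sketch swaps in the standard-library function heapq.merge, which is heap-based, in place of the binary-search insertion described in the embodiment; the function name preprocess, the callback open_result, and the line layout "timestamp user-identifier path" are hypothetical assumptions.

```python
import heapq
import io

def preprocess(streams, open_result):
    """Merge time-ordered streams and split the result by user in one pass.

    streams     -- iterables of log lines, each already in time order (assumed)
    open_result -- function returning a writable file object for a user (assumed)
    """
    results = {}
    # heapq.merge is lazy: it holds only one pending line per stream, so the
    # merged stream is never materialised as an intermediate file or list.
    merged = heapq.merge(*streams, key=lambda line: line.split()[0])
    for line in merged:
        uid = line.split()[1]              # assumed position of the user identifier
        if uid not in results:
            results[uid] = open_result(uid)
        results[uid].write(line + "\n")
    return results

# Usage with in-memory buffers standing in for result files.
s1 = ["10:00:01 u1 /home", "10:00:04 u2 /cart"]
s2 = ["10:00:02 u2 /home", "10:00:03 u1 /pay"]
out = preprocess([s1, s2], lambda uid: io.StringIO())
print(out["u1"].getvalue(), end="")
```

Each original line is read once and written once, which mirrors the single read and single write that the description credits for the time savings.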
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; or they can each be made into a separate integrated-circuit module, or several of the modules or steps can be made into a single integrated-circuit module. The present invention is thus not restricted to any specific combination of hardware and software.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (7)

  1. A method for preprocessing web logs, characterized by comprising:
    reading original logs from cluster servers;
    merging and sorting the original logs to obtain an intermediate log stream;
    splitting the intermediate log stream to obtain preprocessed logs;
    wherein the step of reading original logs from the cluster servers comprises: reading the original logs from the cluster servers in parallel, line by line, in data-stream mode; and saving all log data read in parallel into a log set;
    the step of merging and sorting the original logs to obtain the intermediate log stream comprises: sorting the log data in the log set to obtain a data sequence; outputting the log data with the earliest time in the data sequence; adding to the data sequence the next log data, read from a source server, that follows the log data with the earliest time; and returning to the step of outputting the log data with the earliest time in the data sequence until all the log data have been output, thereby obtaining the intermediate log stream; wherein the server from which the log data with the earliest time originated is taken as the source server.
  2. The preprocessing method according to claim 1, characterized in that the step of adding to the data sequence the next log data, read from the source server, that follows the log data with the earliest time further comprises:
    after all the log data on the source server have been read, closing the log stream of the source server.
  3. The preprocessing method according to any one of claims 1 to 2, characterized in that the step of splitting the intermediate log stream to obtain the preprocessed logs comprises:
    obtaining the user identifiers in the intermediate log stream;
    splitting the intermediate log stream according to the user identifiers to obtain the preprocessed logs.
  4. An apparatus for preprocessing web logs, characterized by comprising:
    a first read module for reading original logs from cluster servers;
    a merge-sort module for merging and sorting the original logs to obtain an intermediate log stream;
    a splitting module for splitting the intermediate log stream to obtain preprocessed logs;
    wherein the first read module comprises: a parallel read module for reading the original logs from the cluster servers in parallel, line by line, in data-stream mode; and a saving module for saving all log data read in parallel into a log set;
    the merge-sort module comprises: a sorting module for sorting the log data in the log set to obtain a data sequence; an output module for outputting the log data with the earliest time in the data sequence; a supplement module for adding to the data sequence the next log data, read from a source server, that follows the log data with the earliest time; and a return-execution module for returning to the step of outputting the log data with the earliest time in the data sequence until all the log data have been output, thereby obtaining the intermediate log stream; wherein the server from which the log data with the earliest time originated is taken as the source server.
  5. The preprocessing apparatus according to claim 4, characterized in that the supplement module comprises:
    a closing module for closing the log stream of the source server after all the log data on the source server have been read.
  6. The preprocessing apparatus according to claim 5, characterized in that the splitting module comprises:
    an acquisition module for obtaining the user identifiers in the intermediate log stream;
    a splitting submodule for splitting the intermediate log stream according to the user identifiers to obtain the preprocessed logs.
  7. A system for preprocessing web logs, characterized by comprising:
    multiple cluster servers;
    a log integrity server connected to the multiple cluster servers, for reading original logs from the cluster servers and, after merging and sorting the original logs to obtain an intermediate log stream, splitting the intermediate log stream to obtain preprocessed logs;
    wherein the log integrity server comprises: a reading device connected to the multiple cluster servers, for reading the original logs from the cluster servers in parallel, line by line, in data-stream mode, and saving all log data read in parallel into a log set;
    the log integrity server further comprises: a processor connected to the reading device, for sorting the log data in the log set to obtain a data sequence, outputting the log data with the earliest time in the data sequence, then adding to the data sequence the next log data, read from a source server, that follows the log data with the earliest time, and returning to the step of outputting the log data with the earliest time in the data sequence until all the log data have been output, thereby obtaining the intermediate log stream; wherein the server from which the log data with the earliest time originated is taken as the source server.
CN201310591082.7A 2013-11-20 2013-11-20 Preprocess method, the apparatus and system of web log Active CN103595571B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310591082.7A CN103595571B (en) 2013-11-20 2013-11-20 Preprocess method, the apparatus and system of web log

Publications (2)

Publication Number Publication Date
CN103595571A CN103595571A (en) 2014-02-19
CN103595571B true CN103595571B (en) 2018-02-02

Family

ID=50085562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310591082.7A Active CN103595571B (en) 2013-11-20 2013-11-20 Preprocess method, the apparatus and system of web log

Country Status (1)

Country Link
CN (1) CN103595571B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391954B (en) * 2014-11-27 2019-04-09 北京国双科技有限公司 The processing method and processing device of database journal
CN106230618A (en) * 2016-07-21 2016-12-14 柳州龙辉科技有限公司 A kind of system journal centralized processing system
CN107480277B (en) * 2017-08-22 2021-01-26 北京京东尚科信息技术有限公司 Method and device for collecting website logs
CN107729375B (en) * 2017-09-13 2021-11-23 微梦创科网络科技(中国)有限公司 Log data sorting method and device
CN108228797A (en) * 2017-12-29 2018-06-29 上海全成通信技术有限公司 A kind of high efficiency, low cost processing method of massive logs data
CN108363649B (en) * 2017-12-29 2021-04-16 微梦创科网络科技(中国)有限公司 Distributed log access amount statistical method and device
CN109255069A (en) * 2018-07-31 2019-01-22 阿里巴巴集团控股有限公司 A kind of discrete text content risks recognition methods and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5084815A (en) * 1986-05-14 1992-01-28 At&T Bell Laboratories Sorting and merging of files in a multiprocessor
CN101192227A (en) * 2006-11-30 2008-06-04 阿里巴巴公司 Log file analytical method and system based on distributed type computing network
CN102830950A (en) * 2012-08-03 2012-12-19 苏州迈科网络安全技术股份有限公司 Method and system for sorting monitoring data
CN102831181A (en) * 2012-07-31 2012-12-19 北京光泽时代通信技术有限公司 Directory refreshing method for cache files and caching proxy server for implementing directory refreshing method
CN102968496A (en) * 2012-12-04 2013-03-13 天津神舟通用数据技术有限公司 Parallel sequencing method based on task derivation and double buffering mechanism
CN103178982A (en) * 2011-12-23 2013-06-26 阿里巴巴集团控股有限公司 Method and device for analyzing log

Also Published As

Publication number Publication date
CN103595571A (en) 2014-02-19

Similar Documents

Publication Publication Date Title
CN103595571B (en) Preprocess method, the apparatus and system of web log
CN109254733B (en) Method, device and system for storing data
CN104391954B (en) The processing method and processing device of database journal
CN106528717B (en) Data processing method and system
US20150378619A1 (en) Storage system, recording medium for storing control program and control method for storage system
CN103955530B (en) Data reconstruction and optimization method of on-line repeating data deletion system
CN110674154B (en) Spark-based method for inserting, updating and deleting data in Hive
US20140101167A1 (en) Creation of Inverted Index System, and Data Processing Method and Apparatus
CN102456076A (en) Massive fragment data aggregation system and method
US8019765B2 (en) Identifying files associated with a workflow
CN106126352A (en) The asynchronous method and device reporting and submitting information
CN107391544A (en) Processing method, device, equipment and the computer storage media of column data storage
Zhou et al. Improving big data storage performance in hybrid environment
CN105068875A (en) Intelligence data processing method and apparatus
CN113918532A (en) Portrait label aggregation method, electronic device and storage medium
CN106909623B (en) A kind of data set and date storage method for supporting efficient mass data to analyze and retrieve
CN103236938A (en) Method and system for user action collection based on cache memory and asynchronous processing technology
CN113032621A (en) Data sampling method and device, computer equipment and storage medium
CN112037874A (en) Distributed data processing method based on mapping reduction
Wu et al. Streaming Approach to In Situ Selection of Key Time Steps for Time‐Varying Volume Data
CN109600413A (en) A kind of data management and transmission method based on high-energy physics example
US11907531B2 (en) Optimizing storage-related costs with compression in a multi-tiered storage device
Palit et al. Exploratory Research on Developing Hadoop-based Data Analytics Tools
Li et al. EStore: An effective optimized data placement structure for Hive
Abhari Web object-based policies for managing proxy caches

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
  Denomination of invention: Preprocessing method, device and system for website access logs
  Effective date of registration: 20190531
  Granted publication date: 20180202
  Pledgee: Shenzhen Black Horse World Investment Consulting Co., Ltd.
  Pledgor: Beijing Guoshuang Technology Co.,Ltd.
  Registration number: 2019990000503
CP02 Change in the address of a patent holder
  Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing
  Patentee after: Beijing Guoshuang Technology Co.,Ltd.
  Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A
  Patentee before: Beijing Guoshuang Technology Co.,Ltd.