Detailed Description of the Embodiments
It should be noted that, provided they do not conflict, the embodiments of the present application and the features in those embodiments may be combined with one another. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 is a schematic structural diagram of a web log preprocessing system according to an embodiment of the present invention. As shown in Fig. 1, the system may include a plurality of cluster servers 2 and a log preprocessing server 1.
The plurality of cluster servers 2 are configured to record original logs.
The log preprocessing server 1 is connected to the plurality of cluster servers 2 and is configured to read the original logs from the cluster servers, merge and sort the original logs to obtain an intermediate log stream, and then split the intermediate log stream to obtain preprocessed logs.
With the above system, after the log preprocessing server reads the original logs from the cluster servers, log merging and log sorting are combined into a single process, and the intermediate log stream obtained by merging and sorting is split to obtain the preprocessed logs. No intermediate file is generated, and only a single file read and a single file write are required, which improves the efficiency of log preprocessing. This solves the problem in the prior art that repeated read and write operations make web log preprocessing time-consuming, so that log processing is slow and inefficient. Log data is preprocessed with a single read and a single write, which reduces both the processing time and the number of intermediate files and thereby improves log processing efficiency.
In this embodiment, the original log may be a log file stored on disk. The intermediate log stream may take the form of data and may hold one or more log entries. The preprocessed log is likewise the result of data processing; in this embodiment, once the preprocessed log has been obtained in the form of data, it may be written to disk as a file. Because the data is read and processed as a stream, the log preprocessing server can split the intermediate log stream into preprocessed logs and write them to disk while it is still merging and sorting the original logs into the intermediate log stream. The steps of the whole process run in step with one another, intermediate data does not accumulate, and nothing needs to be written to a file and read back.
In the embodiment shown in Fig. 2 and Fig. 3, a user (client) sends an access request to a load balancer, and the load balancer distributes the access request to the back-end cluster servers (the cluster servers shown in the figures are only an example, and their number is not limited to the three shown). A cluster server receives the access request and records a web log entry, thereby generating the original log.
The log preprocessing server then reads the original logs from the cluster servers in parallel, merges, sorts, and splits the original logs read from the cluster servers to obtain the preprocessed logs, and writes the preprocessed logs to disk.
According to the above embodiment of the present invention, the log preprocessing server 1 may include a reading device that is connected to the plurality of cluster servers and is configured to read the original logs from the cluster servers in parallel, line by line, over stream sockets, and to store all of the log entries read in parallel in a log set.
Specifically, as shown in Fig. 3, the log preprocessing server reads the log files on the cluster servers over the network in parallel, line by line, using stream sockets, forming one log stream (that is, the original log) per cluster server. Because the log preprocessing server and the cluster servers are on the same local area network, this reading is efficient and stable.
In this embodiment, the original logs of the cluster servers are read in parallel, line by line, as data streams: one log entry is read from every cluster server at a time, so only a small amount of memory is needed. Even if, for the sake of speed, several entries (for example, 10) are read and buffered at a time, the memory footprint remains very small compared with the size of the log files handled in the prior art. The space complexity is constant (the buffer size) and does not grow as the original log files grow, which greatly reduces the amount of data held in memory, lowers resource usage, and improves processing efficiency.
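As an illustration of line-by-line reading over a stream socket, the following is a minimal sketch rather than the implementation of the embodiment. The function name `read_lines`, the sample log format, and the use of a local socket pair in place of a real cluster server are all assumptions made for the example.

```python
import socket
from typing import Iterator


def read_lines(sock: socket.socket, encoding: str = "utf-8") -> Iterator[str]:
    """Yield one log line at a time from a connected stream socket."""
    # makefile() wraps the socket in a buffered text stream, so only the
    # current buffer is held in memory, never the whole log file.
    with sock.makefile("r", encoding=encoding, newline="\n") as stream:
        for line in stream:
            yield line.rstrip("\n")


# Demonstration: a local socket pair stands in for one cluster server
# (hypothetical data; a real deployment would connect over the LAN).
server_side, reader_side = socket.socketpair()
server_side.sendall(
    b"2024-01-01T00:00:01 u1 /a\n"
    b"2024-01-01T00:00:03 u2 /b\n"
)
server_side.close()  # end of log: the reader sees end-of-file
lines = list(read_lines(reader_side))
reader_side.close()
```

Because `read_lines` is a generator, the caller decides how many entries to pull at a time, which matches the constant-size buffer described above.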
In the above embodiment of the present invention, the log preprocessing server 1 may further include a processor that is connected to the reading device and is configured to: sort the log entries in the log set to obtain a data sequence; output the log entry with the earliest time in the data sequence; read from the source server the entry that follows the earliest entry and fill it into the data sequence; and return to the step of outputting the earliest entry in the data sequence, until all log entries have been output and the intermediate log stream is obtained. Here, the source server is the server from which the earliest log entry originated.
In this embodiment, once the preprocessed log has been obtained, it is written to a file on disk.
More specifically, the web log preprocessing system may be connected to an analysis processor. The analysis processor is connected to the log preprocessing server and is configured to analyze the preprocessed logs to obtain an analysis result.
Specifically, while the step of outputting the earliest entry in the data sequence is being repeated until all entries have been output and the intermediate log stream is obtained, the preprocessing method further includes: after all data on the source server has been read, closing the log stream of that source server.
In the above embodiment, the original logs that are read form log streams; the original logs are merged and, at the same time, the log entries in them are sorted by time to obtain the data sequence. In this embodiment, the log on every cluster server is generated and recorded in chronological order, so the log recorded on each cluster server is itself already sorted by time.
In a system that uses the preprocessing method of this embodiment, suppose there are n cluster servers. The log preprocessing server can preprocess the logs as follows:
(1) Read one log entry (that is, one line of text in the original log file) from the log file of every cluster server, obtaining an original log made up of n log entries, where n is a natural number.
(2) Sort these n log entries by time to obtain a data sequence.
(3) Output the log entry with the earliest time in the data sequence to the intermediate log stream. Suppose the entry from the i-th cluster server has the earliest time; that entry is sent to the intermediate log stream, and the n - 1 entries left in the data sequence form a remaining sequence.
(4) Read the next log entry, iNext, from the i-th server and fill it into the remaining sequence formed by the n - 1 remaining entries, so that the remaining sequence again becomes a data sequence of n entries. Sort in the same way and output the entry with the earliest time.
Specifically, because the n - 1 entries remaining after step (3) are already sorted, the sort in step (4) does not need to re-sort all n entries.
More specifically, a binary search may be used to insert the next log entry iNext into the remaining sequence. With binary search, the time complexity of finding the insertion position is reduced to O(log n).
(5) Repeat step (4) until the original logs on the cluster servers have been completely read and output.
Specifically, during step (5), if the original log of a cluster server has been completely read, the log stream corresponding to that cluster server is closed and removed from the sorting module, and step (4) continues to be repeated for the original logs on the remaining servers until the original logs of all servers have been completely read. The intermediate log stream that is finally output thus achieves both merging and sorting.
Specifically, a log stream channel is formed when the original log on each cluster server is read, and after the original log on a cluster server has been completely read, the corresponding log stream channel is closed.
Closing and removing a log stream in the sorting module does not delete the original log file; for security reasons, original logs are usually archived as backups. In this embodiment, the original logs recorded on the cluster servers are already in chronological order, and the log preprocessing server reads them entry by entry, which does not disturb the chronological order of the entries on each cluster server. This avoids repeated reading and repeated sorting, and, because it exploits the inherent chronological order of the original logs on the cluster servers, it also avoids a full sort on every read. Merging and sorting of log data are combined into one process, the time complexity per entry is reduced to O(log n), and computational efficiency is greatly improved.
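Steps (1) to (5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the names `merge_logs` and `timestamp` are hypothetical, each log line is assumed to begin with a sortable ISO 8601 timestamp, and in-memory lists stand in for the per-server log streams.

```python
import bisect
from typing import Iterable, Iterator, List, Tuple


def timestamp(line: str) -> str:
    # Assumption: every line starts with a sortable ISO 8601 timestamp.
    return line.split(" ", 1)[0]


def merge_logs(streams: List[Iterable[str]]) -> Iterator[str]:
    """Merge per-server log streams, each already in time order."""
    iters = [iter(s) for s in streams]
    # Steps (1)-(2): read one entry per server and sort the entries by time.
    # Each item is (timestamp, server index, line); the index breaks ties.
    window: List[Tuple[str, int, str]] = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            window.append((timestamp(first), i, first))
    window.sort()
    while window:
        # Step (3): output the earliest entry; i is its source server.
        _, i, line = window.pop(0)
        yield line
        # Step (4): read the next entry from server i and place it into the
        # still-sorted remaining sequence by binary search.
        nxt = next(iters[i], None)
        if nxt is not None:
            bisect.insort(window, (timestamp(nxt), i, nxt))
        # Otherwise server i is exhausted and simply drops out (step (5)).


server_a = ["2024-01-01T00:00:01 u1 /a", "2024-01-01T00:00:04 u2 /d"]
server_b = ["2024-01-01T00:00:02 u2 /b", "2024-01-01T00:00:03 u1 /c"]
server_c: List[str] = []  # a server with no entries
merged = list(merge_logs([server_a, server_b, server_c]))
```

The window never holds more than one entry per server, so memory use depends on n and not on the size of the logs.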
In the above embodiment of the present invention, the log preprocessing server 1 may obtain the user IDs in the intermediate log stream and then split the intermediate log stream according to the user IDs to obtain the preprocessed logs.
Specifically, the intermediate log stream may be split according to user ID, and the preprocessed logs obtained by splitting are written to a plurality of preprocessing result files, each of which contains entries with the same user ID. The result files may be written to a database for use in subsequent log statistics and analysis.
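Splitting by user ID can be sketched as below. The field layout (user ID as the second space-separated field), the `<user ID>.log` file naming, and the function names are assumptions made for this example only; the text does not specify them.

```python
import os
import tempfile
from typing import Dict, Iterable, TextIO


def user_id(line: str) -> str:
    # Assumption: the second space-separated field is the user ID.
    return line.split(" ")[1]


def split_by_user(lines: Iterable[str], out_dir: str) -> Dict[str, str]:
    """Write each line to <out_dir>/<user ID>.log; return paths by user ID."""
    handles: Dict[str, TextIO] = {}
    paths: Dict[str, str] = {}
    try:
        for line in lines:
            uid = user_id(line)
            if uid not in handles:
                # Assumption: user IDs are safe to use as file names.
                paths[uid] = os.path.join(out_dir, uid + ".log")
                handles[uid] = open(paths[uid], "w", encoding="utf-8")
            handles[uid].write(line + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return paths


intermediate = [
    "2024-01-01T00:00:01 u1 /a",
    "2024-01-01T00:00:02 u2 /b",
    "2024-01-01T00:00:03 u1 /c",
]
with tempfile.TemporaryDirectory() as tmp:
    paths = split_by_user(intermediate, tmp)
    with open(paths["u1"], encoding="utf-8") as f:
        u1_lines = f.read().splitlines()
```

Each result file is written exactly once and in time order, because the input is consumed as a single pass over the already sorted stream.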
In the above embodiment of the present invention, the whole preprocessing procedure requires only a single file read (reading the original log files) and a single file write (writing the preprocessed log files). Because the volume of original log data is very large and the number of preprocessed log files is large, reading and writing take up most of the preprocessing time. Reducing the number of file reads and writes therefore saves preprocessing time and improves preprocessing efficiency. In this embodiment, both the original logs and the preprocessed logs may be files.
Fig. 5 is a flowchart of a web log preprocessing method according to an embodiment of the present invention. As shown in Fig. 5, the method may include the following steps:
Step S102: read original logs from cluster servers.
Step S104: merge and sort the original logs to obtain an intermediate log stream.
Step S106: split the intermediate log stream to obtain preprocessed logs.
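The three steps can be chained into one streaming pass, as in the hedged sketch below. Note the substitution: the merge in S104 here uses Python's standard-library `heapq.merge`, a heap-based k-way merge, as a stand-in for the sorted-sequence approach of the embodiment. The names `lines`, `preprocess`, and `sink_for`, the log format, and the in-memory `io.StringIO` objects that replace sockets and disk files are all assumptions.

```python
import heapq
import io
from collections import defaultdict
from typing import Callable, Iterable, Iterator, TextIO


def lines(stream: TextIO) -> Iterator[str]:
    # S102: read a source line by line (a file-like object stands in
    # for a socket stream to a cluster server).
    for raw in stream:
        yield raw.rstrip("\n")


def preprocess(sources: Iterable[Iterable[str]],
               sink_for: Callable[[str], TextIO]) -> None:
    # S104: lazily merge the time-ordered sources into one sorted stream.
    # Assumption: each line starts with a sortable timestamp.
    merged = heapq.merge(*sources, key=lambda line: line.split(" ", 1)[0])
    # S106: route each entry to the sink for its user ID.
    # Assumption: the user ID is the second space-separated field.
    for line in merged:
        sink_for(line.split(" ")[1]).write(line + "\n")


server_a = io.StringIO("2024-01-01T00:00:01 u1 /a\n2024-01-01T00:00:04 u2 /d\n")
server_b = io.StringIO("2024-01-01T00:00:02 u2 /b\n2024-01-01T00:00:03 u1 /c\n")
sinks = defaultdict(io.StringIO)  # one in-memory "result file" per user ID
preprocess([lines(server_a), lines(server_b)], lambda uid: sinks[uid])
```

No intermediate collection is materialized between the steps: each entry flows from a source to its sink as soon as it becomes the earliest remaining entry.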
With the present invention, after the original logs are read from the cluster servers, log merging and log sorting are combined into a single process, the intermediate log stream obtained by merging and sorting is split to obtain the preprocessed logs, and the preprocessed logs are finally written to disk. No intermediate file is generated, and only a single file read and a single file write are required, which improves the efficiency of log preprocessing. This solves the problem in the prior art that repeated read and write operations make web log preprocessing time-consuming, so that log processing is slow and inefficient. Log data is preprocessed with a single read and a single write, which reduces both the processing time and the number of intermediate files and thereby improves log processing efficiency.
In the embodiment shown in Fig. 2 and Fig. 3, a user sends an access request through a load balancer, and the load balancer distributes the request to the back-end cluster servers (the cluster servers shown in the figures are only an example, and their number is not limited to the three shown). A cluster server receives the access request and records a web log entry, thereby generating the original log.
The log preprocessing server then reads the original log on each cluster server, merges, sorts, and splits the original logs read from the cluster servers, outputs the resulting preprocessed logs, and writes them to disk.
In the above embodiment of the present invention, the step of reading the original logs from the cluster servers may include: reading the original logs from the cluster servers in parallel, line by line, over stream sockets; and storing all of the log entries read in parallel in a log set.
Specifically, as shown in Fig. 3, the log preprocessing server reads the log files on the cluster servers over the network in parallel, line by line, using stream sockets, forming one log stream (that is, the original log) per cluster server. Because the log preprocessing server and the cluster servers are on the same local area network, this reading is efficient and stable.
In this embodiment, the original log may come from a log file stored on disk. The intermediate log stream may take the form of data and may hold one or more log entries. The preprocessed log is likewise the result of data processing; in this embodiment, once the preprocessed log has been obtained in the form of data, it may be written to disk as a file. Because the data is read and processed as a stream, the log preprocessing server can split the intermediate log stream into preprocessed logs and write them to disk while it is still merging and sorting the original logs into the intermediate log stream. The steps of the whole process run in step with one another, intermediate data does not accumulate, and nothing needs to be written to a file and read back.
According to the above embodiment of the present invention, the step of merging and sorting the original logs to obtain the intermediate log stream includes: sorting the log entries in the log set to obtain a data sequence; outputting the log entry with the earliest time in the data sequence; reading from the source server the entry that follows the earliest entry and filling it into the data sequence; and returning to the step of outputting the earliest entry in the data sequence, until all log entries have been output and the intermediate log stream is obtained. Here, the source server is the server from which the earliest log entry originated.
Specifically, the step of reading from the source server the entry that follows the earliest entry and filling it into the data sequence may include: after the log data of the source server has been completely read, closing the log stream of the source server.
In the above embodiment, all of the original logs that are read form log streams; the original logs are merged and, at the same time, every log entry in the log streams is sorted by time to obtain the data sequence. In this embodiment, the original log on every cluster server is generated and recorded in chronological order, so the log recorded on each cluster server is itself already sorted by time.
In a system that uses the preprocessing method of this embodiment, suppose there are n cluster servers. The above embodiment can be implemented as follows:
(1) Read one log entry (that is, one line of text in the original log file) from the log file of every cluster server, obtaining an original log made up of n log entries, where n is a natural number.
(2) Sort these n log entries by time to obtain a data sequence.
(3) Output the log entry with the earliest time in the data sequence to the intermediate log stream. Suppose the entry from the i-th cluster server has the earliest time; that entry is sent to the intermediate log stream, and the n - 1 entries left in the data sequence form a remaining sequence.
(4) Read the next log entry, iNext, from the i-th server and fill it into the remaining sequence formed by the n - 1 remaining entries, so that the remaining sequence again becomes a data sequence of n entries. Sort in the same way and output the entry with the earliest time.
Specifically, because the n - 1 entries remaining after step (3) are already sorted, the sort in step (4) does not need to re-sort all n entries.
More specifically, a binary search may be used to insert the next log entry iNext into the remaining sequence. With binary search, the time complexity of finding the insertion position is reduced to O(log n).
(5) Repeat step (4) until the original logs on the cluster servers have been completely read and output.
Specifically, during step (5), if the original log of a cluster server has been completely read, the log stream corresponding to that cluster server is closed and removed from the sorting module, and step (4) continues to be repeated for the original logs on the remaining servers until the original logs of all servers have been completely read. The intermediate log stream that is finally output thus achieves both merging and sorting.
Closing and removing a log stream in the sorting module does not delete the original log file; for security reasons, original logs are usually archived as backups. In this embodiment, the original logs recorded on the cluster servers are already in chronological order, and the log preprocessing server reads them entry by entry, which does not disturb the chronological order of the entries on each cluster server. This avoids repeated reading and repeated sorting, and, because it exploits the inherent chronological order of the original logs on the cluster servers, it also avoids a full sort on every read. Merging and sorting of log data are combined into one process, the time complexity per entry is reduced to O(log n), and computational efficiency is greatly improved.
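The O(log n) figure can be made concrete with a short cost estimate. The symbol N (the total number of log entries across all servers) is introduced here for illustration and does not appear in the description; the estimate counts timestamp comparisons only and assumes the binary-search insertion of step (4).

```latex
% n: number of cluster servers; N: total number of log entries (assumed symbol).
% Requires the amsmath package.
\begin{align*}
C_{\text{init}}   &= O(n \log n)
  && \text{step (2): sort the first } n \text{ entries once} \\
C_{\text{insert}} &\le \lceil \log_2 n \rceil
  && \text{step (4): binary search among } n - 1 \text{ sorted entries} \\
C_{\text{total}}  &= O(n \log n) + N \cdot O(\log n) = O(N \log n)
  && \text{for } N \ge n
\end{align*}
```

For example, with n = 16 servers, each newly read entry needs at most 4 timestamp comparisons to find its position, regardless of how large the log files are.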
In the above embodiment of the present invention, the step of splitting the intermediate log stream to obtain the preprocessed logs may include: obtaining the user IDs in the intermediate log stream; and splitting the intermediate log stream according to the user IDs to obtain the preprocessed logs.
Specifically, the intermediate log stream may be split according to user ID, and the preprocessed logs obtained by splitting are written to a plurality of preprocessing result files, each of which contains entries with the same user ID. The result files may be written to a database for use in subsequent log statistics and analysis.
In the above embodiment of the present invention, the whole preprocessing procedure requires only a single file read (reading the original logs) and a single file write (writing the preprocessed logs). Because the volume of original log data is very large and the number of preprocessed target log files is large, reading and writing take up most of the preprocessing time. Reducing the number of file reads and writes therefore saves preprocessing time and improves preprocessing efficiency.
It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one given here.
Fig. 6 is a schematic structural diagram of a web log preprocessing apparatus according to an embodiment of the present invention. As shown in Fig. 6, the apparatus may include: a first reading module 10, configured to read original logs from cluster servers; a merge-sort module 30, configured to merge and sort the original logs to obtain an intermediate log stream; and a splitting module 50, configured to split the intermediate log stream to obtain preprocessed logs.
With the present invention, after the original logs are read from the cluster servers, log merging and log sorting are combined into a single process, the intermediate log stream obtained by merging and sorting is split to obtain the preprocessed logs, and the preprocessed logs are finally written to disk. No intermediate file is generated, and only a single file read and a single file write are required, which improves the efficiency of log preprocessing. This solves the problem in the prior art that repeated read and write operations make web log preprocessing time-consuming, so that log processing is slow and inefficient. Log data is preprocessed with a single read and a single write, which reduces both the processing time and the number of intermediate files and thereby improves log processing efficiency.
In the embodiment shown in Fig. 2 and Fig. 3, the apparatus of the above embodiment may be provided in the log preprocessing server. A user (that is, a client) sends an access request through a load balancer, and the load balancer distributes the request to the back-end cluster servers (the cluster servers shown in the figures are only an example, and their number is not limited to the three shown). A cluster server receives the access request and records a web log entry, thereby generating the original log.
The first reading module of the log preprocessing server then reads the original log on each cluster server; the original logs read from the cluster servers are merged, sorted, and split, the resulting preprocessed logs are output, and the preprocessed logs are written to disk.
According to the above embodiment of the present invention, the first reading module 10 may include: a parallel reading module, configured to read the original logs from the cluster servers in parallel, line by line, over stream sockets; and a storage module, configured to store all of the log entries read in parallel in a log set.
In the above embodiment of the present invention, the merge-sort module 30 may include: a sorting module, configured to sort the log entries in the log set to obtain a data sequence; an output module, configured to output the log entry with the earliest time in the data sequence; a replenishment module, configured to read from the source server the entry that follows the earliest entry and fill it into the data sequence; and a return execution module, configured to return to the step of outputting the earliest entry in the data sequence, until all log entries have been output and the intermediate log stream is obtained. Here, the source server is the server from which the earliest log entry originated.
Specifically, the replenishment module may include a closing module, configured to close the log stream of the source server after the log data of the source server has been completely read.
In the above embodiment, all of the original logs that are read form log streams; the log data is merged and, at the same time, the log entries in the log streams are sorted by time to obtain the data sequence. In this embodiment, the log on every cluster server is generated and recorded in chronological order, so the log recorded on each cluster server is itself already sorted by time.
In a system that uses the preprocessing method of this embodiment, suppose there are n cluster servers. The above embodiment can be implemented as follows:
(1) Read one log entry (that is, one line of text in the original log file) from the log file of every cluster server, obtaining an original log made up of n log entries, where n is a natural number.
(2) Sort these n log entries by time to obtain a data sequence.
(3) Output the log entry with the earliest time in the data sequence to the intermediate log stream. Suppose the entry from the i-th cluster server has the earliest time; that entry is sent to the intermediate log stream, and the n - 1 entries left in the data sequence form a remaining sequence.
(4) Read the next log entry, iNext, from the i-th server and fill it into the remaining sequence formed by the n - 1 remaining entries, so that the remaining sequence again becomes a data sequence of n entries. Sort in the same way and output the entry with the earliest time.
Specifically, because the n - 1 entries remaining after step (3) are already sorted, the sort in step (4) does not need to re-sort all n entries.
More specifically, a binary search may be used to insert the next log entry iNext into the remaining sequence. With binary search, the time complexity of finding the insertion position is reduced to O(log n).
(5) Repeat step (4) until the original logs on the cluster servers have been completely read and output.
Specifically, during step (5), if the original log of a cluster server has been completely read, the log stream corresponding to that cluster server is closed and removed from the sorting module, and step (4) continues to be repeated for the original logs on the remaining servers until the original logs of all servers have been completely read. The intermediate log stream that is finally output thus achieves both merging and sorting.
Closing and removing a log stream in the sorting module does not delete the original log file; for security reasons, original logs are usually archived as backups. In this embodiment, the original logs recorded on the cluster servers are already in chronological order, and the log preprocessing server reads them entry by entry, which does not disturb the chronological order of the entries on each cluster server. This avoids repeated reading and repeated sorting, and, because it exploits the inherent chronological order of the original logs on the cluster servers, it also avoids a full sort on every read. Merging and sorting of log data are combined into one process, the time complexity per entry is reduced to O(log n), and computational efficiency is greatly improved.
In the above embodiment of the present invention, the splitting module 50 may include: an acquisition module, configured to obtain the user IDs in the intermediate log stream; and a splitting sub-module, configured to split the intermediate log stream according to the user IDs to obtain the preprocessed logs.
The closing module, the acquisition module, and the splitting sub-module correspond respectively to the steps of the method embodiment described above for closing the log stream of the source server and for obtaining the preprocessed logs. The examples and application scenarios implemented by these three modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the method embodiment. The closing module, the acquisition module, and the splitting sub-module run in a terminal and may be implemented in software or in hardware.
With the above embodiment of the present invention, preferably a single log entry may be read from the cluster servers, merged and sorted to obtain the intermediate log stream (in this embodiment the intermediate log stream holds only that one entry), and then split and written to disk. No "intermediate file" is generated during this processing.
As can be seen from the above description, the present invention achieves the following technical effects: log merging and log sorting are combined into one process, and the time complexity per entry is reduced to O(log n), which improves computational efficiency; logs are read line by line as data streams, so the space complexity is constant, which reduces resource usage; and the logs are merged and sorted first and split afterwards, with data streams used for transfer between servers, so no intermediate file is generated and the time cost of reading and writing files is saved.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented with general-purpose computing devices. They may be concentrated on a single computing device or distributed over a network formed by a plurality of computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they may each be made into a separate integrated circuit module, or several of the modules or steps may be made into a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.
The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.