CN103595571A - Preprocessing method, device and system for website access logs - Google Patents

Preprocessing method, device and system for website access logs

Info

Publication number
CN103595571A
Authority
CN
China
Prior art keywords
log
data
stream
server
read
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310591082.7A
Other languages
Chinese (zh)
Other versions
CN103595571B (en
Inventor
何恺铎
饶峰云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201310591082.7A (patent CN103595571B)
Publication of CN103595571A
Application granted
Publication of CN103595571B
Legal status: Active

Landscapes

  • Computer And Data Communications (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a preprocessing method, device and system for website access logs. The method comprises: reading original logs from cluster servers; merging and sorting the original logs to obtain an intermediate log stream; and splitting the intermediate log stream to obtain preprocessed logs. The invention addresses the problem in the prior art that repeated read and write operations make preprocessing of website access logs time-consuming, so that log processing is slow and inefficient. Preprocessing of the log data is completed with a single read and a single write, which reduces processing time and the number of intermediate files and thereby improves log processing efficiency.

Description

Preprocessing method, device and system for website access logs
Technical field
The present invention relates to the field of data processing, and in particular to a preprocessing method, device and system for website access logs.
Background
With the development of the Internet, the number of Internet users keeps growing and website traffic keeps rising, so a single server can no longer handle the volume of visits. A common approach is to use a load-balancing cluster: one or more front-end load balancers distribute the workload across a group of back-end servers, and the back-end servers record logs for the requests they receive. As traffic rises, log files keep growing, but the time allowed for processing them does not. How to improve the efficiency of log-file processing has therefore become a problem the field must face.
The earliest log processing methods read the original log files directly and then analyzed the data in them. This is very inefficient, because every distinct analysis has to read all of the original logs again. Log processing methods in common use today consist of two parts, preprocessing and subsequent statistical analysis. The preprocessing part is shared by all subsequent statistical analyses and usually comprises three main processes: splitting, merging, and sorting. Splitting is needed because a subsequent analysis may target only the logs carrying a particular identifier; merging is needed because the original logs are distributed across multiple cluster servers and must be analyzed together; sorting is needed because the analysis depends on the order and causality of events. These three needs are very common, and the existing practice is as follows. First, on the computer cluster, the original logs are split into multiple identifier files according to a shared identifier (such as a user ID). Next, the log processing server reads the identifier files from the cluster servers and merges the identifier files that share an identifier into one file, the target file. Finally, the log records in the target file are sorted by time to generate the preprocessed log file.
The prior art treats the splitting, merging, and sorting of log files as separate processes. Intermediate files such as the identifier files and target files are generated along the way, so files are read and written repeatedly. Because the log files are large and the intermediate files numerous, this reading and writing is very time-consuming and lowers the efficiency of log preprocessing as a whole.
No effective solution has yet been proposed for the problem that repeated read and write operations in the prior art make preprocessing of website access logs time-consuming, so that log processing is slow and inefficient.
Summary of the invention
No effective solution has yet been proposed for the problem in the related art that repeated read and write operations make preprocessing of website access logs time-consuming, so that log processing is slow and inefficient. The main object of the present invention is therefore to provide a preprocessing method, device and system for website access logs that solve this problem.
To achieve the above object, according to one aspect of the present invention, a preprocessing method for website access logs is provided. The method comprises: reading original logs from cluster servers; merging and sorting the original logs to obtain an intermediate log stream; and splitting the intermediate log stream to obtain preprocessed logs.
Further, the step of reading original logs from the cluster servers comprises: reading the original logs from the cluster servers in parallel, line by line, over stream sockets; and saving all log records read in parallel into a log set.
Further, the step of merging and sorting the original logs to obtain the intermediate log stream comprises: sorting the log records in the log set to obtain a data sequence; outputting the log record with the earliest time in the data sequence; adding to the data sequence the next log record read from the source server, that is, the record following the earliest one; and returning to the step of outputting the log record with the earliest time in the data sequence until all log records have been output, thereby obtaining the intermediate log stream. Here the source server is the server from which the log record with the earliest time originated.
Further, the step of adding to the data sequence the next log record read from the source server also comprises: closing the log stream of the source server after all of its log records have been read.
Further, the step of splitting the intermediate log stream to obtain preprocessed logs comprises: obtaining the user identifiers in the intermediate log stream; and splitting the intermediate log stream according to the user identifiers to obtain the preprocessed logs.
To achieve the above object, according to another aspect of the present invention, a preprocessing device for website access logs is provided. The device comprises: a first reading module, configured to read original logs from cluster servers; a merge-sort module, configured to merge and sort the original logs to obtain an intermediate log stream; and a splitting module, configured to split the intermediate log stream to obtain preprocessed logs.
Further, the first reading module comprises: a parallel reading module, configured to read the original logs from the cluster servers in parallel, line by line, over stream sockets; and a saving module, configured to save all log records read in parallel into a log set.
Further, the merge-sort module comprises: a sorting module, configured to sort the log records in the log set to obtain a data sequence; an output module, configured to output the log record with the earliest time in the data sequence; a supplementing module, configured to add to the data sequence the next log record read from the source server, that is, the record following the earliest one; and a loop-back execution module, configured to return to the step of outputting the log record with the earliest time in the data sequence until all log records have been output, thereby obtaining the intermediate log stream. Here the source server is the server from which the log record with the earliest time originated.
Further, the supplementing module comprises: a closing module, configured to close the log stream of the source server after all of its log records have been read.
Further, the splitting module comprises: an acquisition module, configured to obtain the user identifiers in the intermediate log stream; and a splitting submodule, configured to split the intermediate log stream according to the user identifiers to obtain the preprocessed logs.
To achieve the above object, according to a further aspect of the present invention, a preprocessing system for website access logs is provided. The system comprises: a plurality of cluster servers; and a log preprocessing server, connected to the plurality of cluster servers and configured to read original logs from the cluster servers, merge and sort the original logs to obtain an intermediate log stream, and then split the intermediate log stream to obtain preprocessed logs.
Further, the log preprocessing server comprises: a reading device, connected to the plurality of cluster servers and configured to read the original logs from the cluster servers in parallel, line by line, over stream sockets.
Further, the log preprocessing server comprises: a processor, connected to the reading device and configured to sort the log records in the log set to obtain a data sequence, output the log record with the earliest time in the data sequence, add to the data sequence the next log record read from the source server, and return to the step of outputting the log record with the earliest time in the data sequence until all log records have been output, thereby obtaining the intermediate log stream. Here the source server is the server from which the log record with the earliest time originated.
With the present invention, after the original logs are read from the cluster servers, log merging and sorting are combined into one process, the intermediate log stream produced by merging and sorting is split to obtain the preprocessed logs, and the preprocessed logs are finally written to disk. No intermediate files are generated and only a single read and a single write are needed, which improves the efficiency of log preprocessing. This solves the problem in the prior art that repeated read and write operations make preprocessing of website access logs time-consuming, so that log processing is slow and inefficient: log data is preprocessed with a single read and a single write, processing time and the number of intermediate files are reduced, and log processing efficiency is improved.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic structural diagram of a preprocessing system for website access logs according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a system to which an optional preprocessing system according to an embodiment of the present invention is applied;
Fig. 3 is a structural diagram of an optional preprocessing system according to an embodiment of the present invention;
Fig. 4 is a block diagram of log data processing by the preprocessing system for website access logs according to an embodiment of the present invention;
Fig. 5 is a flow chart of a preprocessing method for website access logs according to an embodiment of the present invention; and
Fig. 6 is a schematic structural diagram of a preprocessing device for website access logs according to an embodiment of the present invention.
Detailed description of the embodiments
It should be noted that, provided they do not conflict, the embodiments in this application and the features in those embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 is a schematic structural diagram of a preprocessing system for website access logs according to an embodiment of the present invention. As shown in Fig. 1, the system may comprise a plurality of cluster servers 2 and a log preprocessing server 1.
The plurality of cluster servers 2 are configured to record original logs.
The log preprocessing server 1 is connected to the plurality of cluster servers 2 and is configured to read original logs from the cluster servers, merge and sort the original logs to obtain an intermediate log stream, and then split the intermediate log stream to obtain preprocessed logs.
With the above system, after the log preprocessing server reads the original logs from the cluster servers, log merging and sorting are combined into one process and the intermediate log stream produced by merging and sorting is split to obtain the preprocessed logs. No intermediate files are generated and only a single read and a single write are needed, which improves the efficiency of log preprocessing. This solves the problem in the prior art that repeated read and write operations make preprocessing of website access logs time-consuming, so that log processing is slow and inefficient: log data is preprocessed with a single read and a single write, processing time and the number of intermediate files are reduced, and log processing efficiency is improved.
In this embodiment, the original logs may be log files stored on disk. The intermediate log stream may take the form of data and may hold one or more records. The preprocessed logs are likewise the result of data processing; in this embodiment, once the preprocessed logs have been obtained in data form, they may be written to disk as files. Because data is read and processed as a stream, the log preprocessing server can split the intermediate log stream into preprocessed logs and write them to disk while it is still merging and sorting the original logs into the intermediate log stream. The whole process runs concurrently, intermediate data does not pile up, and no files need to be written and then read back.
In the embodiment shown in Figs. 2 and 3, a user (client) sends access requests to the load balancer, and the load balancer distributes them to the back-end cluster servers (the cluster servers in the figures are examples only; their number is not limited to the three shown). Each cluster server receives access requests and records website access logs, generating the original logs.
The log preprocessing server then reads the original logs from the cluster servers in parallel, merges, sorts, and splits the original logs read from the cluster servers to obtain the preprocessed logs, and writes the preprocessed logs to disk.
According to the above embodiment of the present invention, the log preprocessing server 1 may comprise: a reading device, connected to the plurality of cluster servers and configured to read the original logs from the cluster servers in parallel, line by line, over stream sockets, and to save all log records read in parallel into a log set.
Specifically, as shown in Fig. 3, the log preprocessing server reads the log files on the cluster servers over the network in parallel, line by line, through stream sockets, forming one log stream per cluster server (the original logs). Because the log preprocessing server and the cluster servers are on the same local area network, this reading is efficient and stable.
In this embodiment, the original logs of the cluster servers are read in parallel, line by line, as data streams. All cluster servers are read at the same time, one log record from each, so only a small amount of memory is needed. Even if, for speed, several records (for example 10) are read and buffered at a time, memory use remains very small compared with the log file sizes handled in the prior art. Space complexity is constant (the buffer size) and does not grow with the original log files, which greatly reduces the data held in memory, lowers resource usage, and improves processing efficiency.
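The bounded, line-by-line parallel read described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: `io.StringIO` objects stand in for the socket streams (in practice a file-like object from `socket.makefile()` could be iterated the same way), the names `start_reader` and `drain` are hypothetical, and the sample log lines are invented.

```python
import io
import queue
import threading

_END = object()  # sentinel marking the end of one server's stream


def start_reader(stream, buffer_size=10):
    """Read `stream` line by line in a background thread into a bounded queue."""
    q = queue.Queue(maxsize=buffer_size)  # memory use is capped by buffer_size

    def pump():
        for line in stream:
            q.put(line.rstrip("\n"))  # blocks while the buffer is full
        q.put(_END)

    threading.Thread(target=pump, daemon=True).start()
    return q


def drain(q):
    """Yield buffered lines until the end-of-stream sentinel arrives."""
    while True:
        item = q.get()
        if item is _END:
            return
        yield item


# Stand-ins for the streams of two cluster servers (hypothetical data).
server_a = io.StringIO("09:00:01 u1 /a\n09:00:04 u2 /b\n")
server_b = io.StringIO("09:00:02 u2 /c\n")
queues = [start_reader(s) for s in (server_a, server_b)]
lines_a = list(drain(queues[0]))
lines_b = list(drain(queues[1]))
```

Each reader holds at most `buffer_size` records, so memory use depends on the number of servers and the buffer size rather than on the size of the log files.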
In the above embodiment of the present invention, the log preprocessing server 1 may further comprise: a processor, connected to the reading device and configured to sort the log records in the log set to obtain a data sequence, output the log record with the earliest time in the data sequence, add to the data sequence the next log record read from the source server, and return to the step of outputting the log record with the earliest time in the data sequence until all log records have been output, thereby obtaining the intermediate log stream. Here the source server is the server from which the log record with the earliest time originated.
In this embodiment, once the preprocessed logs are obtained, they are written to disk files.
More specifically, the preprocessing system for website access logs may be connected to an analysis processor. The analysis processor is connected to the log preprocessing server and is configured to analyze the preprocessed logs to obtain analysis results.
Specifically, while the step of outputting the log record with the earliest time in the data sequence is repeated until all original logs have been output and the intermediate log stream is obtained, the preprocessing method further comprises: closing the log stream of a source server after all data on that source server has been read.
In the above embodiment, the original logs that are read form log streams; the original logs are merged, and at the same time every log record in them is sorted by time to obtain the data sequence. In this embodiment, because the logs on each cluster server are generated and recorded in chronological order, the logs of each cluster server are already sorted by time.
In a system using the preprocessing method of this embodiment, suppose there are n cluster servers. The log preprocessing server can preprocess the logs as follows:
(1) Read one log record (one line of text in the original log file) from the log file of each cluster server, obtaining original logs made up of n log records, where n is a natural number.
(2) Sort these n log records by time to obtain a data sequence.
(3) Output the log record with the earliest time in the data sequence to the intermediate log stream. Suppose the record from the i-th cluster server has the earliest time: that record is sent to the intermediate log stream, and the n − 1 records left in the data sequence form the remaining sequence.
(4) Read the next log record, iNext, from the i-th server and add it to the remaining sequence of n − 1 records, so that the sequence again holds n log records. Sort it in the same way and output the log record with the earliest time.
Specifically, because the n − 1 records remaining after step (3) are already sorted, the sorting in step (4) does not need to re-sort all n records.
More specifically, binary search can be used to insert the next log record iNext into the remaining sequence. With binary search, the time complexity of finding the insertion position is reduced to O(log n).
(5) Repeat step (4) until the original logs on the cluster servers have all been read and output.
Specifically, during step (5), if the original logs of a cluster server have been read completely, the log stream corresponding to that cluster server is closed and removed from the sorting module, and step (4) continues to be repeated for the original logs on the remaining servers until the original logs of all servers have been read. The intermediate log stream that is finally output thus achieves both merging and sorting.
Specifically, reading the original logs on each cluster server opens one log stream channel; after the original logs on a cluster server have been read completely, the corresponding log stream channel is closed.
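Steps (1) to (5) can be sketched as follows. This is a minimal illustration under assumptions that are not in the source: each record is a text line that begins with a sortable timestamp, the per-server streams are plain Python iterables, and the names `parse_time` and `merge_sorted_streams` are hypothetical. The standard-library `bisect.insort` performs the binary-search insertion.

```python
import bisect


def parse_time(line):
    # Assumption: each line starts with a timestamp that sorts as a string.
    return line.split(" ", 1)[0]


def merge_sorted_streams(streams):
    """Merge per-server streams that are each already sorted by time."""
    iters = [iter(s) for s in streams]

    # Steps (1)-(2): take one record from every server and sort them by time.
    window = []
    for idx, it in enumerate(iters):
        line = next(it, None)
        if line is not None:
            window.append((parse_time(line), idx, line))
    window.sort()

    while window:
        # Step (3): output the earliest record; idx identifies its source server.
        _, idx, line = window.pop(0)
        yield line
        # Step (4): refill from the same source server using binary search.
        nxt = next(iters[idx], None)
        if nxt is not None:
            bisect.insort(window, (parse_time(nxt), idx, nxt))
        # Step (5): an exhausted server simply drops out of the window.
```

Binary search locates the insertion position in O(log n) comparisons; with a Python list, the insertion and `pop(0)` still shift elements, which is negligible when n is the number of servers.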
Closing and removing a log stream in the sorting module does not delete the original log file; for safety, original logs are usually archived and backed up. In this embodiment, the original logs recorded on each cluster server are already in chronological order, and the log preprocessing server reads them record by record, so the time order of the records on each cluster server is preserved. This saves the time of repeated reading and repeated sorting, exploits the inherent time order of the original logs on the cluster servers to avoid a full sort after each read, and combines the merging and sorting of log records into a single process. The time complexity of placing each record is reduced to O(log n), which greatly improves computational efficiency.
In the above embodiment of the present invention, after obtaining the user identifiers in the intermediate log stream, the log preprocessing server 1 may split the intermediate log stream according to the user identifiers to obtain the preprocessed logs.
Specifically, the intermediate log stream may be split by user identifier and the resulting preprocessed logs written to multiple preprocessing result files, each of which contains records with the same user identifier. The result files may be written to a database for use in subsequent log statistics and analysis.
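Splitting the intermediate log stream into per-user result files might look like the sketch below. The record layout (`<timestamp> <user_id> <rest>`), the file naming scheme, and the names `user_id_of` and `split_by_user` are assumptions made for illustration; the source does not specify them.

```python
import os
import tempfile


def user_id_of(line):
    # Assumption: records look like "<timestamp> <user_id> <rest>".
    return line.split(" ", 2)[1]


def split_by_user(records, out_dir):
    """Write each record to the result file of its user; return the user IDs seen."""
    handles = {}
    try:
        for line in records:
            uid = user_id_of(line)
            fh = handles.get(uid)
            if fh is None:
                # One result file per user ID, opened on first use.
                path = os.path.join(out_dir, f"{uid}.log")
                fh = open(path, "w", encoding="utf-8")
                handles[uid] = fh
            fh.write(line + "\n")
    finally:
        for fh in handles.values():
            fh.close()
    return sorted(handles)


out_dir = tempfile.mkdtemp()
stream = ["09:00:01 u1 /a", "09:00:02 u2 /b", "09:00:03 u1 /c"]
users = split_by_user(stream, out_dir)
with open(os.path.join(out_dir, "u1.log"), encoding="utf-8") as fh:
    u1_lines = fh.read().splitlines()
```

Because records arrive already merged and sorted by time, each result file is written once, in time order, without being reopened or re-sorted.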
In the above embodiment of the present invention, the whole preprocessing process needs only a single file read (reading the original log files) and a single file write (writing the preprocessed log files). Because the original logs are very large and the preprocessed log files are numerous, reading and writing account for most of the total preprocessing time; reducing the number of file reads and writes saves preprocessing time and improves preprocessing efficiency. In this embodiment, both the original logs and the preprocessed logs may be files.
Fig. 5 is a flow chart of a preprocessing method for website access logs according to an embodiment of the present invention. As shown in Fig. 5, the method may comprise the following steps:
Step S102: read original logs from the cluster servers.
Step S104: merge and sort the original logs to obtain an intermediate log stream.
Step S106: split the intermediate log stream to obtain preprocessed logs.
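The three steps can be chained into one streaming pass, sketched below. Note that this sketch swaps in the standard-library `heapq.merge` (a heap-based k-way merge) in place of the sorted-sequence procedure the patent describes, and collects results in an in-memory dictionary as a stand-in for result files. The record layout and the names `timestamp`, `user_id`, and `preprocess` are assumptions.

```python
import heapq
from collections import defaultdict


def timestamp(line):
    # Assumption: records look like "<timestamp> <user_id> <rest>".
    return line.split(" ", 1)[0]


def user_id(line):
    return line.split(" ", 2)[1]


def preprocess(server_streams):
    """S102: consume per-server streams; S104: merge by time; S106: split by user."""
    # heapq.merge is lazy: it holds one pending record per input stream.
    intermediate = heapq.merge(*server_streams, key=timestamp)
    result = defaultdict(list)  # stand-in for per-user result files
    for line in intermediate:
        result[user_id(line)].append(line)
    return dict(result)
```

No intermediate files appear between the steps: each record flows from its source stream through the merge directly to its per-user destination.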
With the present invention, after the original logs are read from the cluster servers, log merging and sorting are combined into one process, the intermediate log stream produced by merging and sorting is split to obtain the preprocessed logs, and the preprocessed logs are finally written to disk. No intermediate files are generated and only a single read and a single write are needed, which improves the efficiency of log preprocessing. This solves the problem in the prior art that repeated read and write operations make preprocessing of website access logs time-consuming, so that log processing is slow and inefficient: log data is preprocessed with a single read and a single write, processing time and the number of intermediate files are reduced, and log processing efficiency is improved.
In the embodiment shown in Figs. 2 and 3, a user sends access requests through the load balancer, and the load balancer distributes the requests to the back-end cluster servers (the cluster servers in the figures are examples only; their number is not limited to the three shown). Each cluster server receives access requests and records website access logs, generating the original logs.
The log preprocessing server then reads the original logs from each cluster server, merges, sorts, and splits the original logs read from the cluster servers to output the preprocessed logs, and writes the preprocessed logs to disk.
In the above embodiment of the present invention, the step of reading original logs from the cluster servers may comprise: reading the original logs from the cluster servers in parallel, line by line, over stream sockets; and saving all log records read in parallel into a log set.
Specifically, as shown in Fig. 3, the log preprocessing server reads the log files on the cluster servers over the network in parallel, line by line, through stream sockets, forming one log stream per cluster server (the original logs). Because the log preprocessing server and the cluster servers are on the same local area network, this reading is efficient and stable.
In this embodiment, the original logs may come from log files stored on disk. The intermediate log stream may take the form of data and may hold one or more records. The preprocessed logs are likewise the result of data processing; in this embodiment, once the preprocessed logs have been obtained in data form, they may be written to disk as files. Because data is read and processed as a stream, the log preprocessing server can split the intermediate log stream into preprocessed logs and write them to disk while it is still merging and sorting the original logs into the intermediate log stream. The whole process runs concurrently, intermediate data does not pile up, and no files need to be written and then read back.
According to the above embodiment of the present invention, the step of merging and sorting the original logs to obtain the intermediate log stream comprises: sorting the log records in the log set to obtain a data sequence; outputting the log record with the earliest time in the data sequence; adding to the data sequence the next log record read from the source server, that is, the record following the earliest one; and returning to the step of outputting the log record with the earliest time in the data sequence until all log records have been output, thereby obtaining the intermediate log stream. Here the source server is the server from which the log record with the earliest time originated.
Specifically, the step of adding to the data sequence the next log record read from the source server may comprise: closing the log stream of the source server after all of its log records have been read.
In the above embodiment, all original logs that are read form log streams; all original logs are merged, and at the same time every log record in the log streams is sorted by time to obtain the data sequence. In this embodiment, because the original logs on each cluster server are generated and recorded in chronological order, the logs of each cluster server are already sorted by time.
In a system using the preprocessing method of this embodiment, suppose there are n cluster servers. The above embodiment can be implemented as follows:
(1) Read one log record (one line of text in the original log file) from the log file of each cluster server, obtaining original logs made up of n log records, where n is a natural number.
(2) Sort these n log records by time to obtain a data sequence.
(3) Output the log record with the earliest time in the data sequence to the intermediate log stream. Suppose the record from the i-th cluster server has the earliest time: that record is sent to the intermediate log stream, and the n − 1 records left in the data sequence form the remaining sequence.
(4) Read the next log record, iNext, from the i-th server and add it to the remaining sequence of n − 1 records, so that the sequence again holds n log records. Sort it in the same way and output the log record with the earliest time.
Specifically, because the n − 1 records remaining after step (3) are already sorted, the sorting in step (4) does not need to re-sort all n records.
More specifically, binary search can be used to insert the next log record iNext into the remaining sequence. With binary search, the time complexity of finding the insertion position is reduced to O(log n).
(5) Repeat step (4) until the original logs on the cluster servers have all been read and output.
Specifically, during step (5), if the original logs of a cluster server have been read completely, the log stream corresponding to that cluster server is closed and removed from the sorting module, and step (4) continues to be repeated for the original logs on the remaining servers until the original logs of all servers have been read. The intermediate log stream that is finally output thus achieves both merging and sorting.
Closing and removing a log stream in the sorting module does not delete the original log file; for safety, original logs are usually archived and backed up. In this embodiment, the original logs recorded on each cluster server are already in chronological order, and the log preprocessing server reads them record by record, so the time order of the records on each cluster server is preserved. This saves the time of repeated reading and repeated sorting, exploits the inherent time order of the original logs on the cluster servers to avoid a full sort after each read, and combines the merging and sorting of log records into a single process. The time complexity of placing each record is reduced to O(log n), which greatly improves computational efficiency.
In the above embodiment of the present invention, the step of segmenting the intermediate log stream to obtain the preprocessed logs may comprise: obtaining the user IDs in the intermediate log stream; and segmenting the intermediate log stream according to the user IDs to obtain the preprocessed logs.
Specifically, the intermediate log stream can be segmented by user ID, and the segmented preprocessed logs are written to a plurality of preprocessing result files, each of which contains records with the same user ID. These result files can be written into a database for use in subsequent log statistics and analysis.
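The segmentation by user ID might look roughly like the following Python sketch. The line format ("timestamp user_id url"), the file naming scheme `<user_id>.log` and the helper name `split_by_user` are assumptions for illustration; the patent does not specify them.

```python
import os
import tempfile


def split_by_user(records, user_id_of, out_dir):
    """Write each record to the result file of its user ID.

    Assumes user IDs are safe to use as file names and that the number of
    distinct users is small enough to keep one open handle per user.
    """
    handles = {}
    try:
        for record in records:
            uid = user_id_of(record)
            if uid not in handles:
                path = os.path.join(out_dir, f"{uid}.log")
                handles[uid] = open(path, "w", encoding="utf-8")
            handles[uid].write(record + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)


# Hypothetical intermediate log stream, already merged and sorted by time.
stream = ["1 u1 /a", "2 u2 /b", "3 u1 /c"]
with tempfile.TemporaryDirectory() as out_dir:
    users = split_by_user(stream, lambda line: line.split()[1], out_dir)
    with open(os.path.join(out_dir, "u1.log"), encoding="utf-8") as f:
        u1_lines = f.read().splitlines()
```

Because the input stream is already in time order, each result file is also in time order without any further sorting.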
In the above embodiment of the present invention, the whole preprocessing procedure needs only a single file read (reading the original logs) and a single file write (writing the preprocessed logs). Because the volume of original log data is very large and the number of preprocessed target log files is large, reading and writing account for most of the total preprocessing time; reducing the number of file reads and writes therefore shortens the preprocessing time and improves preprocessing efficiency.
It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
Fig. 6 is a schematic structural diagram of a preprocessing device for website access logs according to an embodiment of the present invention. As shown in Fig. 6, the device may comprise: a first reading module 10, configured to read original logs from cluster servers; a merge-sort module 30, configured to merge and sort the original logs to obtain an intermediate log stream; and a segmentation module 50, configured to segment the intermediate log stream to obtain preprocessed logs.
With the present invention, after the original logs are read from the cluster servers, log merging and sorting are combined into one process, the intermediate log stream obtained by merging and sorting is segmented to obtain the preprocessed logs, and finally the preprocessed logs are written to disk. No intermediate files are generated and only a single read and a single write are needed, which improves the efficiency of log preprocessing. This solves the problem in the prior art that repeated read-write operations make the preprocessing of website access logs time-consuming and log processing slow and inefficient: the log data is preprocessed with a single read and a single write, the processing time and the number of intermediate files are reduced, and log processing efficiency is thereby improved.
In the embodiments shown in Fig. 2 to Fig. 3, the device of the above embodiment may be arranged in a log preprocessing server. A user (i.e. a client) sends access requests through a load balancer, and the load balancer distributes the requests to the back-end cluster servers (the cluster servers shown in the figures are only an example; their number is not limited to the three shown). The cluster servers receive the access requests and record the website accesses, generating the original logs.
Then the first reading module of the log preprocessing server reads the original logs on each cluster server; the original logs read from the cluster servers are merged, sorted and segmented to produce the preprocessed logs, which are then written to disk.
According to the above embodiment of the present invention, the first reading module 10 may comprise: a parallel reading module, configured to read the original logs from the cluster servers in parallel, line by line, in a streaming manner; and a saving module, configured to save all log records read in parallel into a log set.
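One possible way to realise such a parallel, line-by-line read is sketched below in Python. The use of one thread and one bounded queue per server, and the names `start_parallel_readers` and `records`, are assumptions of this sketch; the patent does not prescribe a concrete mechanism, and in-memory streams stand in for the cluster servers.

```python
import io
import queue
import threading

_END = object()  # sentinel marking the end of one server's stream


def start_parallel_readers(streams, buffer_size=1024):
    """Start one reader thread per stream; return the "log set" as queues."""
    log_set = [queue.Queue(maxsize=buffer_size) for _ in streams]

    def pump(stream, out):
        # Read line by line so that memory use stays bounded by the queue.
        for line in stream:
            out.put(line.rstrip("\n"))
        out.put(_END)

    for stream, out in zip(streams, log_set):
        threading.Thread(target=pump, args=(stream, out), daemon=True).start()
    return log_set


def records(out):
    """Iterate over one server's queue until its end sentinel arrives."""
    while True:
        item = out.get()
        if item is _END:
            return
        yield item


# Hypothetical stand-ins for two cluster servers.
servers = [io.StringIO("1 /a\n3 /c\n"), io.StringIO("2 /b\n")]
per_server = [list(records(q)) for q in start_parallel_readers(servers)]
```

Each queue preserves the order in which its server wrote the lines, which is the property the merge step relies on.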
In the above embodiment of the present invention, the merge-sort module 30 may comprise: a sorting module, configured to sort the log records in the log set to obtain a data sequence; an output module, configured to output the log record with the earliest time in the data sequence; a replenishing module, configured to insert into the data sequence the next log record read from the source server of that earliest log record; and a loop-execution module, configured to return to the step of outputting the log record with the earliest time in the data sequence until all log records have been output, thereby obtaining the intermediate log stream; wherein the server from which the log record with the earliest time originated is taken as the source server.
Specifically, the replenishing module may comprise: a closing module, configured to close the log stream of the source server after all log records on the source server have been read.
In the above embodiment, all the original logs that are read form log streams; the log records are merged and, at the same time, the log records in the log streams are sorted by time to obtain the data sequence. In this embodiment, because the logs on each cluster server are generated and recorded in time order, the log records of each cluster server are already sorted by time.
In a system using the preprocessing method of the embodiment, suppose there are n cluster servers; the above embodiment is realized as follows:
(1) Read one log record (i.e. one line of text in the original log file) from the log file of each cluster server, obtaining original logs consisting of n log records, where n is a natural number.
(2) Sort these n log records by time to obtain a data sequence.
(3) Output the log record with the earliest time in the data sequence to the intermediate log stream. Suppose the log record from the i-th cluster server has the earliest time: that record is sent to the intermediate log stream, and the n - 1 log records left in the data sequence form the remaining sequence.
(4) Read the next log record, iNext, from the i-th server and insert it into the remaining sequence formed by the n - 1 remaining log records, so that the remaining sequence again becomes a data sequence of n log records; then, in the same way, output the log record with the earliest time.
Specifically, because the n - 1 log records remaining after step (3) are already sorted, the n log records do not need to be re-sorted when the sorting in step (4) is performed.
More specifically, a binary search algorithm can be used to insert the next log record iNext into the remaining sequence. With binary search, the time complexity of locating the insertion position is reduced to O(log n).
(5) Repeat step (4) until all original logs on the cluster servers have been read and output.
Specifically, during step (5), if the original log of a cluster server has been read completely, the log stream corresponding to that server is closed and removed from the sorting module, and step (4) continues to be repeated for the original logs on the remaining servers until the original logs of all servers have been read. The intermediate log stream that is finally output thus achieves both purposes, merging and sorting.
Closing and removing a log stream in the sorting module does not delete the original log file; for security reasons, original logs are usually archived as backups. In this embodiment, because the original logs recorded on each cluster server are already in time order, and the log preprocessing server reads the original logs record by record from each cluster server, the time order of the log records on each cluster server is not disturbed. This not only avoids the time spent on repeated reading and repeated sorting, but also exploits the inherent time order of the original logs on the cluster servers, which saves the time of sorting after each read. In addition, log merging and sorting are combined into a single process, the time complexity is reduced to O(log n), and computational efficiency is greatly improved.
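Steps (1) to (5) can be put together in a short Python sketch. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `merge_sorted_logs`, the `timestamp_of` callback and the "timestamp url" line format are hypothetical, and each input stream is assumed to be already in time order.

```python
import bisect


def merge_sorted_logs(streams, timestamp_of):
    """Yield records from several time-ordered streams in global time order."""
    iterators = [iter(s) for s in streams]
    window = []  # sorted list of (timestamp, stream_index, record)

    # Steps (1)-(2): take one record from every server and sort them.
    for i, it in enumerate(iterators):
        record = next(it, None)
        if record is not None:
            bisect.insort(window, (timestamp_of(record), i, record))

    # Steps (3)-(5): output the earliest record, then refill from its source.
    while window:
        _, i, record = window.pop(0)
        yield record
        following = next(iterators[i], None)
        if following is not None:
            # Binary-search insertion; the rest of the window stays sorted.
            bisect.insort(window, (timestamp_of(following), i, following))
        # Otherwise server i is exhausted and simply drops out of the window.


def ts(line):
    return int(line.split()[0])


server_a = ["1 /a", "4 /b", "9 /c"]
server_b = ["2 /d", "3 /e"]
server_c = []  # a server with no log records
merged = list(merge_sorted_logs([server_a, server_b, server_c], ts))
```

Because the window never holds more than one record per server, memory use depends on the number of servers and not on the size of the logs.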
In the above embodiment of the present invention, the segmentation module 50 may comprise: an obtaining module, configured to obtain the user IDs in the intermediate log stream; and a segmentation submodule, configured to segment the intermediate log stream according to the user IDs to obtain the preprocessed logs.
The closing module, obtaining module and segmentation submodule described above correspond to the respective steps of the above method embodiment (closing the log stream of a source server that has been read completely, and obtaining the preprocessed logs). The examples and application scenarios realized by these three modules are the same as those of the corresponding steps, but are not limited to what the method embodiment discloses. These modules run in a terminal and may be implemented in software or hardware.
With the above embodiment of the present invention, preferably a single log record can be read from the cluster servers, merged and sorted into the intermediate log stream (in this embodiment the intermediate log stream holds only this one record), and then segmented and written to disk; no "intermediate file" is generated during this processing.
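To make the "no intermediate file" property concrete, the whole pipeline can be sketched as a chain of lazy iterators in Python. Note the substitution: this sketch uses the standard-library `heapq.merge` (a heap-based k-way merge) in place of the sorted sequence with binary-search insertion described in the patent, and the names `preprocess` and `open_sink` as well as the "timestamp user_id url" line format are assumptions for illustration.

```python
import heapq
import io


def preprocess(streams, timestamp_of, user_id_of, open_sink):
    """Merge time-ordered streams and split the result by user ID.

    Records flow one at a time from the inputs to the per-user sinks;
    the merged stream is never materialised as a file or a list.
    """
    merged = heapq.merge(*streams, key=timestamp_of)  # lazy iterator
    sinks = {}
    for line in merged:
        uid = user_id_of(line)
        if uid not in sinks:
            sinks[uid] = open_sink(uid)
        sinks[uid].write(line + "\n")
    return sinks


# Hypothetical inputs; in-memory sinks stand in for the result files.
server_a = ["1 u1 /a", "3 u1 /c"]
server_b = ["2 u2 /b"]
sinks = preprocess(
    [server_a, server_b],
    timestamp_of=lambda line: int(line.split()[0]),
    user_id_of=lambda line: line.split()[1],
    open_sink=lambda uid: io.StringIO(),
)
outputs = {uid: sink.getvalue() for uid, sink in sinks.items()}
```

In a real deployment `open_sink` would open a result file per user, so that each input is read once and each output is written once.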
As can be seen from the above description, the present invention achieves the following technical effects: log merging and sorting are combined into a single process and the time complexity is reduced to O(log n), improving computational efficiency; the logs are read line by line as data streams, so the space complexity is constant and resource usage is reduced; merging and sorting are performed before segmentation, and data is transferred between servers as streams, so no intermediate files are generated, saving the time cost of reading and writing files.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus the present invention is not restricted to any specific combination of hardware and software.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (13)

1. A preprocessing method for website access logs, characterized by comprising:
reading original logs from cluster servers;
merging and sorting the original logs to obtain an intermediate log stream;
segmenting the intermediate log stream to obtain preprocessed logs.
2. The preprocessing method according to claim 1, characterized in that the step of reading original logs from cluster servers comprises:
reading the original logs from the cluster servers in parallel, line by line, in a streaming manner;
saving all log records read in parallel into a log set.
3. The preprocessing method according to claim 2, characterized in that the step of merging and sorting the original logs to obtain the intermediate log stream comprises:
sorting the log records in the log set to obtain a data sequence;
outputting the log record with the earliest time in the data sequence;
inserting into the data sequence the log record that follows the log record with the earliest time, read from the source server;
returning to the step of outputting the log record with the earliest time in the data sequence until all log records have been output, to obtain the intermediate log stream;
wherein the server from which the log record with the earliest time originated is taken as the source server.
4. The preprocessing method according to claim 3, characterized in that the step of inserting into the data sequence the log record that follows the log record with the earliest time, read from the source server, further comprises:
after all log records on the source server have been read, closing the log stream of the source server.
5. The preprocessing method according to any one of claims 1 to 4, characterized in that the step of segmenting the intermediate log stream to obtain preprocessed logs comprises:
obtaining the user IDs in the intermediate log stream;
segmenting the intermediate log stream according to the user IDs to obtain the preprocessed logs.
6. A preprocessing device for website access logs, characterized by comprising:
a first reading module, configured to read original logs from cluster servers;
a merge-sort module, configured to merge and sort the original logs to obtain an intermediate log stream;
a segmentation module, configured to segment the intermediate log stream to obtain preprocessed logs.
7. The preprocessing device according to claim 6, characterized in that the first reading module comprises:
a parallel reading module, configured to read the original logs from the cluster servers in parallel, line by line, in a streaming manner;
a saving module, configured to save all log records read in parallel into a log set.
8. The preprocessing device according to claim 7, characterized in that the merge-sort module comprises:
a sorting module, configured to sort the log records in the log set to obtain a data sequence;
an output module, configured to output the log record with the earliest time in the data sequence;
a replenishing module, configured to insert into the data sequence the log record that follows the log record with the earliest time, read from the source server;
a loop-execution module, configured to return to the step of outputting the log record with the earliest time in the data sequence until all log records have been output, to obtain the intermediate log stream;
wherein the server from which the log record with the earliest time originated is taken as the source server.
9. The preprocessing device according to claim 8, characterized in that the replenishing module comprises:
a closing module, configured to close the log stream of the source server after all log records on the source server have been read.
10. The preprocessing device according to any one of claims 6 to 9, characterized in that the segmentation module comprises:
an obtaining module, configured to obtain the user IDs in the intermediate log stream;
a segmentation submodule, configured to segment the intermediate log stream according to the user IDs to obtain the preprocessed logs.
11. A preprocessing system for website access logs, characterized by comprising:
a plurality of cluster servers;
a log preprocessing server, connected with the plurality of cluster servers and configured to read original logs from the cluster servers, merge and sort the original logs to obtain an intermediate log stream, and then segment the intermediate log stream to obtain preprocessed logs.
12. The preprocessing system according to claim 11, characterized in that the log preprocessing server comprises:
a reading device, connected with the plurality of cluster servers and configured to read the original logs from the cluster servers in parallel, line by line, in a streaming manner, and to save all log records read in parallel into a log set.
13. The preprocessing system according to claim 12, characterized in that the log preprocessing server comprises:
a processor, connected with the reading device and configured to sort the log records in the log set to obtain a data sequence, output the log record with the earliest time in the data sequence, then insert into the data sequence the log record that follows the log record with the earliest time, read from the source server, and return to the step of outputting the log record with the earliest time in the data sequence until all log records have been output, to obtain the intermediate log stream;
wherein the server from which the log record with the earliest time originated is taken as the source server.
CN201310591082.7A 2013-11-20 2013-11-20 Preprocessing method, device and system for website access logs Active CN103595571B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310591082.7A CN103595571B (en) 2013-11-20 2013-11-20 Preprocess method, the apparatus and system of web log

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310591082.7A CN103595571B (en) 2013-11-20 2013-11-20 Preprocess method, the apparatus and system of web log

Publications (2)

Publication Number Publication Date
CN103595571A true CN103595571A (en) 2014-02-19
CN103595571B CN103595571B (en) 2018-02-02

Family

ID=50085562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310591082.7A Active CN103595571B (en) 2013-11-20 2013-11-20 Preprocessing method, device and system for website access logs

Country Status (1)

Country Link
CN (1) CN103595571B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5084815A (en) * 1986-05-14 1992-01-28 At&T Bell Laboratories Sorting and merging of files in a multiprocessor
CN101192227A (en) * 2006-11-30 2008-06-04 阿里巴巴公司 Log file analytical method and system based on distributed type computing network
CN103178982A (en) * 2011-12-23 2013-06-26 阿里巴巴集团控股有限公司 Method and device for analyzing log
CN102831181A (en) * 2012-07-31 2012-12-19 北京光泽时代通信技术有限公司 Directory refreshing method for cache files and caching proxy server for implementing directory refreshing method
CN102830950A (en) * 2012-08-03 2012-12-19 苏州迈科网络安全技术股份有限公司 Method and system for sorting monitoring data
CN102968496A (en) * 2012-12-04 2013-03-13 天津神舟通用数据技术有限公司 Parallel sequencing method based on task derivation and double buffering mechanism

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391954A (en) * 2014-11-27 2015-03-04 北京国双科技有限公司 Database log processing method and device
CN104391954B (en) * 2014-11-27 2019-04-09 北京国双科技有限公司 The processing method and processing device of database journal
CN106230618A (en) * 2016-07-21 2016-12-14 柳州龙辉科技有限公司 A kind of system journal centralized processing system
CN107480277A (en) * 2017-08-22 2017-12-15 北京京东尚科信息技术有限公司 Method and device for web log file collection
CN107729375A (en) * 2017-09-13 2018-02-23 微梦创科网络科技(中国)有限公司 A kind of method and device of daily record data sequence
CN107729375B (en) * 2017-09-13 2021-11-23 微梦创科网络科技(中国)有限公司 Log data sorting method and device
CN108228797A (en) * 2017-12-29 2018-06-29 上海全成通信技术有限公司 A kind of high efficiency, low cost processing method of massive logs data
CN108363649A (en) * 2017-12-29 2018-08-03 微梦创科网络科技(中国)有限公司 A kind of method and device of distribution statistical log visit capacity
CN109255069A (en) * 2018-07-31 2019-01-22 阿里巴巴集团控股有限公司 A kind of discrete text content risks recognition methods and system

Also Published As

Publication number Publication date
CN103595571B (en) 2018-02-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Preprocessing method, device and system for website access logs

Effective date of registration: 20190531

Granted publication date: 20180202

Pledgee: Shenzhen Black Horse World Investment Consulting Co.,Ltd.

Pledgor: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Registration number: 2019990000503

CP02 Change in the address of a patent holder

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Patentee after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Patentee before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

PP01 Preservation of patent right

Effective date of registration: 20240604

Granted publication date: 20180202