CN108737549A

CN108737549A - Log analysis method and device for large data volumes

Info

Publication number: CN108737549A
Application number: CN201810513830.2A
Authority: CN
Inventors: 殷浩
Original assignee: Jiangsu Lianmeng Information Engineering Co Ltd
Current assignee: Jiangsu Lianmeng Information Engineering Co Ltd
Priority date: 2018-05-25
Filing date: 2018-05-25
Publication date: 2018-11-02

Abstract

The invention discloses a log analysis method for large data volumes, comprising the following steps: a log collection module collects program run logs at a predetermined time and compresses the collected logs; a log release module uploads the logs compressed by the log collection module to a distributed file system for storage; a log analysis module slices the program run logs into multiple slice tasks and analyzes the log file corresponding to each slice task; and a data preparation module classifies and aggregates the analyzed log files of each slice task by request interface path and imports the statistical results into a database. By storing logs in a distributed file system, the invention provides centralized storage and management of scattered logs. By slicing the logs and then analyzing and classifying them in a large-scale data processing system, it facilitates data analysis, saves hardware resources, and ensures that useful information is not lost.

Description

Log analysis method and device for large data volumes

Technical field

The present invention relates to a log analysis method and device, and in particular to a log analysis method and device for large data volumes.

Background art

Big data has developed rapidly in recent years. As network environments grow more complex and information security requirements become more stringent, internet sites produce raw logs during operation, and programs produce large volumes of run logs during project development. These logs are critical for solving website or program development problems, but they are often scattered, which makes them inconvenient to inspect and analyze. The logs also grow every day; accumulated over time, the total volume becomes very large and occupies excessive hardware resources. In addition, problems are discovered with a delay, and alarm notifications cannot be received promptly.

Summary of the invention

Object of the invention: To overcome the deficiencies of the prior art, the present invention provides a log analysis method and device for large data volumes that address the problems of scattered logs, difficult analysis, and excessive hardware resource usage.

Technical solution: In one aspect, the present invention provides a log analysis method for large data volumes, comprising the following steps:

(1) a log collection module collects program run logs at a predetermined time and compresses the collected program run logs;

(2) a log release module uploads the program run logs compressed by the log collection module to a distributed file system for storage;

(3) a log analysis module slices the program run logs into multiple slice tasks and analyzes the log file corresponding to each slice task;

(4) a data preparation module classifies and aggregates the analyzed log files of each slice task by request interface path and imports the statistical results into a database.

Preferably, the predetermined time in step (1) is around midnight (00:00) each day.

Preferably, in step (1) the collected program run logs are compressed by using shell commands to control packaging and data synchronization: packaging is performed with the tar program, and transfer is performed with rsync synchronization.

Preferably, in step (3) the length of one slice task is 30,000 to 50,000 lines of the program run log.

Preferably, the log analysis in step (3) includes:

extracting the start time and end time of this log analysis and calculating its duration; analyzing API usage frequency; counting the total number of exceptions and the number of each exception; collecting the execution time of each API and calculating the time each API takes; collecting the volume of data uploaded and downloaded through each API and calculating the bandwidth each API consumes; collecting the volume of file uploads and downloads and calculating the bandwidth they consume; and collecting IP address sources, the number of requests per IP, and the total number of IPs.

In another aspect, the present invention also provides an analysis device for large-volume logs, the device comprising: a log collection module, a distributed file system, a log release module, a log analysis module, a data preparation module, and a database;

the log collection module is configured to collect program run logs at a predetermined time and compress the collected logs;

the distributed file system is configured to receive the program run logs uploaded by the servers after compression, for quality analysis;

the log release module is configured to upload the program run logs compressed by the log collection module to the distributed file system for storage;

the log analysis module is configured to slice the program run logs into multiple slice tasks and analyze the log file corresponding to each slice task;

the data preparation module classifies and aggregates the analyzed log files of each slice task by request interface path and generates a data report;

the database is configured to receive the analysis results compiled by the data preparation module.

Preferably, the predetermined time is around midnight (00:00) each day.

Preferably, the collected program run logs are compressed by using shell commands to control packaging and data synchronization: packaging is performed with the tar program, and transfer is performed with rsync synchronization.

Preferably, the length of one slice task is 30,000 to 50,000 lines of the program run log.

Preferably, the analysis of the log files includes:

Advantageous effects: By storing program run logs in a distributed file system, the present invention provides centralized storage and management of scattered logs. By slicing the logs and then analyzing and classifying them in a large-scale data processing system, it extracts and preserves as much valid data as possible, which facilitates data analysis, saves hardware resources, and ensures that useful information is not lost.

Description of the drawings

Fig. 1 is a flowchart of the log analysis method for large data volumes according to one embodiment of the invention;

Fig. 2 is a structural diagram of the log analysis device for large data volumes according to one embodiment of the invention;

Fig. 3 is a structural schematic diagram of the log analysis device according to a further embodiment of the invention.

Detailed description of the embodiments

Embodiment 1

Explanation of technical terms used in the present invention:

Distributed file system: HDFS (Hadoop Distributed File System) is a core sub-project of the Hadoop project and the basis of data storage management in distributed computing. It was developed to meet the need to access and process very large files in a streaming fashion, and it can run on inexpensive commodity servers. Its high fault tolerance, high reliability, high scalability, high availability, and high throughput provide failure-tolerant storage for massive data and make the processing of very large data sets (Large Data Set) much more convenient.

Spark: Apache Spark is a big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 at the AMPLab of the University of California, Berkeley, was open-sourced in 2010, and later became an Apache project. Compared with other big data technologies such as Hadoop, Storm, and MapReduce, Spark has the following advantages:

Spark provides a comprehensive, unified framework for managing big data processing needs across data sets of different natures (text data, graph data, etc.) and data sources (batch data or real-time streaming data).

According to the official documentation, Spark can run applications in a Hadoop cluster up to 100 times faster in memory, and up to 10 times faster even when running on disk.

Spark itself does not provide a distributed file system, so Spark analysis mostly relies on Hadoop's distributed file system, HDFS.

REST: REST stands for Representational State Transfer. It first appeared in Roy Fielding's doctoral dissertation in 2000; Roy Fielding is one of the principal authors of the HTTP specification. REST refers to a set of architectural constraints and principles. If an architecture satisfies the constraints and principles of REST, it is called a RESTful architecture. REST itself does not create new technologies, components, or services; the idea behind RESTful design is to use the existing features and capabilities of the Web and to make better use of the guidelines and constraints in existing Web standards. Although REST was deeply influenced by Web technologies, the REST architectural style is in theory not bound to HTTP; HTTP is simply the only instance currently associated with REST. The REST described here is therefore REST implemented over HTTP.

A RESTful architecture follows the uniform interface principle. The uniform interface contains a limited set of predefined operations, and every resource, whatever its type, is accessed through the same interface. The interface should use standard HTTP methods such as GET, PUT, and POST, and follow the semantics of those methods.

MySQL: MySQL is an open-source relational database management system. Its original developer was the Swedish company MySQL AB. It first came to the attention of administrators with MySQL 3.23 in 2001 and has been widely used since. In 2008 Sun Microsystems acquired MySQL AB and released the first post-acquisition version, MySQL 5.1, which introduced partitioning, row-based replication, and the plugin API. The original BerkeleyDB engine was removed; meanwhile Oracle, which had acquired Innobase Oy, released the InnoDB plugin, which later developed into the well-known InnoDB engine. In 2010 Oracle acquired Sun Microsystems, which brought MySQL under Oracle as well. Oracle then released version 5.5, the first version after that acquisition, whose improvements focused on performance, scalability, replication, partitioning, and Windows support.

RSYNC: rsync is a tool that can perform incremental backups. Combined with scheduled tasks, rsync can synchronize at fixed times or intervals; combined with inotify or sersync, it can perform triggered, real-time synchronization.

rsync can perform the remote copy of scp (rsync does not support remote-to-remote copy, whereas scp does), the local copy of cp, the deletion of rm, and the file listing of "ls -l". Note, however, that the ultimate and original purpose of rsync is to synchronize files between two hosts; the scp/cp/rm-like functions are only auxiliary means of synchronization, and rsync implements them differently from those commands.

By storing program run logs in a distributed file system, the present invention provides centralized storage and management of scattered logs. By slicing the logs and then analyzing and classifying them in a large-scale data processing system, it extracts and preserves as much valid data as possible, which facilitates data analysis, saves hardware resources, and ensures that useful information is not lost.

As shown in Fig. 1, the log analysis method for large data volumes comprises the following steps:

S01: The log collection module 1 collects program run logs at a predetermined time and compresses the collected program run logs.

The predetermined time is preferably around midnight (00:00) each day. Compression is performed by using shell commands to control packaging and data synchronization: packaging is performed with the tar program, and transfer is performed with rsync synchronization. The method is not limited to program run logs extracted and synchronized from individual servers. Where there is a user-accessible internet site, a client sends an access request to the site; the site accesses the database according to the request, obtains the corresponding information, and returns it to the client; on success, the site saves the access request to an access log. The access logs of an internet site with a very large number of clients also need operational analysis, and the method of the present invention applies to them as well.
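The packaging and transfer step can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses Python's standard `tarfile` module in place of a shell call to tar, and it only builds an rsync command line rather than running it. The function names, the `*.log` file pattern, and the remote target string are assumptions made for the example.

```python
import tarfile
from pathlib import Path
from typing import List


def pack_logs(log_dir: str, archive_path: str) -> int:
    """Pack every *.log file in log_dir into one gzip-compressed tar archive.

    Returns the number of files added. The *.log pattern is an assumption.
    """
    count = 0
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(Path(log_dir).glob("*.log")):
            # Store only the file name so the archive unpacks into a flat directory.
            tar.add(str(path), arcname=path.name)
            count += 1
    return count


def rsync_command(archive_path: str, remote: str) -> List[str]:
    """Build (but do not run) an rsync invocation that transfers the archive.

    -a preserves file attributes and -z compresses data in transit.
    The remote target (e.g. "loghost:/data/incoming/") is hypothetical.
    """
    return ["rsync", "-az", archive_path, remote]
```

In practice such a script would be triggered by a scheduler around midnight, and the command list would be passed to `subprocess.run`.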

S02: The log release module 3 uploads the program run logs compressed by the log collection module to the distributed file system (HDFS) 2 for storage. The logs stored in HDFS 2 can be used for subsequent analysis and supervision of project quality or of website logs.
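One way to picture the upload step is a helper that builds the `hdfs dfs` command lines for a date-partitioned target directory. This is only a sketch under assumptions: the patent does not specify a directory layout, so the `base_dir/YYYY/MM/DD` scheme and both function names are hypothetical, and the commands are built but not executed.

```python
from datetime import date
from typing import List


def hdfs_target_dir(base_dir: str, day: date) -> str:
    """Return a per-day directory such as /logs/2018/05/25 (layout is assumed)."""
    return f"{base_dir.rstrip('/')}/{day:%Y/%m/%d}"


def hdfs_upload_commands(archive_path: str, base_dir: str, day: date) -> List[List[str]]:
    """Build the commands that create the target directory and upload the archive."""
    target = hdfs_target_dir(base_dir, day)
    return [
        # -p creates missing parent directories and does not fail if the path exists.
        ["hdfs", "dfs", "-mkdir", "-p", target],
        # -f overwrites an existing file with the same name.
        ["hdfs", "dfs", "-put", "-f", archive_path, target],
    ]
```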

S03: The log analysis module 4 slices the program run logs into multiple slice tasks and analyzes the log file corresponding to each slice task. For ease of analysis, the length of one slice task is set to 30,000 to 50,000 lines of the program run log, which yields M slice tasks; the log analysis module 4 performs analysis and statistics with one slice task as the unit.
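The slicing described above can be illustrated with a small generator that cuts a stream of log lines into fixed-size chunks. This is a sketch, not the module's actual code; the function name and the default of 40,000 lines (a value inside the stated 30,000 to 50,000 range) are assumptions.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def slice_log(lines: Iterable[str], slice_size: int = 40_000) -> Iterator[List[str]]:
    """Yield consecutive slices of at most slice_size lines each."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    it = iter(lines)
    while True:
        chunk = list(islice(it, slice_size))
        if not chunk:
            return  # input exhausted
        yield chunk
```

Because the input is consumed lazily, a log file can be passed in directly as an open file object without loading it fully into memory; only the final slice may be shorter than `slice_size`.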

The log analysis generally includes: extracting the start time and end time of this log analysis and calculating the duration of this log; counting the total number of API calls and the number of calls to each API, for API usage frequency analysis; counting the total number of exceptions and the number of each exception, for exception analysis; collecting the execution time of each API and calculating the time each API takes; collecting the volume of data uploaded and downloaded through each API and calculating the bandwidth each API consumes; collecting the volume of file uploads and downloads and calculating the bandwidth they consume; and collecting IP address sources, the number of requests per IP, and the total number of IPs, for analyzing user scale, user origin, malicious attacks, and so on.
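The statistics listed above can be sketched for a single slice as follows. The patent does not define a log record format, so the `LogRecord` fields, the function name, and the choice to take the duration as the span between the earliest and latest timestamps in the slice are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class LogRecord:
    """One parsed log line (hypothetical schema)."""
    timestamp: datetime
    ip: str
    api_path: str
    elapsed_ms: float
    bytes_transferred: int
    exception: Optional[str] = None


def analyze_slice(records: Iterable[LogRecord]) -> dict:
    """Compute per-slice statistics: time span, API usage, exceptions, bandwidth, IPs."""
    records = list(records)
    api_calls, api_elapsed, api_bytes = Counter(), Counter(), Counter()
    exceptions, ip_requests = Counter(), Counter()
    for r in records:
        api_calls[r.api_path] += 1
        api_elapsed[r.api_path] += r.elapsed_ms
        api_bytes[r.api_path] += r.bytes_transferred
        ip_requests[r.ip] += 1
        if r.exception:
            exceptions[r.exception] += 1
    start = min((r.timestamp for r in records), default=None)
    end = max((r.timestamp for r in records), default=None)
    duration = (end - start).total_seconds() if records else 0.0
    return {
        "start": start,
        "end": end,
        "duration_seconds": duration,
        "total_api_calls": sum(api_calls.values()),
        "api_calls": api_calls,
        "api_elapsed_ms": api_elapsed,
        "api_bytes": api_bytes,
        "total_exceptions": sum(exceptions.values()),
        "exceptions": exceptions,
        "ip_requests": ip_requests,
        "distinct_ips": len(ip_requests),
    }
```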

S04: The data preparation module 5 classifies and aggregates the analyzed log files of each slice task by request interface path and imports the statistical results into the database 6. A data report is generated; authorized users can access the data report through a browser and inspect in detail how the program or website is running. The database 6 is preferably MySQL.
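Grouping by request interface path and importing the result into a database might look like the sketch below. To keep the example self-contained it uses Python's built-in `sqlite3` as a stand-in for the MySQL database named in the text; the table name `api_stats`, its columns, and both function names are hypothetical.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def collate_by_path(rows: Iterable[Tuple[str, float]]) -> Dict[str, Tuple[int, float]]:
    """Group (api_path, elapsed_ms) rows into {api_path: (call_count, total_elapsed_ms)}."""
    totals = defaultdict(lambda: [0, 0.0])
    for path, elapsed in rows:
        totals[path][0] += 1
        totals[path][1] += elapsed
    return {path: (calls, total) for path, (calls, total) in totals.items()}


def import_stats(conn: sqlite3.Connection, stats: Dict[str, Tuple[int, float]]) -> None:
    """Write the collated statistics into a table (sqlite3 stands in for MySQL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS api_stats ("
        "api_path TEXT PRIMARY KEY, calls INTEGER, total_elapsed_ms REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO api_stats VALUES (?, ?, ?)",
        [(path, calls, total) for path, (calls, total) in stats.items()],
    )
    conn.commit()
```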

Embodiment 2

As shown in Fig. 2, the present invention also provides an analysis device for large-volume logs, the device comprising: a log collection module 1, a distributed file system 2, a log release module 3, a log analysis module 4, a data preparation module 5, and a database 6.

The log collection module 1 is configured to collect program run logs at a predetermined time and compress the collected logs.

The distributed file system 2 is configured to receive the program run logs uploaded by the servers after compression, for quality analysis. The logs stored in HDFS can be used for subsequent analysis and supervision of project quality or of website logs.

The log release module 3 is configured to upload the program run logs compressed by the log collection module 1 to the distributed file system 2 for storage.

The log analysis module 4 is configured to slice the logs into multiple slice tasks and analyze the log file corresponding to each slice task.

The data preparation module 5 classifies and aggregates the analyzed log files of each slice task by request interface path and generates a data report.

For ease of analysis, the length of one slice task is set to 30,000 to 50,000 lines of the program run log, which yields M slice tasks; the log analysis module 4 performs analysis and statistics with one slice task as the unit.

The database 6 is configured to receive the analysis results compiled by the data preparation module 5.

A data report is generated; authorized users can access the data report through a browser and inspect in detail how the program or website is running. The database is preferably MySQL.

Embodiment 3

As shown in Fig. 3, a further embodiment of the present invention provides the detailed flow and extended use of the method in a big data processing environment. The system first comprises multiple servers, denoted server 1, server 2, ..., server N. These servers extract and synchronize the data sources, i.e. program run logs or access logs, at around midnight, when server utilization is low and processing is fast.

After the servers finish synchronizing the data sources, the shell scripts of the RSYNC module push the logs to the distributed file system HDFS for storage. The distributed file system delivers the stored logs to Spark, where log slicing, log analysis, and log collation are performed. For ease of analysis, the length of one slice task is set to 30,000 to 50,000 lines of the program run log, which yields M slice tasks; Spark performs analysis and statistics with one slice task as the unit.
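Analyzing each slice independently and then combining the partial results follows a map-and-merge pattern. The sketch below shows that pattern in plain Python rather than with the Spark API, so it is a stand-in for illustration only; the assumed line format (request path as the second whitespace-separated field) and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def count_paths(slice_lines: List[str]) -> Counter:
    """Map step: count requests per interface path within one slice.

    Assumes each line looks like "<METHOD> <PATH> <STATUS>"; malformed lines are skipped.
    """
    counts = Counter()
    for line in slice_lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Merge step: add up the per-slice counters into one overall result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Because each slice is counted independently, the map step can run in parallel across workers, which is the property a framework such as Spark exploits.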

Spark classifies and aggregates the analyzed log files of each slice task by request interface path and generates data reports. The results are stored in a MySQL database. By calling the RESTful server interface, authorized clients logged in from mobile terminals, PCs, or other terminal types can open a browser and view the data reports. The supported terminal types mainly include: Web (PC), Android, iOS, chat software, third-party SMS interfaces, and third-party calls. The log reports stored in HDFS are synchronized to the web server by the shell scripts of the RSYNC module, and the generated files are distributed to the application servers. The log status can be viewed in a browser or retrieved through API calls; anyone with permission can see the relevant log reports.

With the method of the present invention, the analysis data can be shared securely through RESTful interfaces, which makes integration with other business systems straightforward; rsync-based incremental compressed synchronization of logs minimizes bandwidth usage and improves transfer efficiency; and support for immediate notification across multiple mobile terminal types improves problem-handling efficiency and facilitates monitoring.
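The permission-gated report access described above can be sketched as a pure request-handling function. The patent does not specify URL paths, token handling, or response formats, so the `/reports/<name>` route, the token set, and the function name are assumptions; a real deployment would sit behind an HTTP server and a proper authentication mechanism.

```python
import json
from typing import Dict, Set, Tuple


def handle_report_request(
    method: str,
    path: str,
    token: str,
    reports: Dict[str, dict],
    allowed_tokens: Set[str],
) -> Tuple[int, str]:
    """Return (HTTP status code, JSON body) for a report request."""
    if method != "GET":
        # Reports are read-only in this sketch, so only GET is accepted.
        return 405, json.dumps({"error": "method not allowed"})
    if token not in allowed_tokens:
        return 403, json.dumps({"error": "forbidden"})
    prefix = "/reports/"
    name = path[len(prefix):] if path.startswith(prefix) else ""
    if name not in reports:
        return 404, json.dumps({"error": "not found"})
    return 200, json.dumps(reports[name])
```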

Claims

1. A log analysis method for large data volumes, characterized in that the method comprises the following steps:

(1) a log collection module (1) collects program run logs at a predetermined time and compresses the collected program run logs;

(2) a log release module (3) uploads the program run logs compressed by the log collection module to a distributed file system (2) for storage;

(3) a log analysis module (4) slices the program run logs into multiple slice tasks and analyzes the log file corresponding to each slice task;

(4) a data preparation module (5) classifies and aggregates the analyzed log files of each slice task by request interface path and imports the statistical results into a database (6).

2. The log analysis method for large data volumes according to claim 1, characterized in that the predetermined time in step (1) is around midnight (00:00) each day.

3. The log analysis method for large data volumes according to claim 1, characterized in that in step (1) the collected program run logs are compressed by using shell commands to control packaging and data synchronization, packaging being performed with the tar program and transfer being performed with rsync synchronization.

4. The log analysis method for large data volumes according to claim 1, characterized in that in step (3) the length of one slice task is 30,000 to 50,000 lines of the program run log.

5. The log analysis method for large data volumes according to claim 1, characterized in that the log analysis in step (3) includes:

extracting the start time and end time of this log analysis and calculating its duration; analyzing API usage frequency; counting the total number of exceptions and the number of each exception; collecting the execution time of each API and calculating the time each API takes; collecting the volume of data uploaded and downloaded through each API and calculating the bandwidth each API consumes; collecting the volume of file uploads and downloads and calculating the bandwidth they consume; and collecting IP address sources, the number of requests per IP, and the total number of IPs.

6. A log analysis device for large data volumes, characterized in that the device comprises: a log collection module (1), a distributed file system (2), a log release module (3), a log analysis module (4), a data preparation module (5), and a database (6);

the log collection module (1) is configured to collect program run logs at a predetermined time and compress the collected logs;

the distributed file system (2) is configured to receive the program run logs uploaded by the servers after compression, for quality analysis;

the log release module (3) is configured to upload the program run logs compressed by the log collection module (1) to the distributed file system (2) for storage;

the log analysis module (4) is configured to slice the program run logs into multiple slice tasks and analyze the log file corresponding to each slice task;

the data preparation module (5) classifies and aggregates the analyzed log files of each slice task by request interface path and generates a data report;

the database (6) is configured to receive the analysis results compiled by the data preparation module.

7. The log analysis device for large data volumes according to claim 6, characterized in that the predetermined time is around midnight (00:00) each day.

8. The log analysis device for large data volumes according to claim 6, characterized in that the collected program run logs are compressed by using shell commands to control packaging and data synchronization, packaging being performed with the tar program and transfer being performed with rsync synchronization.

9. The log analysis device for large data volumes according to claim 6, characterized in that the length of one slice task is 30,000 to 50,000 lines of the program run log.

10. The log analysis device for large data volumes according to claim 6, characterized in that the analysis of the log files includes: