CN105677842A - Log analysis system based on Hadoop big data processing technique

Info

Publication number: CN105677842A
Application number: CN201610006805.6A
Authority: CN
Inventors: 许丹霞; 刘寅; 汪伟; 郑宇�
Original assignee: Beijing Huishang Rongtong Information Technology Co Ltd
Current assignee: Beijing Huishang Rongtong Information Technology Co Ltd
Priority date: 2016-01-05
Filing date: 2016-01-05
Publication date: 2016-06-15

Abstract

The invention discloses an enterprise website log analysis system developed on the Hadoop platform. The system mainly comprises a file upload module, a data cleaning module, a data statistical analysis module, a data export module and a data presentation module. Key website indicators such as page views (PV), the number of registered users, the number of IP addresses and the bounce rate can thereby be calculated, and the data presentation module can achieve millisecond-level queries over massive data.

Description

Log analysis system based on Hadoop big data processing technology

Technical field

The present invention relates to log analysis technology, and in particular to a log analysis technology based on Hadoop big data processing technology.

Background technology

Today we live in the data age, surrounded by data of every kind. This is an era of information explosion: hundreds of millions of phones and Internet users around the world generate massive amounts of data every day as people make phone calls, send text messages, chat online, upload videos and forward microblog posts. The volume of information grows geometrically every day, which poses a severe challenge to the major Internet companies on the market. They need to analyze TB-scale or even PB-scale data in order to mine, for example, the products with high sales, the pages with high popularity and the advertisements with high click volume. Traditional solutions and methods are simply not adequate for data at this scale.

The birth of Hadoop, an open-source big data processing platform under the Apache Foundation, broke through the bottleneck of the traditional data processing model and made the collection, storage and computation of massive data easier and more efficient. Hadoop is a platform for distributed data storage and processing that can be deployed on clusters of inexpensive computers. It provides a framework for the distributed storage and computation of massive data, consisting of the file system HDFS and the computing framework MapReduce. Users can make full use of the cluster's large storage capacity to hold massive data, and of its merge-split-merge computing model (merge: collect and combine the data; split: store and compute in a distributed manner; merge: combine the computed results), to develop distributed applications that achieve millisecond-level, high-speed processing of massive data. Because the platform is written in Java, an object-oriented programming language, it has good portability and extensibility. Over the course of its development, a number of excellent frameworks have grown up around it. Those widely used by enterprises, such as Flume, ZooKeeper, HBase, Pig, Hive and Sqoop, encapsulate some of the business logic and simplify the use of Hadoop.
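The merge-split-merge flow described above can be illustrated with a minimal sketch in plain Python, with no Hadoop involved. The function names `split_records`, `map_partition` and `merge_counts` and the sample records are assumptions made for illustration; they are not part of Hadoop's API.

```python
from collections import Counter
from functools import reduce


def split_records(records, parts):
    """Split: deal the collected records out to `parts` partitions."""
    return [records[i::parts] for i in range(parts)]


def map_partition(partition):
    """Compute: count the requests per page within one partition."""
    return Counter(page for _ip, page in partition)


def merge_counts(partials):
    """Merge: combine the per-partition counts into one result."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical (ip, page) records standing in for collected log data.
records = [
    ("10.0.0.1", "/index"),
    ("10.0.0.2", "/index"),
    ("10.0.0.1", "/login"),
    ("10.0.0.3", "/index"),
]
partials = [map_partition(p) for p in split_records(records, 2)]
totals = merge_counts(partials)
```

In a real cluster each partition would be processed on a different node; here the partitions are processed in turn only to show the shape of the computation.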

The traditional data processing model is limited in both storage capacity and computing power. For example, when a traditional application runs on a single computer, processing a data volume of only about 3,000 takes roughly half an hour, and CPU utilization can reach about 85%. If the hardware configuration is lower, the run takes even longer. In addition, the data must be collected, processed and cleaned manually, which consumes a great deal of manpower and material resources and is extremely inefficient. The prior art therefore struggles to meet the demands of large data volumes, and more advanced technology is needed to improve efficiency and to handle massive data.

Summary of the invention

In the traditional data processing model, the collected data are placed in a relational database. Various associations exist among the data and can even produce data dependencies. Moreover, the data are processed on a single computer, so processing efficiency is affected by factors such as the computer's configuration and the network.

The present invention is an enterprise website log analysis solution developed on the Hadoop platform. It is divided into five main modules: a file upload module, a data cleaning module, a data statistical analysis module, a data export module and a data presentation module. File upload uses the Flume framework. Data cleaning uses the MapReduce core algorithm. Statistical analysis uses the Hive framework, which can calculate the key indicators of the website, such as page views (PV), the number of registered users, the number of IP addresses and the bounce rate, to support operators' decision-making. Data export uses the Sqoop framework to export each resulting indicator to a MySQL relational database outside the cluster. Data presentation uses the ZooKeeper and HBase frameworks and can achieve millisecond-level queries over massive data.

To achieve the purpose of the present invention, the following technical solutions are adopted:

A log analysis system, comprising a file upload module, a data cleaning module, a data statistical analysis module, a data export module and a data presentation module, wherein:

The file upload module is used to upload log files: it first collects the log files and then uploads them to a distributed file system;

The data cleaning module is used to convert and clean the log file data in the distributed file system, and to store the converted and cleaned data in the distributed file system;

The data statistical analysis module is used to perform statistical analysis on the log files in the distributed file system, obtain the required statistical data, and store the statistical data in the distributed file system;

The data export module is used to export the data stored in the distributed file system to an external database;

The data presentation module is used to query the data stored in the external database and to display the query results.

In the log analysis system, preferably:

The distributed file system is HDFS;

The log files are the log files of an application cluster.

In the log analysis system, preferably:

Data cleaning includes checking data consistency and handling invalid values and missing values.

In the log analysis system, preferably:

The statistical data include PV, the number of registered users, the number of unique IP addresses and the bounce rate.

In the log analysis system, preferably:

The external database is a MySQL database.

A log analysis method, comprising the following steps:

Step 1. File upload: first collect the log files, then upload them to a distributed file system;

Step 2. Data cleaning: convert and clean the log file data in the distributed file system, and store the converted and cleaned data in the distributed file system;

Step 3. Data statistical analysis: perform statistical analysis on the log file data in the distributed file system, obtain the required statistical data, and store the statistical data in the distributed file system;

Step 4. Data export: export the data stored in the distributed file system to an external database;

Step 5. Data presentation: query the data stored in the external database and display the query results.

In the log analysis method, preferably:

The distributed file system is HDFS;

The log files are the log files of an application cluster.

In the log analysis method, preferably:

The external database is a MySQL database.

A method of building a log analysis system, comprising the following steps:

Step 1: build a distributed cluster platform comprising the following four nodes:

a metadata node, a secondary metadata node, data node 1 and data node 2;

Step 2: set up the required data frameworks on the cluster;

Step 3: create a log folder under the root directory of the Linux system on each of the four nodes above to hold the log files, and run the command to start the cluster;

Step 4: create a web log folder under the root directory of the distributed file system; the log collection module communicates with the cluster via a remote procedure call protocol; the log collection task runs as a background process and monitors the log folder; as soon as a log file arrives in the log folder, it is synchronously uploaded into the web log folder in the distributed file system;

Step 5: after the data have been uploaded successfully, perform data cleaning by starting the cleaning module; after cleaning, browse the file system as web pages from a browser to view the required data;

Step 6: after cleaning is complete, use the data statistical analysis module to perform statistical analysis on the data, creating an external table that references the data under the web log folder, including:

calculating the page views (PV);

calculating the number of registered users;

calculating the number of unique IP addresses;

calculating the number of bounces;

Step 7: store each statistic obtained in its corresponding table, then aggregate the data in those tables into a single table;

Step 8: use the data export module to export the aggregated data to an external relational database, enabling fast querying of the data.

Brief description of the drawings

Fig. 1 is a schematic diagram of the log analysis system of the present invention;

Fig. 2 is a schematic diagram of the log analysis method of the present invention.

Detailed description of the invention

As shown in Fig. 1, the log analysis system of the present invention comprises a file upload module, a data cleaning module, a data statistical analysis module, a data export module and a data presentation module.

The file upload module is used to upload log files: it first collects the log files and then uploads them to a distributed file system, such as HDFS. The log files are the log files of an application cluster.

The data cleaning module is used to convert and clean the log file data in HDFS and to place the converted and cleaned data in HDFS. Data cleaning includes checking data consistency and handling invalid values, missing values and the like. Data that do not meet the requirements are filtered out; these fall into three main categories: incomplete data, erroneous data and duplicate data.
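The three categories of unwanted data named above can be illustrated with a minimal plain-Python sketch; it is not a MapReduce job. The tab-separated record layout (ip, timestamp, url, status), the three-digit status rule and the function name `clean_records` are assumptions made for illustration; the source does not specify the log format.

```python
def clean_records(lines):
    """Drop incomplete, erroneous and duplicate log lines.

    Assumes a hypothetical tab-separated layout: ip, timestamp, url, status.
    """
    seen = set()
    cleaned = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Incomplete data: wrong field count or an empty field.
        if len(fields) != 4 or any(f == "" for f in fields):
            continue
        ip, timestamp, url, status = fields
        # Erroneous data: the status must be a three-digit number.
        if not (status.isdigit() and len(status) == 3):
            continue
        # Duplicate data: keep only the first copy of an identical record.
        key = (ip, timestamp, url, status)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned
```

In a MapReduce setting, the per-line checks would typically sit in the map phase, while duplicate removal would rely on grouping identical records by key.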

The data statistical analysis module is used to perform statistical analysis on the log file data in HDFS, obtain the required statistical data, such as PV (page views), the number of registered users, the number of unique IP addresses and the bounce rate, and store the statistical data in HDFS.
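A minimal plain-Python sketch of how these indicators could be computed from cleaned records follows. The source does not define the indicators precisely, so the definitions here are assumptions: each IP address is treated as one visit, a bounce is a visit with exactly one page view, and a registration is a request to a hypothetical `/register` URL.

```python
from collections import Counter


def site_metrics(records, register_url="/register"):
    """Compute PV, registrations, unique IPs and bounce rate.

    `records` is a list of (ip, url) pairs. Treating each IP as one visit
    and `register_url` as the registration page are assumptions.
    """
    views_per_ip = Counter(ip for ip, _url in records)
    pv = len(records)  # every record is one page view
    registrations = sum(1 for _ip, url in records if url == register_url)
    unique_ips = len(views_per_ip)
    # A bounce is counted when an IP viewed exactly one page.
    bounces = sum(1 for n in views_per_ip.values() if n == 1)
    bounce_rate = bounces / unique_ips if unique_ips else 0.0
    return {
        "pv": pv,
        "registrations": registrations,
        "unique_ips": unique_ips,
        "bounces": bounces,
        "bounce_rate": bounce_rate,
    }
```

In the system described here these figures would be produced by queries in the statistical analysis module; the function only makes the arithmetic explicit.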

The data export module is used to export the resulting data stored in HDFS to an external MySQL database.

The data presentation module is used to perform millisecond-level queries on the massive data stored in the MySQL database and to display the query results.

As shown in Fig. 2, the log analysis method of the present invention comprises file upload, data cleaning, data statistical analysis, data export and data presentation.

Step 1. File upload: first collect the log files, then upload them to a distributed file system, such as HDFS. The log files are the log files of an application cluster.

Step 2. Data cleaning: convert and clean the log file data in HDFS, and place the converted and cleaned data in HDFS. Data cleaning includes checking data consistency and handling invalid values, missing values and the like. Data that do not meet the requirements are filtered out; these fall into three main categories: incomplete data, erroneous data and duplicate data.

Step 3. Data statistical analysis: perform statistical analysis on the log file data in HDFS, obtain the required statistical data, such as PV (page views), the number of registered users, the number of unique IP addresses and the bounce rate, and store the statistical data in HDFS.

Step 4. Data export: export the resulting data stored in HDFS to an external MySQL database.

Step 5. Data presentation: perform millisecond-level queries on the massive data stored in the MySQL database and display the query results.

The method of building a log analysis system according to the present invention (in particular a log analysis system based on Hadoop big data processing technology) comprises the following steps:

Step 1: build a distributed cluster platform (for example a Hadoop cluster). It may include the following four nodes:

Server1 (Master): NameNode, JobTracker (metadata node)

Server2 (secondnamenode): SecondaryNameNode (secondary metadata node)

Server3 (slave01): DataNode, TaskTracker (data node)

Server4 (slave02): DataNode, TaskTracker (data node)

Step 2: set up the required data frameworks on the cluster, such as HBase and ZooKeeper. First start the Hadoop distributed cluster, then start the ZooKeeper cluster, and finally start the HBase cluster on the Master (metadata node).

Step 3: create a log folder (for example apache_logs) under the root directory of the Linux system on each of the four nodes above to hold the log files, and run the command to start the cluster.

Step 4: create a web_logs (web log) folder under the root directory of the HDFS file system. The log collection module (for example Flume) communicates with the cluster via RPC (remote procedure call). The log collection task runs as a background process and monitors the apache_logs folder; as soon as a log file arrives in that folder, it is synchronized into the web_logs folder in HDFS.
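The watch-and-synchronize behaviour of step 4 can be illustrated with a minimal sketch that uses local directories only. It is a stand-in for the log collection module, not Flume configuration and not an HDFS client. The function name `sync_new_logs` and the sample file are assumptions made for illustration; only the folder names apache_logs and web_logs come from the text.

```python
import shutil
import tempfile
from pathlib import Path


def sync_new_logs(source_dir, target_dir, already_synced):
    """Copy log files not seen before from source_dir into target_dir.

    Returns the names copied in this pass and records them in already_synced.
    """
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.iterdir()):
        if path.is_file() and path.name not in already_synced:
            shutil.copy2(path, target / path.name)
            already_synced.add(path.name)
            copied.append(path.name)
    return copied


# Demo with temporary local folders standing in for apache_logs and web_logs.
with tempfile.TemporaryDirectory() as root:
    src = Path(root) / "apache_logs"
    dst = Path(root) / "web_logs"
    src.mkdir()
    (src / "access.log").write_text("10.0.0.1\t/index\n")
    synced = set()
    first_pass = sync_new_logs(src, dst, synced)   # copies the new file
    second_pass = sync_new_logs(src, dst, synced)  # nothing new to copy
    copied_text = (dst / "access.log").read_text()
```

A background task would call `sync_new_logs` repeatedly, for example in a loop with a short sleep between passes; the sketch does not handle files that are still being written.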

Step 5: after the data have been uploaded successfully, data cleaning can be performed by starting the cleaning module. After cleaning, the file system can be browsed as web pages from a browser to view the required data.

Step 6: after cleaning is complete, use the data statistical analysis module (for example Hive) to perform statistical analysis on the data, creating an external table that references the data under web_logs, including:

calculating the page views (PV);

calculating the number of registered users;

calculating the number of unique IP addresses;

calculating the number of bounces;

Step 7: store each statistic obtained in its corresponding table, then aggregate the data in those tables into a single table.

Step 8: use the data export module (for example Sqoop) to export the aggregated data to the external relational database MySQL, and use HBase to achieve fast querying of the data.
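The aggregation and export of steps 7 and 8 can be illustrated with a minimal sketch that writes one summary row into a relational table. Python's built-in `sqlite3` module is used here as a stand-in for the external MySQL database, and no Sqoop is involved. The table name `daily_summary`, its columns, the function name `export_summary` and the sample values are assumptions made for illustration.

```python
import sqlite3


def export_summary(conn, day, metrics):
    """Write one day's metrics as a single row of a summary table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_summary ("
        "day TEXT PRIMARY KEY, pv INTEGER, registrations INTEGER, "
        "unique_ips INTEGER, bounces INTEGER)"
    )
    # INSERT OR REPLACE keeps the export idempotent for a given day.
    conn.execute(
        "INSERT OR REPLACE INTO daily_summary VALUES (?, ?, ?, ?, ?)",
        (day, metrics["pv"], metrics["registrations"],
         metrics["unique_ips"], metrics["bounces"]),
    )
    conn.commit()


# In-memory database and sample values, both hypothetical.
conn = sqlite3.connect(":memory:")
sample = {"pv": 5, "registrations": 1, "unique_ips": 3, "bounces": 1}
export_summary(conn, "2016-01-05", sample)
row = conn.execute(
    "SELECT pv, registrations, unique_ips, bounces "
    "FROM daily_summary WHERE day = ?",
    ("2016-01-05",),
).fetchone()
```

Keeping all indicators for a day in one row means the presentation layer can answer a query with a single primary-key lookup.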

The present invention breaks through the bottleneck of the traditional data processing model, making the collection, storage and computation of massive data easier and more efficient. It exploits the open-source nature of Hadoop and the efficiency of parallel processing. The cluster does not require expensive minicomputers; a cluster with good performance can be built from ordinary computers, making full use of the resources of each node. The approach is low in cost, and the technology is mature and stable, so building a log analysis system on a Hadoop cluster is of great significance. It not only greatly reduces costs but also places low demands on developers: a single developer may be enough to handle development and keep the cluster running. Large volumes of data can be processed in a timely manner, which improves work efficiency.

Claims

1. A log analysis system, characterized by comprising a file upload module, a data cleaning module, a data statistical analysis module, a data export module and a data presentation module;

Wherein:

The data statistical analysis module is used to perform statistical analysis on the log files in the distributed file system, obtain the required statistical data, and store the statistical data in the distributed file system;

The data export module is used to export the resulting data stored in the distributed file system to an external database;

2. The log analysis system according to claim 1, characterized in that:

The distributed file system is HDFS;

The log files are the log files of an application cluster.

3. The log analysis system according to claim 1, characterized in that:

4. The log analysis system according to claim 1, characterized in that:

5. The log analysis system according to claim 1, characterized in that:

The external database is a MySQL database.

6. A log analysis method, characterized by comprising the following steps:

Step 3. Data statistical analysis: perform statistical analysis on the log file data in the distributed file system, obtain the required statistical data, and store the statistical data in the distributed file system;

Step 4. Data export: export the resulting data stored in the distributed file system to an external database;

7. The log analysis method according to claim 6, characterized in that:

The distributed file system is HDFS;

The log files are the log files of an application cluster.

8. The log analysis method according to claim 6, characterized in that:

9. The log analysis method according to claim 6, characterized in that:

10. The log analysis method according to claim 6, characterized in that:

The external database is a MySQL database.