CN103166785A

CN103166785A - Distributed log analysis system based on Hadoop

Info

Publication number: CN103166785A
Application number: CN2011104189589A
Authority: CN
Inventors: 王专; 吴志祥; 张海龙; 马和平; 吴剑; 郭凤林; 王晓钟; 庞绍进
Original assignee: Tongcheng Network Technology Co Ltd
Current assignee: Tongcheng Network Technology Co Ltd
Priority date: 2011-12-15
Filing date: 2011-12-15
Publication date: 2013-06-19

Abstract

The invention relates to a distributed log analysis system based on Hadoop. The system is characterized by comprising at least four personal computers (PCs) that form a Hadoop cluster and at least one log server on which a log system is deployed, wherein one of the PCs is a central server providing the NameNode and job-tracking (JobTracker) functions, the other PCs are slave servers providing the DataNode and task-tracking (TaskTracker) functions, and the log server exposes an open interface to external systems. Because the log server only needs to implement the log analysis algorithm in order to use Hadoop for synchronous distributed computation, the system removes the timeliness, storage, and computation bottlenecks that arise when analyzing large amounts of log data.

Description

Distributed log analysis system based on Hadoop

Technical field

The present invention relates to a log analysis system, and in particular to a distributed log analysis system based on Hadoop.

Background technology

With the spread of computers and the rapid development of the Internet, companies bring more and more systems online, and the system access logs these systems generate grow rapidly, with massive amounts of new log data added every day. To protect system security, make troubleshooting easier, monitor how systems are running, and mine the information contained in logs to improve the user experience, it is necessary to analyze the generated logs regularly. An administrator can review the events that occurred within a given period and can also obtain knowledge by analyzing each log file. Because logs are large in volume and hard to interpret, the useful information they contain is difficult to find if administrators rely only on inspecting log records. Analyzing these massive logs with traditional techniques runs into bottlenecks in timeliness, storage, and computation.

What is Hadoop:

Hadoop is a distributed system architecture developed by the Apache Foundation. Users can develop distributed programs without understanding the underlying distributed details and can make full use of the cluster's power for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and suits applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed in streaming form.

Hadoop is a software framework for the distributed processing of massive data, and it processes that data in a reliable, efficient, and scalable way. Hadoop is reliable because it assumes that compute elements and storage will fail, so it maintains multiple copies of the working data and can redistribute processing away from failed nodes. Hadoop is efficient because it works in parallel, which speeds up processing. Hadoop is also scalable and can handle petabyte-scale data. In addition, Hadoop runs on commodity servers, so its cost is relatively low and anyone can use it.

The Hadoop framework consists of many elements. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across all storage nodes in the Hadoop cluster. The layer above HDFS (for the purposes of this document) is the MapReduce engine, which consists of JobTrackers and TaskTrackers. MapReduce comes from functional programming and is made up of two verbs, Map and Reduce: "Map" decomposes one task into multiple tasks, and "Reduce" gathers the results of the decomposed tasks to produce the final analysis result.
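The Map and Reduce steps described above can be illustrated with a minimal single-process sketch. It is not Hadoop code: `run_mapreduce`, `word_mapper`, and `count_reducer` are hypothetical names chosen for this example, and the grouping that Hadoop performs across machines is done here with an in-memory dictionary.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Apply mapper to every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    # Map phase: each record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: collapse the values collected for each key into one result.
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.split():
        yield word, 1

def count_reducer(key, values):
    # Sum the counts emitted for one word.
    return sum(values)

counts = run_mapreduce(["a b a", "b c"], word_mapper, count_reducer)
print(counts)
```

In a real cluster the map calls run on many machines at once and the grouped values are shuffled over the network before the reducers run; the sketch keeps only the data flow.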

To an external client, HDFS looks like a traditional hierarchical file system: files can be created, deleted, moved, renamed, and so on. However, the HDFS architecture is built on a specific set of nodes, which is determined by its own characteristics. These nodes include the NameNode (only one), which provides the metadata service inside HDFS, and the DataNodes, which provide storage blocks for HDFS. Because there is only one NameNode, it is a weakness of HDFS (a single point of failure).

The NameNode is software that usually runs on a dedicated machine in an HDFS instance. It manages the file system namespace and controls access by external clients. The NameNode decides how files are mapped to replicated blocks on the DataNodes. In the most common case of 3 replicas, the first replica is stored on a different node in the same rack, and the last replica is stored on a node in a different rack. Actual I/O transactions do not pass through the NameNode; only the metadata describing the mapping of files to DataNodes and blocks passes through it. When an external client sends a request to create a file, the NameNode responds with the block identifier and the IP address of the DataNode that holds the first copy of that block. The NameNode also notifies the other DataNodes that will receive copies of the block.
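A small sketch of the 3-replica placement as worded above: one copy on the node that writes the block, one on a different node in the same rack, and one on a node in a different rack. The function name `place_replicas` and the rack layout are assumptions made for illustration, the choice of node within a rack is made deterministic by sorting, and actual HDFS releases may place replicas differently.

```python
def place_replicas(local_node, racks):
    """Return three nodes for one block, following the policy described in the text.

    racks maps a rack name to the list of DataNode names in that rack.
    """
    # Find the rack that contains the node writing the block.
    local_rack = next(name for name, nodes in racks.items() if local_node in nodes)
    # Candidates on the same rack, excluding the writer itself.
    same_rack = sorted(n for n in racks[local_rack] if n != local_node)
    # Candidates on any other rack.
    other_racks = sorted(
        n for name, nodes in racks.items() if name != local_rack for n in nodes
    )
    if not same_rack or not other_racks:
        raise ValueError("need a second node on the local rack and a node on another rack")
    return [local_node, same_rack[0], other_racks[0]]

example_racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
placement = place_replicas("dn1", example_racks)
print(placement)
```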

The NameNode stores all information about the file system namespace in a file called FsImage. This file, together with a log file that records all transactions (here, the EditLog), is stored on the NameNode's local file system. The FsImage and EditLog files are also replicated, in case the files are corrupted or the NameNode system is lost.
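The relationship between the two files can be sketched as a snapshot plus a replayable transaction log. The dictionary layout and the operation names (`create`, `delete`, `rename`) below are assumptions for illustration only and do not reflect the on-disk format of FsImage or EditLog.

```python
def replay_edits(fsimage, edit_log):
    """Rebuild the current namespace from a snapshot and the logged transactions."""
    # Start from a copy of the snapshot so the input is left untouched.
    namespace = dict(fsimage)
    for op, *args in edit_log:
        if op == "create":
            path, blocks = args
            namespace[path] = list(blocks)
        elif op == "delete":
            (path,) = args
            namespace.pop(path, None)
        elif op == "rename":
            src, dst = args
            namespace[dst] = namespace.pop(src)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace

snapshot = {"/logs/a.log": ["blk_1"]}
edits = [
    ("create", "/logs/b.log", ["blk_2"]),
    ("rename", "/logs/a.log", "/logs/old.log"),
    ("delete", "/logs/b.log"),
]
current = replay_edits(snapshot, edits)
print(current)
```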

A DataNode is also software that usually runs on a dedicated machine in an HDFS instance. A Hadoop cluster contains one NameNode and a large number of DataNodes. DataNodes are usually organized into racks, and a rack connects all of its systems through a switch. Hadoop assumes that transfer between nodes within a rack is faster than transfer between nodes in different racks.

DataNodes respond to read and write requests from HDFS clients. They also respond to commands from the NameNode to create, delete, and replicate blocks. The NameNode relies on periodic heartbeat messages from each DataNode. Each message contains a block report, which the NameNode uses to verify the block mapping and other file system metadata. If a DataNode fails to send heartbeat messages, the NameNode takes remedial action and re-replicates the blocks that were lost on that node.
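The heartbeat check and the follow-up re-replication decision can be sketched as below. The function names, the timeout value, and the data layout are hypothetical; the sketch only shows the bookkeeping, not how HDFS schedules the actual copies.

```python
def find_dead_nodes(last_heartbeat, now, timeout):
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return sorted(node for node, seen in last_heartbeat.items() if now - seen > timeout)

def blocks_to_rereplicate(block_locations, dead_nodes, target_replicas):
    """Return how many extra copies each under-replicated block needs."""
    dead = set(dead_nodes)
    needed = {}
    for block, nodes in block_locations.items():
        # Only replicas on live nodes still count.
        live = [n for n in nodes if n not in dead]
        if len(live) < target_replicas:
            needed[block] = target_replicas - len(live)
    return needed

# Timestamps are in seconds; dn2 has been silent for 65 s with a 30 s timeout.
heartbeats = {"dn1": 100, "dn2": 40, "dn3": 95}
dead = find_dead_nodes(heartbeats, now=105, timeout=30)
locations = {"blk_1": ["dn1", "dn3"], "blk_2": ["dn2", "dn3"]}
missing = blocks_to_rereplicate(locations, dead, target_replicas=2)
print(dead, missing)
```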

Summary of the invention

The purpose of the present invention is to solve the above problems in the prior art by providing a distributed log analysis system based on Hadoop.

The purpose of the present invention is achieved through the following technical solution:

A distributed log analysis system based on Hadoop, wherein: at least 4 PCs are used to build a Hadoop cluster, one of which serves as the central server and provides the NameNode and job-tracking (JobTracker) functions, while the other machines are slave servers that provide the DataNode and task-tracking (TaskTracker) functions; the system further comprises at least one log server on which a log system is deployed, the log server being provided with an open interface for receiving log data produced by other systems, so that log data from different systems can be saved as log files in a unified format and preprocessed; and the log files are submitted to the Hadoop cluster.

The above distributed log analysis system based on Hadoop, wherein: the formatted log data from the log server is split into blocks by the distributed file system, and the blocks are distributed to and stored on the DataNodes in the system according to the computation tasks; each log entry in a log file forms its own key-value pair; the resulting key-value pairs undergo Map processing; the content of each log entry is then parsed to generate a new key-value pair, and these key-value pairs are held in memory awaiting merging; the NameNode assigns the DataNodes in the system to perform Reduce processing on their respective data; and the data on all DataNodes is aggregated into the final result and written to a file in the output directory.

Further, in the above distributed log analysis system based on Hadoop, the log server analyzes the previously uploaded log files by implementing the MapReduce algorithm provided in the Hadoop API.

Further, in the above distributed log analysis system based on Hadoop, the log server presents the result data produced by Hadoop MapReduce processing as reports or web pages.

The advantages of the technical solution of the present invention are mainly as follows. Because Hadoop distributes massive data files across the DataNodes, each DataNode only needs to process a small part of the data. Moreover, Hadoop's DataNodes work in parallel when performing data analysis, and the number of DataNodes can be scaled out linearly, up to cluster configurations of thousands of DataNodes. The log server only needs to implement the log analysis algorithm in order to use Hadoop's synchronous distributed computing to remove the timeliness, storage, and computation bottlenecks in massive log analysis. Hadoop's distributed computing framework can also easily be extended to the analysis of other massive data files.

The purpose, advantages, and features of the present invention are explained through the following non-limiting description of a preferred embodiment. These embodiments are only typical examples of applying the technical solution of the present invention; all technical solutions formed by equivalent replacement or equivalent transformation fall within the scope of protection claimed by the present invention.

Embodiment

In the distributed log analysis system based on Hadoop, the distinguishing feature is that at least 4 PCs are used to build a Hadoop cluster. Specifically, one of them serves as the central server and provides the NameNode and job-tracking (JobTracker) functions, while the other machines are slave servers that provide the DataNode and task-tracking (TaskTracker) functions. To simplify data processing, there is also at least one log server on which a log system is deployed. This log server is provided with an open interface for receiving log data produced by other systems, so that log data from different systems can be saved as log files in a unified format and preprocessed. The log files are then submitted to the Hadoop cluster.
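The patent does not specify the unified format, so the following is only a sketch of what the preprocessing step might look like. The two source formats (`web`, `api`), their field names, and the tab-separated output layout are all assumptions made for this example.

```python
import json

def normalize(source, raw):
    """Convert one raw log entry into a unified tab-separated record.

    Output fields: source system, client IP, requested path, elapsed milliseconds.
    """
    if source == "web":
        # Hypothetical space-separated format: "<ip> <path> <elapsed_ms>".
        ip, path, elapsed = raw.split()
    elif source == "api":
        # Hypothetical JSON format with the keys "ip", "path", and "ms".
        record = json.loads(raw)
        ip, path, elapsed = record["ip"], record["path"], record["ms"]
    else:
        raise ValueError(f"unknown source system: {source}")
    return "\t".join([source, ip, path, str(int(elapsed))])

web_line = normalize("web", "10.0.0.1 /index 120")
api_line = normalize("api", '{"ip": "10.0.0.2", "path": "/search", "ms": 300}')
print(web_line)
print(api_line)
```

Writing every entry in one layout means a single Map function can parse the files regardless of which system produced them.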

In a preferred embodiment of the present invention, the formatted log data from the log server is split into blocks by the distributed file system (HDFS), and the blocks are distributed to and stored on the DataNodes in the system according to the computation tasks. Each log entry in a log file forms its own key-value pair (sequence number, log content).

The resulting key-value pairs then undergo Map processing. The content of each log entry is parsed to generate a new key-value pair (sequence number, access IP, access time consumed, and any other content that needs to be analyzed), and these key-value pairs are held in memory awaiting merging. During this step, identical key-value pairs can be merged. The NameNode then assigns the DataNodes in the system to perform Reduce processing on their respective data. Finally, the data on all DataNodes is aggregated into the final result and written to a file in the output directory.
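The flow above can be sketched in a single process as follows. The log line layout (`<ip> <path> <elapsed_ms>`), the choice of the access IP as the grouping key, and the statistics computed in the reducer are assumptions for illustration; the patent leaves the concrete analysis to the implementer.

```python
from collections import defaultdict

def map_log(seq, line):
    """Parse one (sequence number, log content) pair into (access IP, elapsed ms)."""
    ip, _path, elapsed_ms = line.split()
    yield ip, int(elapsed_ms)

def reduce_ip(ip, values):
    """Aggregate all elapsed times recorded for one access IP."""
    return {"requests": len(values), "total_ms": sum(values)}

def analyze(lines):
    grouped = defaultdict(list)
    # Each log entry becomes its own (sequence number, content) pair.
    for seq, line in enumerate(lines, start=1):
        for key, value in map_log(seq, line):
            # Pairs with the same key are merged into one group.
            grouped[key].append(value)
    return {ip: reduce_ip(ip, values) for ip, values in grouped.items()}

log_lines = [
    "10.0.0.1 /index 120",
    "10.0.0.2 /search 300",
    "10.0.0.1 /order 80",
]
report = analyze(log_lines)
print(report)
```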

In the above process, the blocks are stored in a distributed manner on different slave servers, and each block can be copied several times and stored on different slave servers. This achieves fault tolerance and disaster tolerance.
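A minimal sketch of splitting a file into blocks and spreading the copies over slave servers. The round-robin assignment, the function names, and the tiny block size are assumptions chosen to keep the example readable; HDFS uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_replicas(num_blocks, servers, replicas):
    """Assign each block to `replicas` distinct servers in round-robin order."""
    if replicas > len(servers):
        raise ValueError("not enough servers for the requested number of copies")
    placement = {}
    for block in range(num_blocks):
        # Consecutive servers (wrapping around) hold the copies of one block.
        placement[block] = [servers[(block + r) % len(servers)] for r in range(replicas)]
    return placement

blocks = split_into_blocks(b"abcdefgh", 3)
layout = assign_replicas(len(blocks), ["s1", "s2", "s3"], replicas=2)
print(blocks)
print(layout)
```

With two copies per block, the loss of any single server in this layout still leaves one copy of every block available.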

Further, the log server analyzes the previously uploaded log files by implementing the MapReduce algorithm provided in the Hadoop API. The log server then presents the result data produced by Hadoop MapReduce processing as reports or web pages, which makes subsequent data analysis and routine review more convenient.

As the above description shows, with the present invention, because Hadoop distributes massive data files across the DataNodes, each DataNode only needs to process a small part of the data. Moreover, Hadoop's DataNodes work in parallel when performing data analysis, and the number of DataNodes can be scaled out linearly, up to cluster configurations of thousands of DataNodes. The log server only needs to implement the log analysis algorithm in order to use Hadoop's synchronous distributed computing to remove the timeliness, storage, and computation bottlenecks in massive log analysis. Hadoop's distributed computing framework can also easily be extended to the analysis of other massive data files.

Claims

1. A distributed log analysis system based on Hadoop, characterized in that: at least 4 PCs are used to build a Hadoop cluster, one of which serves as the central server and provides the NameNode and job-tracking (JobTracker) functions, while the other machines are slave servers that provide the DataNode and task-tracking (TaskTracker) functions; the system further comprises at least one log server on which a log system is deployed, the log server being provided with an open interface for receiving log data produced by other systems, so that log data from different systems can be saved as log files in a unified format and preprocessed; and the log files are submitted to the Hadoop cluster.

2. The distributed log analysis system based on Hadoop according to claim 1, characterized in that: the formatted log data from the log server is split into blocks by the distributed file system, and the blocks are distributed to and stored on the DataNodes in the system according to the computation tasks; each log entry in a log file forms its own key-value pair; the resulting key-value pairs undergo Map processing; the content of each log entry is then parsed to generate a new key-value pair, and these key-value pairs are held in memory awaiting merging; the NameNode assigns the DataNodes in the system to perform Reduce processing on their respective data; and the data on all DataNodes is aggregated into the final result and written to a file in the output directory.

3. The distributed log analysis system based on Hadoop according to claim 1, characterized in that: the log server analyzes the previously uploaded log files by implementing the MapReduce algorithm provided in the Hadoop API.

4. The distributed log analysis system based on Hadoop according to claim 1, characterized in that: the log server presents the result data produced by Hadoop MapReduce processing as reports or web pages.