CN103166785A - Distributed type log analysis system based on Hadoop - Google Patents

Distributed type log analysis system based on Hadoop

Info

Publication number
CN103166785A
Authority
CN
China
Prior art keywords
hadoop
log
data
datanode
daily record
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011104189589A
Other languages
Chinese (zh)
Inventor
王专
吴志祥
张海龙
马和平
吴剑
郭凤林
王晓钟
庞绍进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongcheng Network Technology Co Ltd
Original Assignee
Tongcheng Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongcheng Network Technology Co Ltd filed Critical Tongcheng Network Technology Co Ltd
Priority to CN2011104189589A priority Critical patent/CN103166785A/en
Publication of CN103166785A publication Critical patent/CN103166785A/en
Pending legal-status Critical Current

Landscapes

  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a distributed log analysis system based on Hadoop. The system is characterized by comprising at least four personal computers (PCs) that form a Hadoop cluster and at least one log server on which a log system is deployed. One of the PCs is a central server running the NameNode and the job tracking function; the other PCs are slave servers running DataNodes and the task tracking function. The log server exposes an open interface to external systems. By implementing only the log data analysis algorithm, the system can use Hadoop to perform synchronous distributed computation and thereby overcome the timeliness, storage, and computation bottlenecks of analyzing large volumes of logs.

Description

Distributed log analysis system based on Hadoop
Technical field
The present invention relates to a log analysis system, and in particular to a distributed log analysis system based on Hadoop.
Background
With the spread of computers and the rapid development of the Internet, companies bring more and more systems online, and the access logs those systems produce grow just as quickly, adding a massive volume of system logs every day. To protect system security, simplify fault diagnosis, monitor how systems are running, and mine the information contained in logs to improve the user experience, it is necessary to analyze the generated logs on a regular basis. An administrator can review the events that occurred within a given period, and can also extract knowledge by analyzing individual log files. Because logs are large in volume and hard to read, however, the useful information they contain is difficult to find if administrators rely only on inspecting log records. Analyzing such massive logs with traditional techniques runs into bottlenecks in timeliness, storage, and computation.
What is Hadoop:
Hadoop is a distributed system infrastructure developed by the Apache Foundation. It lets users develop distributed programs without understanding the low-level details of distribution, making full use of the power of a cluster for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware. It also provides high-throughput access to application data, which suits applications with large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream (streaming access).
Hadoop is a software framework for the distributed processing of large amounts of data, and it does this processing in a reliable, efficient, and scalable way. Hadoop is reliable because it assumes that compute elements and storage will fail, so it maintains multiple copies of working data and can redistribute processing away from failed nodes. Hadoop is efficient because it works in parallel, which speeds up processing. Hadoop is also scalable and can handle petabyte-scale data. In addition, Hadoop runs on commodity servers, so its cost is relatively low and anyone can use it. The Hadoop framework is made up of many elements. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across all the storage nodes in a Hadoop cluster. The layer above HDFS (for the purposes of this document) is the MapReduce engine, which consists of JobTrackers and TaskTrackers. MapReduce comes from functional programming. It is named after two verbs, Map and Reduce: "Map" breaks one task down into multiple tasks, and "Reduce" gathers the results of those tasks to produce the final analysis result.
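The Map and Reduce steps described here can be illustrated with a minimal in-memory sketch. It is plain Python and does not use the Hadoop API; the function names and the "<ip> <status>" line layout are assumptions made for illustration only.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, the step a framework performs between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each group of values into a single result per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def status_mapper(line):
    # Hypothetical log line layout: "<ip> <status>"
    _ip, status = line.split()
    return [(status, 1)]


def count_reducer(_key, values):
    return sum(values)


lines = ["10.0.0.1 200", "10.0.0.2 404", "10.0.0.1 200"]
result = reduce_phase(shuffle(map_phase(lines, status_mapper)), count_reducer)
print(result)
```

In a real cluster the map and reduce functions run on different machines and the grouping step moves data across the network; the sketch only shows how one job decomposes into per-record work followed by per-key aggregation.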
To an external client, HDFS looks like a traditional hierarchical file system: files can be created, deleted, moved, renamed, and so on. The architecture of HDFS, however, is built on a specific set of nodes, which follows from its own characteristics. These nodes are the NameNode (there is only one), which provides the metadata service inside HDFS, and the DataNodes, which provide storage blocks for HDFS. Because there is only one NameNode, it is a weakness of HDFS (a single point of failure).
The NameNode is software that usually runs on a dedicated machine in an HDFS instance. It manages the file system namespace and controls access by external clients. The NameNode decides whether to map files onto replicated blocks on the DataNodes. In the most common case of three replicas, the first replica is stored on a different node in the same rack, and the last replica is stored on a node in a different rack. Actual I/O transactions do not pass through the NameNode; only the metadata that describes the mapping of files to DataNodes and blocks does. When an external client sends a request to create a file, the NameNode responds with the block identifier and the IP address of the DataNode that holds the first copy of the block. The NameNode also notifies the other DataNodes that will receive copies of the block.
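The three-replica placement described in this paragraph can be sketched as follows. This is a simplified illustration, not the placement policy of any particular Hadoop release: the `topology` mapping, the node names, and the assumption that the remaining copy stays on the node that wrote the block are all hypothetical.

```python
def place_replicas(writer, topology):
    """Choose three DataNodes for one block.

    topology maps a rack name to the list of DataNode names in that rack
    (a hypothetical structure used only for this sketch).
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    # Candidates on a different node of the same rack.
    same_rack = [node for node in topology[local_rack] if node != writer]
    # Candidates on any node of a different rack.
    other_racks = [
        node
        for rack, nodes in topology.items()
        if rack != local_rack
        for node in nodes
    ]
    if not same_rack or not other_racks:
        raise ValueError("topology is too small for this placement")
    return [writer, same_rack[0], other_racks[0]]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
placement = place_replicas("dn1", topology)
print(placement)
```

Keeping one copy in another rack is what lets the block survive the loss of a whole rack, at the cost of one slower cross-rack transfer.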
The NameNode stores all information about the file system namespace in a file called FsImage. This file, together with a record of all transactions (the EditLog), is stored on the NameNode's local file system. The FsImage and EditLog files are also replicated to guard against file corruption or loss of the NameNode system.
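The relationship between a namespace image and a transaction log can be shown with a small sketch: the current namespace is the saved image with every logged transaction replayed on top of it. The tuple-based log format and the three operations below are assumptions for illustration; the real FsImage and EditLog formats are richer than this.

```python
def replay(fsimage, editlog):
    """Rebuild the current namespace from a saved image plus logged transactions.

    fsimage is a set of file paths; editlog is a list of tuples such as
    ("create", path), ("delete", path) or ("rename", src, dst).
    Both layouts are hypothetical.
    """
    namespace = set(fsimage)
    for op, *args in editlog:
        if op == "create":
            namespace.add(args[0])
        elif op == "delete":
            namespace.discard(args[0])
        elif op == "rename":
            src, dst = args
            if src in namespace:
                namespace.remove(src)
                namespace.add(dst)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


image = {"/logs/a.log"}
edits = [
    ("create", "/logs/b.log"),
    ("rename", "/logs/a.log", "/logs/old.log"),
    ("delete", "/logs/b.log"),
]
current = replay(image, edits)
print(sorted(current))
```

Because the namespace can be rebuilt only from these two files, losing both would lose the file system, which is why the text stresses keeping copies of them.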
A DataNode is likewise software that usually runs on a dedicated machine in an HDFS instance. A Hadoop cluster contains one NameNode and a large number of DataNodes. DataNodes are usually organized into racks, and a rack connects all of its systems through a switch. Hadoop assumes that transfers between nodes within a rack are faster than transfers between nodes in different racks.
DataNodes respond to read and write requests from HDFS clients. They also respond to commands from the NameNode to create, delete, and replicate blocks. The NameNode relies on periodic heartbeat messages from each DataNode. Each message contains a block report, which the NameNode uses to verify the block mapping and other file system metadata. If a DataNode fails to send heartbeat messages, the NameNode takes remedial action and re-replicates the blocks that were lost on that node.
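The heartbeat mechanism can be sketched as a small bookkeeping class. The class name, the timeout value, and the idea of passing the current time in as a number are assumptions for illustration; the sketch only shows how missed heartbeats lead to a list of blocks that need new copies.

```python
class NameNodeMonitor:
    """Tracks heartbeats and block reports (a simplified, hypothetical model)."""

    def __init__(self, timeout, replication=3):
        self.timeout = timeout          # seconds of silence before a node counts as dead
        self.replication = replication  # desired number of copies per block
        self.last_seen = {}             # DataNode name -> time of last heartbeat
        self.block_map = {}             # block id -> set of DataNodes holding it

    def heartbeat(self, node, now, block_report):
        """Record a heartbeat and replace what we know about the node's blocks."""
        self.last_seen[node] = now
        for holders in self.block_map.values():
            holders.discard(node)
        for block in block_report:
            self.block_map.setdefault(block, set()).add(node)

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)

    def blocks_to_rereplicate(self, now):
        """Map each under-replicated block to the number of copies it is missing."""
        dead = set(self.dead_nodes(now))
        missing = {}
        for block, holders in self.block_map.items():
            live = holders - dead
            if len(live) < self.replication:
                missing[block] = self.replication - len(live)
        return missing


monitor = NameNodeMonitor(timeout=30, replication=2)
monitor.heartbeat("dn1", 0, ["b1", "b2"])
monitor.heartbeat("dn2", 0, ["b1"])
monitor.heartbeat("dn3", 0, ["b2"])
monitor.heartbeat("dn2", 40, ["b1"])
monitor.heartbeat("dn3", 40, ["b2"])
print(monitor.dead_nodes(45), monitor.blocks_to_rereplicate(45))
```

Here dn1 stops reporting after time 0, so at time 45 it is treated as failed and both blocks it held are flagged as needing one more copy.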
Summary of the invention
The object of the present invention is to solve the above problems in the prior art by providing a distributed log analysis system based on Hadoop.
The object of the present invention is achieved through the following technical solution:
A distributed log analysis system based on Hadoop, comprising: at least four PCs that form a Hadoop cluster, one of which serves as the central server and runs the NameNode and the job tracking function, while the other machines are slave servers that run DataNodes and the task tracking function; and at least one log server on which a log system is deployed. The log server is provided with an open interface for receiving log data produced by other systems, can save the log data of different systems as log files in a uniform format, and preprocesses the log data. The log files are submitted to the Hadoop cluster.
In the above distributed log analysis system based on Hadoop: the log server splits the formatted log data into blocks through the distributed file system, and the blocks are distributed and stored on the DataNodes in the system according to the system's computing tasks; each log entry in the log file individually forms a key-value pair; Map processing is performed on the resulting key-value pairs; the content of each log entry is then parsed to generate a new key-value pair, and these key-value pairs are held in memory to await merging; the NameNode assigns the DataNodes in the system to perform Reduce processing on their data; and the data on all DataNodes is aggregated into the final result, which is written to a file in the output directory.
Further, in the above distributed log analysis system based on Hadoop: the log server analyzes the previously uploaded log files by implementing the MapReduce algorithm provided in the Hadoop API.
Further, in the above distributed log analysis system based on Hadoop: the log server presents the result data processed by Hadoop MapReduce as analysis results in the form of reports or web pages.
The advantages of the technical solution of the present invention are mainly as follows. Because Hadoop distributes massive data files across the DataNodes, each DataNode needs to process only a small portion of the data. Moreover, the DataNodes work in parallel during analysis, and they can be scaled out linearly, up to cluster configurations of thousands of DataNodes. The log server needs to implement only the log data analysis algorithm in order to use Hadoop for synchronous distributed computation, which resolves the timeliness, storage, and computation bottlenecks of massive log analysis. Hadoop's distributed computing framework can also be readily extended to the analysis of other massive data files.
The objects, advantages, and features of the present invention are explained through the following non-limiting description of preferred embodiments. These embodiments are only typical examples of applying the technical solution of the present invention; all technical solutions formed by equivalent replacement or equivalent transformation fall within the scope of protection claimed by the present invention.
Embodiment
The distributed log analysis system based on Hadoop is distinguished in that it comprises at least four PCs that form a Hadoop cluster. Specifically, one of them serves as the central server and runs the NameNode and the job tracking function, while the other machines are slave servers that run DataNodes and the task tracking function. In addition, to facilitate data processing, there is at least one log server on which a log system is deployed. This log server is provided with an open interface for receiving log data produced by other systems, can save the log data of different systems as log files in a uniform format, and preprocesses the log data. The log files are then submitted to the Hadoop cluster.
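The step of saving log data from different systems in a uniform format might look like the sketch below. The patent does not specify any source or target format, so the two input layouts (`web`, `api`), the field names, and the tab-separated output are all hypothetical.

```python
import json


def normalize(system, raw):
    """Convert one raw log record into a tab-separated line:
    system, client IP, elapsed milliseconds, requested path.
    Every layout used here is an assumption made for illustration.
    """
    if system == "web":
        # Hypothetical plain-text layout: "<ip> <elapsed_ms> <path>"
        ip, elapsed, path = raw.split(" ", 2)
        elapsed_ms = int(elapsed)
    elif system == "api":
        # Hypothetical JSON layout with "client", "ms" and "path" fields
        record = json.loads(raw)
        ip = record["client"]
        elapsed_ms = int(record["ms"])
        path = record["path"]
    else:
        raise ValueError(f"no parser registered for system: {system}")
    return "\t".join([system, ip, str(elapsed_ms), path])


web_line = normalize("web", "10.0.0.1 120 /index")
api_line = normalize("api", '{"client": "10.0.0.2", "ms": 35, "path": "/v1/orders"}')
print(web_line)
print(api_line)
```

Normalizing at the log server means the analysis jobs on the cluster need only one parser, regardless of how many systems feed the open interface.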
In a preferred embodiment of the present invention, the log server splits the formatted log data into blocks through the distributed file system (HDFS), and the blocks are distributed and stored on the DataNodes in the system according to the system's computing tasks. Each log entry in the log file individually forms a key-value pair (sequence number, log content).
Map processing is then performed on the resulting key-value pairs. Next, the content of each log entry is parsed to generate a new key-value pair (sequence number, plus the fields to be analyzed, such as the access IP and the access time taken), and these key-value pairs are held in memory to await merging. During this period, identical key-value pairs can be merged. The NameNode then assigns the DataNodes in the system to perform Reduce processing on their data. Finally, the data on all DataNodes is aggregated into the final result, which is written to a file in the output directory.
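The processing sequence of this embodiment (sequence-numbered entries, Map, in-memory merging, Reduce, final aggregation) can be sketched in plain Python. The tab-separated line layout and the choice of the access IP as the key for the parsed pair are assumptions made so that merging has something to aggregate; the Hadoop API is not used.

```python
from collections import defaultdict


def to_input_pairs(lines):
    """Form one (sequence number, log content) pair per log entry."""
    return list(enumerate(lines, start=1))


def map_entry(_seq, content):
    """Parse one entry into a new pair keyed by access IP.
    Assumed line layout: system, ip, elapsed_ms, path (tab separated).
    """
    _system, ip, elapsed_ms, _path = content.split("\t")
    return ip, (1, int(elapsed_ms))


def combine(pairs):
    """Merge pairs that share a key while they are held in memory."""
    merged = defaultdict(lambda: (0, 0))
    for key, (count, total) in pairs:
        old_count, old_total = merged[key]
        merged[key] = (old_count + count, old_total + total)
    return dict(merged)


def reduce_all(partials):
    """Aggregate the merged results of every node into the final report."""
    final = combine(pair for partial in partials for pair in partial.items())
    return {
        ip: {"requests": count, "avg_ms": total / count}
        for ip, (count, total) in final.items()
    }


# Log lines as they might be held by two different DataNodes.
node_a = ["web\t10.0.0.1\t100\t/", "web\t10.0.0.2\t50\t/a"]
node_b = ["web\t10.0.0.1\t300\t/b"]

partials = [
    combine(map_entry(seq, content) for seq, content in to_input_pairs(lines))
    for lines in (node_a, node_b)
]
report = reduce_all(partials)
print(report)
```

Merging on each node before the final aggregation keeps the amount of data that has to be gathered small, since each node forwards one pair per key instead of one pair per log entry.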
During the above process, the blocks are stored in a distributed manner on different slave servers, and each block can be copied several times and stored on different slave servers. This provides fault tolerance and disaster recovery.
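Splitting data into blocks and storing several copies of each block on different slave servers can be sketched as follows. The block size, the node names, and the round-robin assignment are illustrative assumptions and are far simpler than what HDFS actually does.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def assign_blocks(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct DataNodes, round robin."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(b"abcdefghij", 4)
assignment = assign_blocks(len(blocks), ["dn1", "dn2", "dn3"], replication=2)
print(blocks)
print(assignment)
```

With two copies per block on different servers, any single slave server can fail without making a block unavailable.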
Further, the log server analyzes the previously uploaded log files by implementing the MapReduce algorithm provided in the Hadoop API. The log server also presents the result data processed by Hadoop MapReduce as analysis results in the form of reports or web pages, which makes subsequent data analysis and viewing easier.
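Presenting result data as a web page could be as simple as rendering rows into an HTML table. The patent does not describe the presentation layer, so the function name, the row structure, and the column names below are hypothetical.

```python
from html import escape


def render_report(rows, columns):
    """Render a list of result rows (dicts) as a minimal HTML table."""
    head = "".join(f"<th>{escape(column)}</th>" for column in columns)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{escape(str(row[column]))}</td>" for column in columns)
        + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


page = render_report([{"ip": "10.0.0.1", "requests": 2}], ["ip", "requests"])
print(page)
```

Values are escaped before being written into the page because log content such as request paths comes from outside the system.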
As the above description shows, with the present invention, because Hadoop distributes massive data files across the DataNodes, each DataNode needs to process only a small portion of the data. Moreover, the DataNodes work in parallel during analysis, and they can be scaled out linearly, up to cluster configurations of thousands of DataNodes. The log server needs to implement only the log data analysis algorithm in order to use Hadoop for synchronous distributed computation, which resolves the timeliness, storage, and computation bottlenecks of massive log analysis. Hadoop's distributed computing framework can also be readily extended to the analysis of other massive data files.

Claims (4)

1. A distributed log analysis system based on Hadoop, characterized by comprising: at least four PCs that form a Hadoop cluster, one of which serves as the central server and runs the NameNode and the job tracking function, while the other machines are slave servers that run DataNodes and the task tracking function; and at least one log server on which a log system is deployed, wherein the log server is provided with an open interface for receiving log data produced by other systems, can save the log data of different systems as log files in a uniform format, and preprocesses the log data; and wherein the log files are submitted to the Hadoop cluster.
2. The distributed log analysis system based on Hadoop according to claim 1, characterized in that: the log server splits the formatted log data into blocks through the distributed file system, and the blocks are distributed and stored on the DataNodes in the system according to the system's computing tasks; each log entry in the log file individually forms a key-value pair; Map processing is performed on the resulting key-value pairs; the content of each log entry is then parsed to generate a new key-value pair, and these key-value pairs are held in memory to await merging; the NameNode assigns the DataNodes in the system to perform Reduce processing on their data; and the data on all DataNodes is aggregated into the final result, which is written to a file in the output directory.
3. The distributed log analysis system based on Hadoop according to claim 1, characterized in that: the log server analyzes the previously uploaded log files by implementing the MapReduce algorithm provided in the Hadoop API.
4. The distributed log analysis system based on Hadoop according to claim 1, characterized in that: the log server presents the result data processed by Hadoop MapReduce as analysis results in the form of reports or web pages.
CN2011104189589A 2011-12-15 2011-12-15 Distributed type log analysis system based on Hadoop Pending CN103166785A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104189589A CN103166785A (en) 2011-12-15 2011-12-15 Distributed type log analysis system based on Hadoop

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011104189589A CN103166785A (en) 2011-12-15 2011-12-15 Distributed type log analysis system based on Hadoop

Publications (1)

Publication Number Publication Date
CN103166785A true CN103166785A (en) 2013-06-19

Family

ID=48589538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104189589A Pending CN103166785A (en) 2011-12-15 2011-12-15 Distributed type log analysis system based on Hadoop

Country Status (1)

Country Link
CN (1) CN103166785A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399887A (en) * 2013-07-19 2013-11-20 蓝盾信息安全技术股份有限公司 Query and statistical analysis system for mass logs
CN103559118A (en) * 2013-10-12 2014-02-05 福建亿榕信息技术有限公司 Security auditing method based on aspect oriented programming (AOP) and annotation information system
CN103593385A (en) * 2013-08-14 2014-02-19 北京觅缘信息科技有限公司 Novel multi-model intelligent internet police detection method for use in big data environments
CN103856354A (en) * 2014-03-07 2014-06-11 浪潮电子信息产业股份有限公司 Method for achieving unified management of logs of cluster storage system
CN103942707A (en) * 2014-04-08 2014-07-23 北京璧合科技有限公司 Advertising effect optimizing system based on real-time bidding
CN104731796A (en) * 2013-12-19 2015-06-24 北京思博途信息技术有限公司 Data storage computing method and system
CN105224445A (en) * 2015-10-28 2016-01-06 北京汇商融通信息技术有限公司 Distributed tracking system
CN105656706A (en) * 2014-11-14 2016-06-08 北京通达无限科技有限公司 Business data processing method and device
WO2016165622A1 (en) * 2015-04-14 2016-10-20 Yi Tai Fei Liu Information Technology Llc Systems and methods for key-value stores
CN106778351A (en) * 2016-12-30 2017-05-31 中国民航信息网络股份有限公司 Data desensitization method and device
CN106789324A (en) * 2017-01-09 2017-05-31 上海轻维软件有限公司 FTP distributed acquisition methods based on MapReduce
CN106777046A (en) * 2016-12-09 2017-05-31 武汉卓尔云市集团有限公司 A kind of data analysing method based on nginx daily records
CN106897338A (en) * 2016-07-04 2017-06-27 阿里巴巴集团控股有限公司 A kind of data modification request processing method and processing device for database
CN107562796A (en) * 2017-08-02 2018-01-09 上海斐讯数据通信技术有限公司 A kind of magnanimity mobile terminal measures statistical method and device online
CN108052679A (en) * 2018-01-04 2018-05-18 焦点科技股份有限公司 A kind of Log Analysis System based on HADOOP
CN108694220A (en) * 2017-04-12 2018-10-23 普天信息技术有限公司 A kind of air quality index acquisition methods and device
CN108959445A (en) * 2018-06-13 2018-12-07 云南电网有限责任公司信息中心 Distributed information log processing method and processing device
US10250531B2 (en) 2016-10-06 2019-04-02 Microsoft Technology Licensing, Llc Bot monitoring
CN112463739A (en) * 2019-09-09 2021-03-09 山东省计算中心(国家超级计算济南中心) Data processing method and system based on ocean mode ROMS
CN113434376A (en) * 2021-06-24 2021-09-24 山东浪潮科学研究院有限公司 Web log analysis method and device based on NoSQL

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033748A (en) * 2010-12-03 2011-04-27 中国科学院软件研究所 Method for generating data processing flow codes

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033748A (en) * 2010-12-03 2011-04-27 中国科学院软件研究所 Method for generating data processing flow codes

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
胡光民、周亮、柯立新: "基于Hadoop 的网络日志分析系统研究", 《电脑知识与技术》 *
金松昌等: "基于Hadoop 的网络安全日志分析系统的设计与实现", 《全国计算机安全学术交流论文集》 *
陈文波等: "基于Hadoop 的分布式日志分析系统", 《广西大学学报:自然科学版》 *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399887A (en) * 2013-07-19 2013-11-20 蓝盾信息安全技术股份有限公司 Query and statistical analysis system for mass logs
CN103593385A (en) * 2013-08-14 2014-02-19 北京觅缘信息科技有限公司 Novel multi-model intelligent internet police detection method for use in big data environments
CN103559118A (en) * 2013-10-12 2014-02-05 福建亿榕信息技术有限公司 Security auditing method based on aspect oriented programming (AOP) and annotation information system
CN103559118B (en) * 2013-10-12 2016-02-03 福建亿榕信息技术有限公司 A kind of method for auditing safely based on AOP and annotating information system
CN104731796A (en) * 2013-12-19 2015-06-24 北京思博途信息技术有限公司 Data storage computing method and system
CN104731796B (en) * 2013-12-19 2017-12-19 秒针信息技术有限公司 Data storage computational methods and system
CN103856354A (en) * 2014-03-07 2014-06-11 浪潮电子信息产业股份有限公司 Method for achieving unified management of logs of cluster storage system
CN103942707B8 (en) * 2014-04-08 2018-06-29 璧合科技股份有限公司 Advertising results optimization system based on real time bid
CN103942707A (en) * 2014-04-08 2014-07-23 北京璧合科技有限公司 Advertising effect optimizing system based on real-time bidding
CN103942707B (en) * 2014-04-08 2018-05-01 壁合科技股份有限公司 Advertising results optimization system based on real time bid
CN105656706A (en) * 2014-11-14 2016-06-08 北京通达无限科技有限公司 Business data processing method and device
WO2016165622A1 (en) * 2015-04-14 2016-10-20 Yi Tai Fei Liu Information Technology Llc Systems and methods for key-value stores
US10740290B2 (en) 2015-04-14 2020-08-11 Jetflow Technologies Systems and methods for key-value stores
CN105224445B (en) * 2015-10-28 2017-02-15 北京汇商融通信息技术有限公司 Distributed tracking system
CN105224445A (en) * 2015-10-28 2016-01-06 北京汇商融通信息技术有限公司 Distributed tracking system
CN106897338A (en) * 2016-07-04 2017-06-27 阿里巴巴集团控股有限公司 A kind of data modification request processing method and processing device for database
US11106695B2 (en) 2016-07-04 2021-08-31 Ant Financial (Hang Zhou) Network Technology Co., Ltd. Database data modification request processing
US11132379B2 (en) 2016-07-04 2021-09-28 Ant Financial (Hang Zhou) Network Technology Co., Ltd. Database data modification request processing
US10250531B2 (en) 2016-10-06 2019-04-02 Microsoft Technology Licensing, Llc Bot monitoring
CN106777046A (en) * 2016-12-09 2017-05-31 武汉卓尔云市集团有限公司 A kind of data analysing method based on nginx daily records
CN106778351A (en) * 2016-12-30 2017-05-31 中国民航信息网络股份有限公司 Data desensitization method and device
CN106789324A (en) * 2017-01-09 2017-05-31 上海轻维软件有限公司 FTP distributed acquisition methods based on MapReduce
CN106789324B (en) * 2017-01-09 2024-03-22 上海轻维软件有限公司 FTP distributed acquisition method based on MapReduce
CN108694220A (en) * 2017-04-12 2018-10-23 普天信息技术有限公司 A kind of air quality index acquisition methods and device
CN107562796A (en) * 2017-08-02 2018-01-09 上海斐讯数据通信技术有限公司 A kind of magnanimity mobile terminal measures statistical method and device online
CN108052679A (en) * 2018-01-04 2018-05-18 焦点科技股份有限公司 A kind of Log Analysis System based on HADOOP
CN108959445A (en) * 2018-06-13 2018-12-07 云南电网有限责任公司信息中心 Distributed information log processing method and processing device
CN112463739A (en) * 2019-09-09 2021-03-09 山东省计算中心(国家超级计算济南中心) Data processing method and system based on ocean mode ROMS
CN113434376A (en) * 2021-06-24 2021-09-24 山东浪潮科学研究院有限公司 Web log analysis method and device based on NoSQL
CN113434376B (en) * 2021-06-24 2023-04-11 山东浪潮科学研究院有限公司 Web log analysis method and device based on NoSQL

Similar Documents

Publication Publication Date Title
CN103166785A (en) Distributed type log analysis system based on Hadoop
US10929428B1 (en) Adaptive database replication for database copies
US10956601B2 (en) Fully managed account level blob data encryption in a distributed storage environment
US10795905B2 (en) Data stream ingestion and persistence techniques
US10691716B2 (en) Dynamic partitioning techniques for data streams
CA2929776C (en) Client-configurable security options for data streams
US10635644B2 (en) Partition-based data stream processing framework
US20170206140A1 (en) System and method for building a point-in-time snapshot of an eventually-consistent data store
US10659225B2 (en) Encrypting existing live unencrypted data using age-based garbage collection
US9489233B1 (en) Parallel modeling and execution framework for distributed computation and file system access
Padhy Big data processing with Hadoop-MapReduce in cloud systems
CA2930026A1 (en) Data stream ingestion and persistence techniques
JP2016524750A (en) Index update pipeline
Dwivedi et al. Analytical review on Hadoop Distributed file system
US11327676B1 (en) Predictive data streaming in a virtual storage system
US11818012B2 (en) Online restore to different topologies with custom data distribution
Ma et al. Stream‐based live data replication approach of in‐memory cache
Antoniu et al. Scalable data management for map-reduce-based data-intensive applications: a view for cloud and hybrid infrastructures
Jakkula HBase or Cassandra? A comparative study of nosql database performance
Cheng et al. A practical cross-datacenter fault-tolerance algorithm in the cloud storage system
Ma et al. Live data replication approach from relational tables to schema-free collections using stream processing framework
US11757703B1 (en) Access requests processing and failover handling across multiple fault tolerance zones
Singh NoSQL: A new horizon in big data
US12007983B2 (en) Optimization of application of transactional information for a hybrid transactional and analytical processing architecture
US20240004860A1 (en) Handshake protocol for efficient exchange of transactional information for a hybrid transactional and analytical processing architecture

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130619