CN105718351B - A distributed monitoring and management system for Hadoop clusters - Google Patents

A distributed monitoring and management system for Hadoop clusters

Info

Publication number
CN105718351B
Authority
CN
China
Prior art keywords
module
data
monitoring
distributed
hadoop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610010050.7A
Other languages
Chinese (zh)
Other versions
CN105718351A (en
Inventor
许丹霞
刘寅
汪伟
郑宇�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaodu Information Technology Co Ltd
Original Assignee
Beijing Huishang Rongtong Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Huishang Rongtong Information Technology Co Ltd
Priority to CN201610010050.7A
Publication of CN105718351A
Application granted
Publication of CN105718351B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present invention relates to a distributed monitoring and management system for Hadoop clusters that is better suited to actual needs. It mainly comprises a performance monitoring module, a fault alarm module, a comprehensive analysis and query module, an overview display module, a data storage module, a configuration management module, and a system management module. With this system, one can understand server resource allocation, track the running state of Hadoop, raise alarms on abnormal conditions, simplify configuration of the Hadoop platform, and on that basis find system resource bottlenecks and optimize performance. The system can also be used to monitor and manage distributed clusters in other environments.

Description

A distributed monitoring and management system for Hadoop clusters
Technical field
The present invention relates to a distributed monitoring and management system for Hadoop clusters that is better suited to actual needs. With this system, one can understand server resource allocation, track the running state of Hadoop, raise alarms on abnormal conditions, simplify configuration of the Hadoop platform, and on that basis find system resource bottlenecks and optimize performance. The system can also be used to monitor and manage distributed clusters in other environments.
Background art
Unlike an ordinary computer network environment or data center, a cloud computing environment built on Hadoop has a large number of nodes and complex components and applications. Hadoop is designed to run on low-cost computers and treats failure as the norm; it covers a wide range of functions and uses a complex distributed parallel computing framework. All of this poses great challenges to the operation and maintenance of Hadoop clusters.
Many tools currently exist for monitoring and managing Hadoop, such as ZooKeeper, Ganglia, Nagios, Ambari, and Chukwa. Each tool is successful and easy to use in the area it focuses on. ZooKeeper handles the management of Hadoop platform configuration files; Ganglia and Nagios handle monitoring and alarming of distributed clusters, respectively; Ambari provides a unified solution for cluster deployment, monitoring, and management; Chukwa addresses the collection and analysis of cluster logs. Ganglia is an excellent cluster monitoring tool whose distributed monitoring works well, and it provides a complete set of functions for gathering, collecting, storing, and displaying data in a computer cluster. However, it cannot analyze logs and merely monitors the working state of the cluster. The monitoring pages that ship with Ganglia can display trends in historical data at different granularities, with customizable parameters. But the displayed parameters are numerous and exhaustive, and filtering out the information one needs requires a good understanding of Ganglia and some experience in managing and operating clusters, which is a challenge for ordinary Hadoop users. Nagios is an excellent monitoring and alarm tool. By designing plug-ins, one can monitor any content of interest and set thresholds; when a monitored value exceeds its threshold, Nagios raises an alarm by mail or short message. An excellent alarm function alone, however, does not meet our resource monitoring needs, so Nagios can only be one important part of cluster management. In addition, Ganglia and Nagios overlap in some functions; to use both tools without wasting resources, the monitoring indicators of each must be planned. Chukwa is still unstable at this stage, its installation process is complicated, and it is difficult to debug. Ambari comes closest to our needs, but in actual use we found that it also has many problems. Ambari cannot serve as a standalone monitoring and management tool: it cannot monitor a cluster that was installed independently, so Ambari must be used from the time the cluster is installed, and the role assignments made at installation must be strictly followed. Installation on different operating systems often runs into problems that cannot be resolved; that is, Ambari does not run well on every Linux operating system.
Summary of the invention
In summary, after studying the current mainstream cluster management and monitoring systems, the present invention develops a distributed monitoring and management system for Hadoop clusters that is better suited to our actual needs. With this system, one can understand server resource allocation, track the running state of Hadoop, raise alarms on abnormal conditions, simplify configuration of the Hadoop platform, and on that basis find system resource bottlenecks and optimize performance.
The object of the invention is to provide performance monitoring, fault alarming, and configuration management for the Hadoop platform, including:
1. The monitoring and alarm function mainly comprises collecting and storing all monitored basic data, and raising fault alarms.
2. The monitoring data of the system covers not only system resources and Hadoop Metrics information but also Hadoop component logs and other component logs. Because Hadoop Metrics cannot provide information we care about, such as the percentage of a job that has completed, Hadoop component logs are also a very important source of basic monitoring data. Logs typically record, after a component starts, information such as the code package invoked by each operation and the operation's execution result. Analyzing Hadoop component log information greatly helps with monitoring and optimizing the flow analysis system. Furthermore, for self-developed components that run on the cluster and are associated with Hadoop components, indicators that reflect the component's condition should be defined according to one's own needs and written to a log, so that the component's current running status and overall health can be obtained. The monitoring system can monitor the logs of these components and raise alarms as required.
3. Unified configuration of the Hadoop platform is realized by the configuration management module. When the monitoring and management system raises an alarm, the relevant personnel can modify the Hadoop platform configuration to reorganize and coordinate resources; a web interface is provided to simplify configuration.
4. Front-end data visualization is realized by the overview display module, which can show all monitoring indicators and alarm indicators in one overview for a comprehensive understanding of the platform's running state. The overly complicated display pages are simplified and consolidated to show only the important parameters and indicators of interest to general administrators and maintainers. In addition, the system lets users access the web pages directly to view other indicators of interest.
To achieve the object of the invention, the following technical solution is adopted:
A distributed monitoring and management system for Hadoop clusters that is better suited to our actual needs. It mainly comprises a performance monitoring module, a fault alarm module, a comprehensive analysis and query module, an overview display module, a configuration management module, and a system management module. Wherein:
The performance monitoring module keeps track of the latest state of the Hadoop platform at any time, finds resource bottlenecks, and improves platform efficiency. The data it can monitor includes server resources, Hadoop Metrics, Hadoop component logs, and other component logs.
The fault alarm module sends a short message or mail alarm to the relevant personnel when the performance of a computing resource reaches a bottleneck. It discovers platform faults in time to keep the platform running normally. Its basic functions include monitoring the platform state; discovering failed nodes, crashed processes, and failed services; recording fault and handling information; and, for faults of different levels, notifying administrators of different levels to handle them.
The comprehensive analysis and query module provides calculation and query services. The collected system data cannot be presented to the user directly, because the data is usually instantaneous values, while the indicators we care about must be obtained by calculation. The module reads the monitoring and alarm data in the database, performs the calculations, stores the calculated indicators in the database, and provides a query interface for each indicator.
The overview display module shows each monitoring and alarm indicator in one overview; ECharts may optionally be used to visualize the front-end data.
The configuration management module aims to simplify platform configuration, organize and coordinate computing resources, and complete the configuration of the Hadoop platform. A distributed unified configuration service can be realized based on ZooKeeper, which guarantees timeliness and data security, and a web interface is provided to simplify user configuration.
The system management module provides a web interface for maintaining user management and rights management functions. To improve system security, the configuration management function of the Hadoop platform is open only to system administrators; ordinary users have only the platform monitoring function.
In the distributed monitoring and management system described above, preferably:
The monitoring and alarm function of the monitoring and management system includes collecting and storing all monitored basic data, and raising fault alarms. Each machine in the Hadoop cluster that hosts a component to be monitored serves as a monitored node; a machine independent of the cluster, or a relatively idle machine within the cluster, is selected as the monitoring node. The system realizes the monitoring and alarm function mainly by deploying monitoring and alarm modules on the monitored nodes and the monitoring node.
In the distributed monitoring and management system described above, preferably:
The performance monitoring module can be realized based on Ganglia. The monitored data includes server resources, Hadoop Metrics, Hadoop component logs, and other component logs. The collected monitoring data is stored in RRD (Round-Robin Database) for display on the web.
In the distributed monitoring and management system described above, preferably:
The fault alarm module can be realized based on Nagios. The basic data used for fault judgment comes from two sources: the basic data that the performance monitoring module collects and stores in RRD, and the basic data reported by the alarm information collection modules. An alarm information collection module is present on each monitored node and on the monitoring node, and an alarm information core component is installed on the monitoring node. The alarm information collection module on a monitored node transfers the alarm information it collects to the alarm information core component on the monitoring node, and, according to the level and kind of the alarm information, the relevant administrators are selected to receive a short message or mail alarm. The alarm information collection module on the monitoring node scans the data in RRD, selects the relevant administrators according to the level and kind of the alarm information to receive a short message or mail alarm, and transfers the alarm information to the alarm information core component on the monitoring node. The alarm information core component stores the alarm information in a database for display on the web.
In the distributed monitoring and management system described above, preferably:
The comprehensive analysis and query module provides analysis and query services. The collected raw data is usually not suitable for direct presentation to the user, who generally cares about values calculated from the raw data; part of the data presented to the user is therefore obtained by calculation on the raw data. The module reads the monitoring and alarm data in the RRD and MySQL databases, performs the relevant calculations, stores the calculated monitoring and alarm indicators in the MySQL database, and provides a query interface for each indicator.
In the distributed monitoring and management system described above, preferably:
The overview display module shows each monitoring and alarm indicator in one overview; the present invention uses ECharts to visualize the front-end data. The overly complicated parameters and indicators presented by Ganglia are removed, and only the monitoring indicators of interest to general administrators and maintainers are shown. The overview display module also shows each alarm message collected by Nagios. The system also lets users directly access the gweb pages that ship with Ganglia to view other monitoring indicators of interest.
In the distributed monitoring and management system described above, preferably:
The configuration management module realizes a distributed unified configuration service based on ZooKeeper, which guarantees timeliness and data security. Its purpose is to simplify platform configuration: when the monitoring and management system raises an alarm, the relevant personnel can modify the Hadoop platform configuration to reorganize and coordinate resources, and a web interface is provided to simplify configuration.
In the distributed monitoring and management system described above, preferably:
The system management module provides a web interface for maintaining user management and rights management functions. To improve system security, the configuration management function of the Hadoop platform is open only to system administrators; ordinary users have only the platform monitoring function.
A distributed monitoring and management system for a distributed cluster system, comprising: a performance monitoring module, a fault alarm module, a comprehensive analysis and query module, an overview display module, a data storage module, a configuration management module, and a system management module, wherein:
The performance monitoring module is used to monitor the monitoring data of the monitored nodes of the distributed cluster system and to store the monitoring data in the data storage module;
The fault alarm module is used to raise fault alarms according to the monitoring data stored in the data storage module, or to receive alarm data sent by the monitoring node and the monitored nodes, store the received alarm data in the data storage module, and raise fault alarms according to that data;
The comprehensive analysis and query module is used to read the monitoring data or alarm data in the database, perform calculation and analysis, and store the calculated analysis results in the data storage module;
The data storage module is used to store monitoring data or alarm data;
The overview display module is used to display the analysis results of the comprehensive analysis and query module;
The system management module is used for user management and rights management;
The configuration management module is used to configure the distributed cluster system in a unified manner.
In the distributed monitoring and management system described above, preferably:
The performance monitoring module includes a collection module and a convergence module;
The collection module is used to read the monitoring data of a monitored node and to transfer the collected monitoring data to the convergence module;
The convergence module gathers the monitoring data, aggregates it, and stores it in the data storage module.
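The division of labor between the collection module and the convergence module can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `collect`, `Convergence`, and `read_metrics`, and the `(node, metric, value)` sample shape, are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Assumed sample shape: (node name, metric name, value).
Sample = Tuple[str, str, float]


def collect(node: str, read_metrics: Callable[[], Dict[str, float]]) -> List[Sample]:
    """Collection module: read the current metrics of one monitored node."""
    return [(node, name, value) for name, value in read_metrics().items()]


class Convergence:
    """Convergence module: gather samples from all nodes into one store."""

    def __init__(self) -> None:
        # Keyed by (node, metric); stands in for the data storage module.
        self.store: Dict[Tuple[str, str], List[float]] = defaultdict(list)

    def receive(self, samples: List[Sample]) -> None:
        for node, metric, value in samples:
            self.store[(node, metric)].append(value)


convergence = Convergence()
convergence.receive(collect("node-1", lambda: {"cpu": 35.0, "mem": 60.0}))
convergence.receive(collect("node-2", lambda: {"cpu": 80.0}))
convergence.receive(collect("node-1", lambda: {"cpu": 40.0}))
```

Keeping collection per node and aggregation central mirrors the described flow, in which each monitored node only reads and forwards its own data.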
In the distributed monitoring and management system described above, preferably:
The fault alarm module scans the data in the data storage module, determines the level and kind of the alarm information, and sends a short message or mail alarm; or it receives alarm data sent by the monitoring node or a monitored node, stores the received alarm data in the data storage module, and sends a short message or mail alarm according to the level and kind of the alarm data.
In the distributed monitoring and management system described above, preferably: the overview display module provides one or a combination of the following displays:
(1) Today's alarm item statistics: the current fault state of the cluster is shown intuitively as a bar chart, that is, how many failed servers, failed services, and failed components there are;
(2) Cluster server state: cluster servers are divided into three states: normal, fault, and high load;
(3) Unresolved alarm list: all unresolved alarms;
(4) Resource usage time-series charts with adjustable granularity: including CPU utilization and memory usage.
In the distributed monitoring and management system described above, preferably: the data storage module includes RRD or MySQL; monitoring data is stored in RRD, and alarm data is stored in MySQL.
A distributed monitoring and management method for a distributed cluster system, comprising the following steps:
Step 1. Monitor the monitored nodes in the distributed cluster system and store the monitoring data in the data storage module;
Step 2. Raise fault alarms according to the stored monitoring data, or receive alarm data sent by the monitoring node and the monitored nodes, store the received alarm data in the data storage module, and raise fault alarms according to that data;
Step 3. Read the monitoring data or alarm data in the data storage module, perform calculation and analysis, and save the calculated analysis results;
Step 4. Display the analysis results of the comprehensive analysis and query module;
Step 5. Perform user management and rights management;
Step 6. Configure the distributed cluster system in a unified manner.
In the distributed monitoring and management method described above, preferably:
The monitoring in step 1 includes: reading the monitoring data of the monitored nodes, and aggregating and storing the collected monitoring data.
In the distributed monitoring and management method described above, preferably:
The fault alarm in step 2 specifically comprises scanning the data in the data storage module, determining the level and kind of the alarm information, and sending a short message or mail alarm; or receiving alarm data sent by the monitoring node and the monitored nodes, storing the received alarm data in the data storage module, and sending a short message or mail alarm according to the level and kind of the alarm data.
In the distributed monitoring and management method described above, preferably: the overview display in step 4 includes one or a combination of the following displays:
(1) Today's alarm item statistics: the current fault state of the cluster is shown intuitively as a bar chart, that is, how many failed servers, failed services, and failed components there are;
(2) Cluster server state: cluster servers are divided into three states: normal, fault, and high load;
(3) Unresolved alarm list: all unresolved alarms;
(4) Resource usage time-series charts with adjustable granularity: including CPU utilization and memory usage.
In the distributed monitoring and management method described above, preferably: the data storage module includes RRD or MySQL; monitoring data is stored in RRD, and alarm data is stored in MySQL.
Brief description of the drawings
Fig. 1 is a schematic diagram of the distributed monitoring and management system for Hadoop clusters provided by the invention.
Detailed description of the embodiments
As shown in Fig. 1, the distributed monitoring and management system includes:
1. A performance monitoring module, for monitoring the performance of the distributed cluster system. The monitored data includes server resources, Hadoop Metrics, Hadoop component logs, and other component logs. The performance monitoring module includes a collection module and a convergence module. The collection module reads the monitoring data of the monitored nodes, including server resources (basic server information such as CPU, memory, hard disk, network I/O, and processes), Hadoop Metrics (including HDFS information, MapReduce information, JVM information, and information on other Hadoop components such as HBase), Hadoop component logs, and other component logs. The collection module transfers the collected monitoring data to the convergence module, which gathers the monitoring information centrally, aggregates it, and stores it in the data storage module. Preferably, the data storage module includes the round-robin database RRD (Round-Robin Database), and the data is stored in RRD.
2. A fault alarm module, for raising fault alarms, including an alarm information collection module and an alarm information core component. The fault alarm module can raise two kinds of fault alarm. In the first, the alarm information collection module scans the data stored in RRD, determines the level and kind of the alarm information according to user requirements, and selects the relevant administrators to receive a short message or mail alarm. In the second, the alarm information core component receives the alarm information sent by the alarm information collection modules of the monitoring node and the monitored nodes, stores it in a database (for example, a MySQL database) for display on the web, and, according to the level and kind of the alarm information, selects the relevant administrators to receive a short message or mail alarm.
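Selecting recipients by alarm level and kind can be sketched as a lookup table. This is a minimal illustration under stated assumptions: the level names, kind names, thresholds, recipients, and the `ROUTES` table are all hypothetical and are not taken from the patent or from Nagios.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Alarm:
    node: str
    kind: str      # assumed kinds: "server", "service", "component"
    level: str     # assumed levels: "warning", "critical"
    message: str


# Hypothetical routing table: (kind, level) -> list of (channel, recipient).
ROUTES = {
    ("server", "critical"): [("sms", "senior-admin"), ("mail", "ops-team")],
    ("server", "warning"): [("mail", "ops-team")],
    ("service", "critical"): [("sms", "hadoop-admin"), ("mail", "ops-team")],
}
DEFAULT_ROUTE = [("mail", "ops-team")]


def level_for(value: float, warning: float, critical: float) -> Optional[str]:
    """Map a scanned metric value to an alarm level, or None if healthy."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return None


def recipients(alarm: Alarm) -> List[Tuple[str, str]]:
    """Pick the notification channels and recipients for one alarm."""
    return ROUTES.get((alarm.kind, alarm.level), DEFAULT_ROUTE)


level = level_for(97.0, warning=80.0, critical=95.0)
alarm = Alarm("node-3", "server", level, "CPU usage 97%")
targets = recipients(alarm)
```

A table keyed on both kind and level keeps the escalation policy in data rather than in branching code, so adding a new category only means adding a row.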
The monitored alarm items likewise cover component information, Hadoop cluster state information, and server information. The alarm content of each category of alarm item is as shown in the table:
3. A comprehensive analysis and query module, for providing calculation and query services. The collected raw data is usually not suitable for direct presentation to the user, who generally cares about values calculated from the raw data; part of the data presented to the user is therefore obtained by calculation on the raw data. The module reads the monitoring and alarm data in the RRD and MySQL databases, performs the relevant calculations, stores the calculated monitoring and alarm indicators in the MySQL database, and provides a query interface for each category of indicator.
The calculation methods for two important indicators, CPU usage and memory usage, are given below. Hard disk, load, and network I/O values can be read directly, and percentage values are obtained by simple division. Likewise, alarm item statistics are obtained by simple summation, and the percentage of cluster servers in each state is obtained by simple division; these are not described further here. Other monitoring and alarm data can be read directly from the database.
(1) CPU usage
CPU-related data is extracted from the basic monitoring data (that is, the monitoring data collected from the monitored nodes): CPU user time (CPU time in user mode, denoted user[i]), CPU nice time (CPU time consumed by processes with a negative nice value, denoted nice[i]), CPU system time (kernel time, denoted system[i]), CPU idle time (waiting time other than hard disk I/O wait, denoted idle[i]), CPU iowait time (hard disk I/O wait time, denoted iow[i]), CPU irq time (hard interrupt time, denoted irq[i]), and CPU softirq time (soft interrupt time, denoted sirq[i]). CPU snapshots are taken at two instants, t1 and t2, separated by a short interval (for example, 1 second).
Sum all CPU time values of the first snapshot to obtain S1:
S1 = user[1] + nice[1] + system[1] + idle[1] + iow[1] + irq[1] + sirq[1]
Sum all CPU time values of the second snapshot to obtain S2:
S2 = user[2] + nice[2] + system[2] + idle[2] + iow[2] + irq[2] + sirq[2]
Calculate the CPU usage CPU_usage as the share of the interval that was not idle:
CPU_usage = 100% * (1 - (idle[2] - idle[1]) / (S2 - S1))
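The two-snapshot calculation can be sketched in code as follows. This is a minimal illustration, assuming the seven counters are cumulative and expressed in the same unit; the class and function names are hypothetical. The result is expressed as a percentage, that is, 100 times the non-idle fraction of the interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSnapshot:
    """Cumulative CPU time counters taken at one instant."""
    user: float
    nice: float
    system: float
    idle: float
    iowait: float
    irq: float
    softirq: float

    def total(self) -> float:
        # Corresponds to S1 or S2 in the text.
        return (self.user + self.nice + self.system + self.idle
                + self.iowait + self.irq + self.softirq)


def cpu_usage_percent(first: CpuSnapshot, second: CpuSnapshot) -> float:
    """Percentage of the interval between two snapshots that was not idle."""
    total_delta = second.total() - first.total()
    if total_delta <= 0:
        # No time elapsed between snapshots; report no usage rather than divide by zero.
        return 0.0
    idle_delta = second.idle - first.idle
    return 100.0 * (1.0 - idle_delta / total_delta)


t1 = CpuSnapshot(user=100, nice=0, system=50, idle=800, iowait=30, irq=10, softirq=10)
t2 = CpuSnapshot(user=120, nice=0, system=55, idle=875, iowait=30, irq=10, softirq=10)
usage = cpu_usage_percent(t1, t2)
```

In this example the total grows by 100 units and idle time by 75, so a quarter of the interval was busy.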
(2) Memory usage
Memory-related data is extracted from the basic monitoring data: mem_total (total physical memory), mem_free (free physical memory), mem_buffers (physical memory used for file buffers), and mem_cached (physical memory used for the cache).
Calculate the memory usage mem_usage:
mem_usage = 100% * (mem_total - mem_free - mem_buffers - mem_cached) / mem_total
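The memory usage formula translates directly into a small function. This is a minimal sketch; the function name is hypothetical, and all four inputs are assumed to be in the same unit (for example, kilobytes).

```python
def memory_usage_percent(mem_total: float, mem_free: float,
                         mem_buffers: float, mem_cached: float) -> float:
    """Percentage of physical memory in use, excluding buffers and cache."""
    if mem_total <= 0:
        raise ValueError("mem_total must be positive")
    used = mem_total - mem_free - mem_buffers - mem_cached
    return 100.0 * used / mem_total


mem_usage = memory_usage_percent(
    mem_total=8000, mem_free=2000, mem_buffers=500, mem_cached=1500
)
```

Subtracting buffers and cache reflects that this memory can be reclaimed by the operating system, so it is not counted as used.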
4. A data storage module, including RRD and MySQL, for storing data. RRD (Round-Robin Database) is used to store monitoring data; it stores data in a space of fixed size, and the data is kept in files with the .rrd suffix for use by the comprehensive analysis and query module. The MySQL database is used to store alarm data for use by the comprehensive analysis and query module. In addition, the MySQL database stores information related to user management, such as the user detail table, the permission table, and the role table.
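The fixed-size behavior of a round-robin store can be illustrated with a bounded queue. This is only a conceptual sketch with a hypothetical `RingStore` class; it does not read or write .rrd files, and real round-robin database tools typically also consolidate older data into coarser archives, which is omitted here.

```python
from collections import deque
from typing import Deque, List, Tuple


class RingStore:
    """Keeps at most `capacity` samples; the oldest is overwritten first."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # deque with maxlen discards the oldest entry when full.
        self._slots: Deque[Tuple[int, float]] = deque(maxlen=capacity)

    def append(self, timestamp: int, value: float) -> None:
        self._slots.append((timestamp, value))

    def values(self) -> List[Tuple[int, float]]:
        return list(self._slots)


store = RingStore(capacity=3)
for second in range(5):
    store.append(second, float(second * 10))
```

Because the capacity never grows, storage use stays constant no matter how long monitoring runs, at the cost of losing the oldest samples.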
5. A system management module, for providing user management and rights management functions and for configuring the distributed cluster. To improve system security, the configuration management function of the Hadoop distributed cluster platform is open only to system administrators; ordinary users have only the platform monitoring function.
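The split between administrators and ordinary users amounts to a role-to-permission check. This is a minimal sketch; the role names, action names, and `ROLE_PERMISSIONS` table are hypothetical and stand in for the role and permission tables mentioned for the MySQL database.

```python
from typing import Dict, FrozenSet

# Hypothetical roles and actions; only administrators may configure the platform.
ROLE_PERMISSIONS: Dict[str, FrozenSet[str]] = {
    "administrator": frozenset({"monitor", "configure"}),
    "user": frozenset({"monitor"}),
}


def is_allowed(role: str, action: str) -> bool:
    """Return True if the role may perform the action; unknown roles get nothing."""
    return action in ROLE_PERMISSIONS.get(role, frozenset())


admin_can_configure = is_allowed("administrator", "configure")
user_can_configure = is_allowed("user", "configure")
```

Denying unknown roles by default keeps the configuration function closed unless a role is explicitly granted it.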
6. An overview display module, for calling the comprehensive analysis and query module to obtain the various indicator data and visualize the analysis results. The indicators that can be displayed are as follows:
(1) Today's alarm item statistics: the current fault state of the cluster is shown intuitively as a bar chart, that is, how many failed servers, failed services, and failed components there are. Clicking "All alarms" opens the alarm list page. Today's alarm items preferably cover the fault alarms raised from 00:00 of the current day to the present, ensuring that the latest fault alarm information is provided.
(2) Cluster server state: cluster servers are divided into three states: normal, fault, and high load. The proportion of machines in each state relative to all machines in the cluster can be viewed. If a server both has a fault and is under high load, it is classified as fault.
(3) Unresolved alarm list: all unresolved alarms. Clicking a server name shows that server's resource usage in detail.
(4) Hadoop cluster state: whether the Hadoop cluster is busy can be seen at a glance. A bar chart shows the number of Map and Reduce tasks currently running and the number of Map and Reduce tasks waiting to run.
(5) HDFS capacity: HDFS capacity usage can be seen at a glance, including DFS used capacity, non-DFS used capacity, and unused capacity.
(6) MapReduce jobs currently running: the basic information, input data volume, and Map and Reduce completion percentages are listed.
(7) Resource usage time-series charts with adjustable granularity, including CPU utilization and memory usage. The granularity and monitored interval can be changed by clicking the granularity buttons at the upper left of the chart or by dragging the granularity bar below it. A list shows each server's basic machine information and current resource usage.
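Several of the displayed indicators reduce to small aggregations over stored records: counting today's alarms, classifying server states, and re-bucketing a time series to a coarser granularity. The following is a minimal sketch under stated assumptions; the record shapes, function names, and the high-load threshold are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def todays_alarm_counts(alarms: Iterable[Tuple[datetime, str]],
                        now: datetime) -> Counter:
    """Count alarms per kind from 00:00 of the current day up to `now`."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return Counter(kind for ts, kind in alarms if midnight <= ts <= now)


def server_state(has_fault: bool, load: float, high_load_threshold: float) -> str:
    """A server that is both faulty and highly loaded counts as 'fault'."""
    if has_fault:
        return "fault"
    return "high_load" if load >= high_load_threshold else "normal"


def state_percentages(states: List[str]) -> Dict[str, float]:
    """Share of servers in each state, as percentages of the whole cluster."""
    if not states:
        return {}
    return {s: 100.0 * n / len(states) for s, n in Counter(states).items()}


def downsample(samples: Iterable[Tuple[int, float]],
               bucket_seconds: int) -> List[Tuple[int, float]]:
    """Average samples into fixed-width time buckets (coarser granularity)."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(v) / len(v)) for start, v in sorted(buckets.items())]


now = datetime(2016, 1, 7, 12, 0)
alarms = [
    (datetime(2016, 1, 6, 23, 59), "server"),   # yesterday, excluded
    (datetime(2016, 1, 7, 0, 0), "server"),
    (datetime(2016, 1, 7, 9, 30), "service"),
    (datetime(2016, 1, 7, 13, 0), "component"), # after `now`, excluded
]
counts = todays_alarm_counts(alarms, now)
states = [server_state(f, l, 4.0) for f, l in
          [(False, 1.0), (False, 2.0), (True, 9.0), (False, 6.0)]]
shares = state_percentages(states)
coarse = downsample([(0, 10.0), (30, 20.0), (60, 30.0), (90, 50.0)], 60)
```

Checking the fault flag before the load value is what implements the rule that a faulty, highly loaded server is counted once, under fault.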
The invention thus provides a monitoring and management system that is better suited to actual needs. With this system, one can promptly understand server resource allocation, track the running state of the distributed cluster, raise alarms on abnormal conditions, simplify configuration of the distributed cluster, and on that basis find system resource bottlenecks and optimize performance.

Claims (10)

  1. A distributed monitoring and management system for a Hadoop cluster system, characterized by comprising: a performance monitoring module, a fault alarm module, a comprehensive analysis and query module, an overview display module, a data storage module, a configuration management module, and a system management module, wherein:
    the performance monitoring module is used to monitor the performance of each monitored node in the distributed cluster system and to store the collected monitoring data in the data storage module, the monitoring data including server resources, Hadoop Metrics, Hadoop component logs, and other component logs;
    the fault alarm module is used to raise fault alarms according to the monitoring data stored in the data storage module, or to receive alarm data transmitted by monitoring nodes and monitored nodes located in the distributed cluster system or independent of the distributed cluster system, store the received alarm data in the data storage module, and raise fault alarms according to that data; the fault alarm includes detecting the monitoring platform status, failed nodes, crashed processes, and failed services, recording fault and handling information, and notifying administrators of different levels to handle faults of different severity levels; the alarm data includes module information, Hadoop cluster status information, and server information;
    the comprehensive analysis and query module is used to read the monitoring data or alarm data in the data storage module, perform computation and analysis, and save the resulting analysis results in the data storage module;
    the data storage module is used to store the monitoring data or alarm data;
    the overview display module is used to display the analysis results of the comprehensive analysis and query module: it calls the comprehensive analysis and query module, obtains the various metric data, and visualizes the analysis results;
    the system management module is used for user management and permission management: the configuration management function of the distributed cluster Hadoop platform is open only to system administrators, and ordinary users have only the monitoring function of the platform;
    the configuration management module is used to configure the distributed cluster system in a unified manner: a distributed unified configuration service is implemented based on ZooKeeper.
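The permission rule stated for the system management module (configuration management open only to system administrators, monitoring available to ordinary users) could be sketched as follows. The role names, the action names, and the `is_allowed` helper are hypothetical and serve only to illustrate the rule; the patent does not specify how permissions are represented.

```python
from enum import Enum


class Role(Enum):
    ADMIN = "system administrator"
    USER = "ordinary user"


# Hypothetical permission table: only administrators may configure the platform.
PERMISSIONS = {
    Role.ADMIN: {"monitor", "configure"},
    Role.USER: {"monitor"},
}


def is_allowed(role, action):
    """Return True if the given role may perform the given action."""
    return action in PERMISSIONS.get(role, set())


print(is_allowed(Role.USER, "configure"))
```

Keeping the rule in a single table makes it easy to audit which role can reach the configuration function.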
  2. The distributed monitoring and management system according to claim 1, characterized in that:
    the performance monitoring module includes a collection module and an aggregation module;
    the collection module is used to read the monitoring data of the monitored nodes and to transmit the collected monitoring data to the aggregation module;
    the aggregation module gathers the monitoring data, aggregates it, and stores it in the data storage module.
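The split between collection and aggregation in this claim might look like the sketch below. It assumes each node exposes its metrics through a callable and that the aggregation side buffers records and writes them in one batch; the names `collect`, `Aggregator`, and the in-memory list standing in for the data storage module are all hypothetical.

```python
class Aggregator:
    """Aggregation step: buffer incoming records, then write them to storage in one batch."""

    def __init__(self, store):
        self.store = store  # stands in for the data storage module
        self.pending = []

    def receive(self, record):
        self.pending.append(record)

    def flush(self):
        self.store.extend(self.pending)
        self.pending.clear()


def collect(node, read_metrics):
    """Collection step: read one monitored node's metrics and tag them with the node name."""
    return {"node": node, **read_metrics()}


store = []
aggregator = Aggregator(store)
# Hypothetical metric readers; a real collector would query the node itself.
aggregator.receive(collect("node-1", lambda: {"cpu": 0.42, "mem": 0.73}))
aggregator.receive(collect("node-2", lambda: {"cpu": 0.10, "mem": 0.35}))
aggregator.flush()
print(len(store))
```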
  3. The distributed monitoring and management system according to claim 1, characterized in that:
    the fault alarm module is used to scan the data in the data storage module, determine the level and type of the alarm information, and send an SMS or e-mail alarm; or to receive the alarm data transmitted by the alarm information collection modules on the monitoring nodes and monitored nodes, store the received alarm data in the data storage module, and send an SMS or e-mail alarm according to the level and type of the alarm data.
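A minimal sketch of the scan, classify, and notify flow described in this claim follows. The thresholds, the level and type names, and the mapping from level to SMS or e-mail are assumptions chosen for illustration; the patent only states that level and type determine the notification.

```python
# Hypothetical mapping from alarm level to notification channels.
CHANNELS = {"critical": ["sms", "email"], "warning": ["email"]}


def classify(record):
    """Return (level, kind) for a stored record, or None if it raises no alarm."""
    if not record.get("alive", True):
        return ("critical", "node_down")
    if record.get("cpu", 0.0) > 0.9:  # assumed CPU threshold
        return ("warning", "high_cpu")
    return None


def scan(records):
    """Scan stored records and list the (channel, node, kind) notifications to send."""
    notifications = []
    for record in records:
        result = classify(record)
        if result is None:
            continue
        level, kind = result
        for channel in CHANNELS[level]:
            notifications.append((channel, record["node"], kind))
    return notifications


records = [
    {"node": "a", "alive": False},
    {"node": "b", "cpu": 0.95},
    {"node": "c", "cpu": 0.10},
]
print(scan(records))
```

Returning the notifications as data, rather than sending them inside `scan`, keeps the classification logic testable without an SMS or mail gateway.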
  4. The distributed monitoring and management system according to claim 1, characterized in that: the overview display module displays one of the following or a combination thereof:
    (1) today's alarm item statistics: the current fault status of the cluster is shown as a bar chart, indicating how many failed servers, failed services, and failed components there are;
    (2) cluster server status: cluster servers are divided into three states: normal, faulty, and high load;
    (3) unresolved alarm list: all unresolved alarms;
    (4) resource usage time-series chart with adjustable granularity: including CPU utilization and memory usage.
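The three server states in item (2) can be sketched as a small classification function whose counts would feed an overview chart. The 0.8 load threshold and the `alive`/`load` field names are assumptions; the patent does not define when a server counts as high load.

```python
from collections import Counter


def server_state(alive, load_ratio, high_load_threshold=0.8):
    """Classify a server as 'faulty', 'high load', or 'normal'."""
    if not alive:
        return "faulty"
    if load_ratio >= high_load_threshold:  # assumed threshold
        return "high load"
    return "normal"


# Hypothetical cluster snapshot.
servers = [
    {"name": "s1", "alive": True, "load": 0.30},
    {"name": "s2", "alive": True, "load": 0.95},
    {"name": "s3", "alive": False, "load": 0.00},
    {"name": "s4", "alive": True, "load": 0.50},
]
summary = Counter(server_state(s["alive"], s["load"]) for s in servers)
print(dict(summary))
```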
  5. The distributed monitoring and management system according to claim 1, characterized in that: the data storage module includes RRD and MySQL; the monitoring data is stored in RRD and the alarm data is stored in MySQL.
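Assuming RRD here refers to a round-robin database in the RRDtool sense, its defining property is a fixed number of slots in which the newest sample overwrites the oldest. The sketch below illustrates only that property with a bounded deque; the class `RoundRobinArchive` is hypothetical and is not the RRDtool API.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size archive: once full, each new sample evicts the oldest one."""

    def __init__(self, slots):
        self._values = deque(maxlen=slots)

    def update(self, value):
        self._values.append(value)

    def values(self):
        return list(self._values)


archive = RoundRobinArchive(slots=3)
for sample in [1, 2, 3, 4, 5]:
    archive.update(sample)
print(archive.values())
```

A bounded archive suits monitoring data because storage stays constant over time, whereas alarm records, which must be kept and queried individually, fit a relational store such as MySQL.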
  6. A distributed monitoring and management method for a Hadoop cluster system, the method being implemented by the distributed monitoring and management system according to any one of claims 1-5, characterized by comprising the following steps:
    Step 1: monitor the monitored nodes in the distributed cluster system and store the monitoring data in the data storage module, the monitoring data including server resources, Hadoop Metrics, Hadoop component logs, and other component logs;
    Step 2: raise fault alarms according to the stored monitoring data, or receive the alarm data transmitted by the alarm information collection modules on monitoring nodes and monitored nodes located in the distributed cluster system or independent of the distributed cluster system, store the received alarm data in the data storage module, and raise fault alarms according to the alarm data; the fault alarm includes detecting the monitoring platform status, failed nodes, crashed processes, and failed services, recording fault and handling information, and notifying administrators of different levels to handle faults of different severity levels; the alarm data includes module information, Hadoop cluster status information, and server information;
    Step 3: read the monitoring data or alarm data in the data storage module, perform computation and analysis, and save the resulting analysis results;
    Step 4: display the analysis results of the comprehensive analysis and query module: call the comprehensive analysis and query module, obtain the various metric data, and visualize the analysis results;
    Step 5: perform user management and permission management: the configuration management function of the distributed cluster Hadoop platform is open only to system administrators, and ordinary users have only the monitoring function of the platform;
    Step 6: configure the distributed cluster system in a unified manner: a distributed unified configuration service is implemented based on ZooKeeper.
  7. The distributed monitoring and management method according to claim 6, characterized in that:
    the monitoring in step 1 includes: reading the monitoring data of the monitored nodes, and aggregating and storing the collected monitoring data.
  8. The distributed monitoring and management method according to claim 6, characterized in that: the fault alarm in step 2 specifically comprises scanning the data in the data storage module, determining the level and type of the alarm information, and sending an SMS or e-mail alarm; or receiving the alarm data transmitted by the alarm information collection modules on the monitoring nodes and monitored nodes, storing the received alarm data in the data storage module, and sending an SMS or e-mail alarm according to the level and type of the alarm data.
  9. The distributed monitoring and management method according to claim 6, characterized in that: the overview display in step 4 includes one of the following displays or a combination thereof:
    (1) today's alarm item statistics: the current fault status of the cluster is shown as a bar chart, indicating how many failed servers, failed services, and failed components there are;
    (2) cluster server status: cluster servers are divided into three states: normal, faulty, and high load;
    (3) unresolved alarm list: all unresolved alarms;
    (4) resource usage time-series chart with adjustable granularity: including CPU utilization and memory usage.
  10. The distributed monitoring and management method according to claim 6, characterized in that: the data storage module includes RRD and MySQL; the monitoring data is stored in RRD and the alarm data is stored in MySQL.
CN201610010050.7A 2016-01-08 2016-01-08 A kind of distributed monitoring management system towards Hadoop clusters Active CN105718351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610010050.7A CN105718351B (en) 2016-01-08 2016-01-08 A kind of distributed monitoring management system towards Hadoop clusters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610010050.7A CN105718351B (en) 2016-01-08 2016-01-08 A kind of distributed monitoring management system towards Hadoop clusters

Publications (2)

Publication Number Publication Date
CN105718351A CN105718351A (en) 2016-06-29
CN105718351B true CN105718351B (en) 2018-02-09

Family

ID=56147721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610010050.7A Active CN105718351B (en) 2016-01-08 2016-01-08 A kind of distributed monitoring management system towards Hadoop clusters

Country Status (1)

Country Link
CN (1) CN105718351B (en)

Families Citing this family (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106375113B (en) * 2016-08-25 2020-01-17 新华三技术有限公司 Method, device and system for recording faults of distributed equipment
CN106407075B (en) * 2016-09-19 2019-09-13 广州视源电子科技股份有限公司 Management method and system for big data platform
CN106487597A (en) * 2016-10-26 2017-03-08 努比亚技术有限公司 A kind of service monitoring system and method based on Zookeeper
CN106453377B (en) * 2016-10-28 2021-03-02 中金云金融(北京)大数据科技股份有限公司 Block chain based distributed network intelligent monitoring system and method
CN106776288B (en) * 2016-11-25 2019-11-19 北京航空航天大学 A kind of health metric method of the distributed system based on Hadoop
CN106533792A (en) * 2016-12-12 2017-03-22 北京锐安科技有限公司 Method and device for monitoring and configuring resources
CN108255661A (en) * 2016-12-29 2018-07-06 北京京东尚科信息技术有限公司 A kind of method and system for realizing Hadoop cluster monitorings
CN107135119B (en) * 2017-04-18 2020-05-05 国网福建省电力有限公司 Business response tracking and interface state monitoring development system
CN107168847A (en) * 2017-04-21 2017-09-15 国家电网公司 The full link application monitoring method and device of a kind of support distribution formula framework
CN107483568A (en) * 2017-08-04 2017-12-15 中兴软创科技股份有限公司 It is a kind of based on cloud platform can flexible scheduling network and service monitoring system
CN107729096A (en) * 2017-09-20 2018-02-23 中国银行股份有限公司 Shunting information method and system
CN109697070B (en) * 2017-10-23 2022-02-18 中移(苏州)软件技术有限公司 Ambari-based cluster management method, device and medium
CN107908526A (en) * 2017-10-26 2018-04-13 北京人大金仓信息技术股份有限公司 Centralized large-scale cluster monitoring early-warning system based on Web
CN108111600A (en) * 2017-12-20 2018-06-01 山东浪潮云服务信息科技有限公司 A kind of data managing method and intelligent operation platform
CN108134697B (en) * 2017-12-21 2021-01-19 四川管理职业学院 Hadoop architecture cloud platform risk assessment and early warning method
CN108390907B (en) * 2018-01-09 2021-06-22 浙江航天恒嘉数据科技有限公司 Management monitoring system and method based on Hadoop cluster
CN108418710B (en) * 2018-02-09 2021-03-26 北京奇艺世纪科技有限公司 Distributed monitoring system, method and device
CN108459944A (en) * 2018-03-29 2018-08-28 中科创能实业有限公司 System operation monitoring method, device and server
CN108449438B (en) * 2018-05-22 2023-08-22 郑州云海信息技术有限公司 Cluster CDC data monitoring device, system and method
CN108959048A (en) * 2018-06-22 2018-12-07 北京优特捷信息技术有限公司 The method for analyzing performance of modular environment, device and can storage medium
CN109165137A (en) * 2018-07-27 2019-01-08 曙光信息产业(北京)有限公司 data analysis and alarm method and system
CN108763038B (en) * 2018-08-08 2022-04-12 平安科技(深圳)有限公司 Alarm data management method and device, computer equipment and storage medium
CN109298945A (en) * 2018-10-17 2019-02-01 北京京航计算通讯研究所 The monitoring of Ceph distributed storage and tuning management method towards big data platform
CN109347703B (en) * 2018-11-21 2022-05-03 中国船舶重工集团公司第七一六研究所 CPS node fault detection device and method
CN109726077A (en) * 2018-12-21 2019-05-07 中冶建筑研究总院有限公司 A kind of Enterprise Project lightweight safety management control data platform
CN109726211B (en) * 2018-12-27 2020-02-04 无锡华云数据技术服务有限公司 Distributed time sequence database
CN109885544A (en) * 2019-01-14 2019-06-14 中国海洋大学 A kind of log storing method and system towards ocean big data cluster
CN109951313B (en) * 2019-01-18 2022-04-19 长江大学 Monitoring device and method for Hadoop cloud platform
CN109886327B (en) * 2019-02-12 2021-11-19 北京奇艺世纪科技有限公司 System and method for processing Java data in distributed system
CN111694705A (en) * 2019-03-15 2020-09-22 北京沃东天骏信息技术有限公司 Monitoring method, device, equipment and computer readable storage medium
WO2021102617A1 (en) * 2019-11-25 2021-06-03 深圳晶泰科技有限公司 Multi-public cloud computing platform-oriented cluster monitoring system and monitoring method therefor
CN112104493A (en) * 2020-09-07 2020-12-18 成都精灵云科技有限公司 Acquisition and analysis system for low-delay host resource monitoring in cluster environment
CN112328445B (en) * 2020-10-27 2023-11-14 许继集团有限公司 Multi-node management system based on condul
CN112526974A (en) * 2020-12-04 2021-03-19 中国航空工业集团公司成都飞机设计研究所 Universal test data acquisition system adopting plug-in management architecture
CN112486776B (en) * 2020-12-07 2024-08-02 中国船舶集团有限公司第七一六研究所 Cluster member node availability monitoring device and method
CN112636979B (en) * 2020-12-24 2022-08-12 北京浪潮数据技术有限公司 Cluster alarm method and related device
CN112667430A (en) * 2021-01-14 2021-04-16 电子科技大学中山学院 Big data cluster management method and device
CN113626280B (en) * 2021-06-30 2024-02-09 广东浪潮智慧计算技术有限公司 Cluster state control method and device, electronic equipment and readable storage medium
CN113419925A (en) * 2021-08-25 2021-09-21 天津南大通用数据技术股份有限公司 Monitoring method and system for monitoring and alarming multiple distributed MPP clusters
CN113868099A (en) * 2021-10-20 2021-12-31 苏州中科先进技术研究院有限公司 Data monitoring system
CN114458968A (en) * 2021-12-29 2022-05-10 浙江中控技术股份有限公司 Alarm integrated management system of oil-gas long-distance pipeline
CN114584593A (en) * 2022-03-28 2022-06-03 中国电子科技集团公司第三十八研究所 Data acquisition system and method based on cluster state perception
CN114629812A (en) * 2022-03-28 2022-06-14 中国电子科技集团公司第三十八研究所 Cluster visualization system and method based on autonomous controllable platform
CN115296868A (en) * 2022-07-22 2022-11-04 联通沃音乐文化有限公司 Music operation background management system and method based on cloud computing
CN118503073B (en) * 2024-07-22 2024-10-11 浙江智臾科技有限公司 Account separating and charging method based on user-level resource tracking

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103236949A (en) * 2013-04-27 2013-08-07 北京搜狐新媒体信息技术有限公司 Monitoring method, device and system for server cluster
CN104268695A (en) * 2014-09-26 2015-01-07 武汉大学 Multi-center watershed water environment distributed cluster management system and method
CN105024877A (en) * 2015-06-01 2015-11-04 北京理工大学 Hadoop malicious node detection system based on network behavior analysis

Also Published As

Publication number Publication date
CN105718351A (en) 2016-06-29

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190724

Address after: Room 206, 2nd floor, No. 18 Keyuan Road, Daxing Economic Development Zone, 102600, Beijing

Patentee after: Beijing Xiaodunbird Information Technology Co.,Ltd.

Address before: 100028 Beijing city Daxing District Keyuan Road Economic Development Zone No. 18 Chinese creative building No. 4

Patentee before: BEIJING HUISHANG RONGTONG INFORMATION TECHNOLOGY Co.,Ltd.

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Distributed Monitoring and Management System for Hadoop Cluster

Effective date of registration: 20221028

Granted publication date: 20180209

Pledgee: Shaanxi Pharmaceutical Holding Group Paeon Pharmaceutical Co.,Ltd.

Pledgor: Beijing Xiaodunbird Information Technology Co.,Ltd.

Registration number: Y2022110000284