CN105718351B

CN105718351B - Distributed monitoring and management system for Hadoop clusters

Info

Publication number: CN105718351B
Application number: CN201610010050.7A
Authority: CN
Inventors: 许丹霞; 刘寅; 汪伟; 郑宇�
Original assignee: Beijing Huishang Rongtong Information Technology Co Ltd
Current assignee: Beijing Xiaodu Information Technology Co Ltd
Priority date: 2016-01-08
Filing date: 2016-01-08
Publication date: 2018-02-09
Anticipated expiration: 2036-01-08
Also published as: CN105718351A

Abstract

The present invention relates to a distributed monitoring and management system for Hadoop clusters, designed to better fit practical operational needs. The system mainly comprises a performance monitoring module, a fault alarm module, a comprehensive analysis and query module, an overview display module, a data storage module, a configuration management module, and a system management module. With this system, operators can see how server resources are allocated, track the running state of Hadoop, raise alarms on abnormal conditions, and simplify Hadoop platform configuration; on that basis they can locate system resource bottlenecks and optimize performance. The system can also be used to monitor and manage distributed clusters in other environments.

Description

Distributed monitoring and management system for Hadoop clusters

Technical field

The present invention relates to a distributed monitoring and management system for Hadoop clusters that better fits practical needs. With this system, operators can see how server resources are allocated, track the running state of Hadoop, raise alarms on abnormal conditions, and simplify Hadoop platform configuration; on that basis they can locate system resource bottlenecks and optimize performance. The system can also be used to monitor and manage distributed clusters in other environments.

Background technology

Unlike an ordinary computer network environment or data center, a cloud computing environment built on Hadoop has a large number of nodes and complex components and applications. Hadoop is designed to run on low-cost computers and treats failure as the normal case. It also covers a wide range of functions and uses a complex distributed parallel computing framework. All of this makes operating and maintaining a Hadoop cluster a great challenge.

Many tools currently exist for monitoring and managing Hadoop, such as ZooKeeper, Ganglia, Nagios, Ambari, and Chukwa. Each of them is successful and convenient in the field it focuses on. ZooKeeper manages configuration files, Ganglia and Nagios handle monitoring and alarming for distributed clusters respectively, Ambari provides a unified solution for cluster deployment and monitoring management, and Chukwa addresses the collection and analysis of cluster logs.

ZooKeeper concentrates on managing the configuration files of the Hadoop platform. Ganglia is an excellent cluster monitoring tool with outstanding distributed monitoring capability; it provides a full set of functions for collecting, aggregating, storing, and displaying data in a computer cluster. However, it cannot analyze logs and can only monitor the working state of the cluster. The monitoring pages bundled with Ganglia can show the trend of historical data over time at different granularities and support custom parameters. But the parameters displayed are numerous and exhaustive, and filtering out the information one actually needs requires a good understanding of Ganglia and some experience in managing and operating clusters. This is a challenge for ordinary Hadoop users.

Nagios is an excellent monitoring and alarm tool. By writing plug-ins, users can monitor anything they care about and set thresholds; when a monitored value exceeds its threshold, Nagios raises an alarm by email or SMS. Its alarm function, however good, cannot meet our need for resource monitoring, so it can serve only as one important part of cluster management. In addition, Ganglia and Nagios overlap in some functions; to use both tools without wasting resources, the monitoring indicators of the two must be planned carefully.

Chukwa is still unstable at this stage, its installation process is complicated, and it is hard to debug. The tool closest to our needs is Ambari, but in actual use we found that Ambari also has many problems. Ambari cannot be used as a standalone monitoring and management tool and cannot monitor a cluster that was installed independently: it must be used when the cluster is installed, and the role assignments made during installation must be strictly observed. Installation on different operating systems frequently runs into problems that cannot be resolved; that is, Ambari does not run well on every Linux operating system.

Summary of the invention

In summary, having surveyed the current mainstream cluster management and monitoring systems, the present invention develops a distributed monitoring and management system for Hadoop clusters that better fits practical needs. With this system, operators can see how server resources are allocated, track the running state of Hadoop, raise alarms on abnormal conditions, and simplify Hadoop platform configuration; on that basis they can locate system resource bottlenecks and optimize performance.

The object of the present invention is to provide performance monitoring, fault alarming, and configuration management for the Hadoop platform, including:

1. The monitoring and alarm function mainly includes collecting and storing all monitored basic data, and raising fault alarms.

2. The monitoring data of the system comprise not only system resources and Hadoop Metrics information but also Hadoop component logs and the logs of other components. Because Hadoop Metrics cannot provide information we care about, such as the completion percentage of a job, Hadoop component logs are also a very important source of basic monitoring data. After a component starts, its log typically records information such as the code package invoked by each operation and the result of the operation. Analyzing Hadoop component logs is of great help in monitoring and optimizing the flow analysis system. Furthermore, for self-developed components that run on the cluster and are associated with Hadoop components, indicators that reflect the component's condition should be defined according to one's own needs and exported through logs, so that the current running state and overall health of the component can be obtained. The monitoring system can monitor these component logs and raise alarms as required.

3. A unified configuration service for the Hadoop platform is implemented by the configuration management module. When the monitoring and management system raises an alarm, the relevant personnel can modify the configuration of the Hadoop platform to reorganize and coordinate resources, and a web interface is provided to simplify configuration operations.

4. The overview display module visualizes the front-end data. It can show all monitoring and alarm indicators at a glance, giving a comprehensive view of the running state of the platform. It simplifies and consolidates the otherwise cumbersome display pages, showing only the important parameters and indicators that general administrators and maintainers are interested in. In addition, the system lets users access the web pages directly to view other indicators they care about.
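Item 2 above says that a self-developed component should define its own indicators and export them through its log. The following is a minimal sketch of how a monitoring process might read such indicators back. The log line format (`metric <name>=<number>`), the indicator names, and the function `latest_metrics` are all hypothetical and are not taken from the patent or from any Hadoop component.

```python
import re
from typing import Dict, Iterable

# Hypothetical log format for a self-developed component:
#   "<date> <time> INFO metric <name>=<number>"
METRIC_LINE = re.compile(r"\bmetric\s+(?P<name>\w+)=(?P<value>-?\d+(?:\.\d+)?)\s*$")


def latest_metrics(lines: Iterable[str]) -> Dict[str, float]:
    """Return the most recent value of each indicator found in the log lines."""
    latest: Dict[str, float] = {}
    for line in lines:
        match = METRIC_LINE.search(line)
        if match:
            # Later lines overwrite earlier ones, so the last value wins.
            latest[match.group("name")] = float(match.group("value"))
    return latest


log = [
    "2016-01-08 10:00:00 INFO metric job_progress=40",
    "2016-01-08 10:00:05 INFO starting reducer",
    "2016-01-08 10:00:10 INFO metric job_progress=42.5",
    "2016-01-08 10:00:10 INFO metric queue_length=7",
]
metrics = latest_metrics(log)
```

Lines that do not carry an indicator are ignored, so ordinary log messages can be mixed freely with indicator lines.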

To achieve the object of the present invention, the following technical solution is adopted:

A distributed monitoring and management system for Hadoop clusters that better fits practical needs. It mainly comprises a performance monitoring module, a fault alarm module, a comprehensive analysis and query module, an overview display module, a configuration management module, and a system management module, wherein:

Performance monitoring module: keeps track of the latest state of the Hadoop platform at all times, finds resource bottlenecks, and improves the operating efficiency of the platform. The data that can be monitored include server resources, Hadoop Metrics, Hadoop component logs, and other component logs.

Fault alarm module: when the performance of computing resources reaches a bottleneck, it alerts the relevant personnel by SMS or email. It discovers platform faults promptly so that the platform keeps running normally. Its basic functions include monitoring the platform state, discovering faulty nodes, crashed processes, and failed services, recording fault and handling information, and notifying administrators of different levels to handle faults of different grades.

Comprehensive analysis and query module: provides calculation and query services. The collected system data cannot be presented to users directly, because they are usually instantaneous values, whereas the indicator data of interest must be obtained by calculation. The module reads the monitoring and alarm data from the database, performs the calculations, stores the calculated indicators in the database, and provides a query interface for each indicator.

Overview display module: shows all monitoring and alarm indicators at a glance. ECharts may optionally be used to visualize the front-end data.

Configuration management module: aims to simplify platform configuration operations, organize and coordinate computing resources, and complete the configuration of the Hadoop platform. A distributed unified configuration service can be implemented on the basis of ZooKeeper, which ensures timeliness and data security, and a web interface is provided to simplify user configuration operations.

System management module: provides a web interface for maintaining the user management and permission management functions. To improve system security, the configuration management function of the Hadoop platform is open only to system administrators; ordinary users have only the monitoring function for the platform.

In the distributed monitoring and management system, preferably:

The monitoring and alarm function of the monitoring and management system includes collecting and storing all monitored basic data, and raising fault alarms. Each component in the Hadoop cluster that needs to be monitored serves as a monitored node, and a machine independent of the cluster, or a relatively idle machine within the cluster, is selected as the monitoring node. The system implements the monitoring and alarm function mainly by deploying monitoring and alarm modules on the monitored nodes and on the monitoring node.

In the distributed monitoring and management system, preferably:

The performance monitoring module can be implemented on the basis of Ganglia. The monitored data include server resources, Hadoop Metrics, Hadoop component logs, and other component logs. The collected monitoring data are stored in an RRD (Round-Robin Database) for use by the web display.

In the distributed monitoring and management system, preferably:

The fault alarm module can be implemented on the basis of Nagios. The basic data used for fault judgment come from two sources: one is the basic data collected by the performance monitoring module and stored in the RRD; the other is the basic data reported by the alarm information collection modules. An alarm information collection module is deployed on each monitored node and on the monitoring node, and the alarm information core component is installed on the monitoring node. The alarm information collection module on a monitored node transmits the alarm information it collects to the alarm information core component on the monitoring node, and the relevant administrators are selected according to the level and kind of the alarm information and alerted by SMS or email. The alarm information collection module on the monitoring node scans the data in the RRD, selects the relevant administrators according to the level and kind of the alarm information, alerts them by SMS or email, and transmits the alarm information to the alarm information core component on the monitoring node. The alarm information core component stores the alarm information in a database (DB) for use by the web display.

In the distributed monitoring and management system, preferably:

The comprehensive analysis and query module provides analysis and query services. The raw data collected are usually not suitable for showing to users directly; what users generally care about are values calculated from the raw data. Part of the data presented to users is therefore obtained by calculation on the raw data. The module reads the monitoring and alarm data from the RRD and the MySQL database, performs the relevant calculations, stores the calculated monitoring and alarm indicators in the MySQL database, and provides a query interface for each indicator.

In the distributed monitoring and management system, preferably:

The overview display module shows all monitoring and alarm indicators at a glance; the present invention uses ECharts to visualize the front-end data. The cumbersome parameters and indicators presented by Ganglia are removed, and only the monitoring indicators that general administrators and maintainers are interested in are shown. The overview display module also shows the alarm messages collected by Nagios. The system still lets users directly access the gweb pages bundled with Ganglia to view other monitoring indicators they care about.

In the distributed monitoring and management system, preferably:

The configuration management module implements a distributed unified configuration service on the basis of ZooKeeper, which ensures both timeliness and data security. Its purpose is to simplify platform configuration operations: when the monitoring and management system raises an alarm, the relevant personnel can modify the configuration of the Hadoop platform to reorganize and coordinate resources, and a web interface is provided to simplify configuration operations.

In the distributed monitoring and management system, preferably:

A distributed monitoring and management system for a distributed cluster system comprises a performance monitoring module, a fault alarm module, a comprehensive analysis and query module, an overview display module, a data storage module, a configuration management module, and a system management module, wherein:

The performance monitoring module is configured to monitor the monitoring data of the monitored nodes of the distributed cluster system and to store the monitoring data in the data storage module;

The fault alarm module is configured to raise fault alarms according to the monitoring data stored in the data storage module, or to receive alarm data transmitted by the monitoring node and the monitored nodes, store the received alarm data in the data storage module, and raise fault alarms according to that data;

The comprehensive analysis and query module is configured to read the monitoring data or alarm data from the database, perform calculation and analysis, and store the analysis results in the data storage module;

The data storage module is configured to store monitoring data or alarm data;

The overview display module is configured to display the analysis results of the comprehensive analysis and query module;

The system management module is configured to perform user management and permission management;

The configuration management module is configured to perform unified configuration of the distributed cluster system.

In the distributed monitoring and management system, preferably:

The performance monitoring module comprises a collection module and an aggregation module;

The collection module is configured to read the monitoring data of the monitored nodes and to transmit the collected monitoring data to the aggregation module;

The aggregation module gathers the monitoring data and stores the aggregate in the data storage module.

In the distributed monitoring and management system, preferably:

The fault alarm module scans the data in the data storage module, determines the level and kind of the alarm information, and sends an SMS or email alarm; or it receives alarm data transmitted by the monitoring node or the monitored nodes, stores the received alarm data in the data storage module, and sends an SMS or email alarm according to the level and kind of the alarm data.

In the distributed monitoring and management system, preferably, the overview display module displays one of the following or a combination thereof:

(1) Today's alarm item statistics: the current fault state of the cluster is shown intuitively as a bar chart, indicating how many servers, services, and components are faulty;

(2) Cluster server state: cluster servers are divided into three states: normal, faulty, and high load;

(3) Unresolved alarm list: all unresolved alarms;

(4) Resource usage time-series charts with adjustable granularity, including CPU utilization and memory utilization.

In the distributed monitoring and management system, preferably, the data storage module includes an RRD or MySQL; monitoring data are stored in the RRD and alarm data are stored in MySQL.

A distributed monitoring and management method for a distributed cluster system comprises the following steps:

Step 1. Monitor the monitored nodes in the distributed cluster system and store the monitoring data in the data storage module;

Step 2. Raise fault alarms according to the stored monitoring data, or receive alarm data transmitted by the monitoring node and the monitored nodes, store the received alarm data in the data storage module, and raise fault alarms according to that data;

Step 3. Read the monitoring data or alarm data from the data storage module, perform calculation and analysis, and save the analysis results;

Step 4. Display the analysis results of the comprehensive analysis and query module;

Step 5. Perform user management and permission management;

Step 6. Perform unified configuration of the distributed cluster system.

In the distributed monitoring and management method, preferably:

The monitoring in step 1 includes reading the monitoring data of the monitored nodes, and aggregating and storing the collected monitoring data.

In the distributed monitoring and management method, preferably:

The fault alarming in step 2 specifically consists of scanning the data in the data storage module, determining the level and kind of the alarm information, and sending an SMS or email alarm; or receiving alarm data transmitted by the monitoring node and the monitored nodes, storing the received alarm data in the data storage module, and sending an SMS or email alarm according to the level and kind of the alarm data.

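In the distributed monitoring and management method, preferably, the overview display in step 4 includes one of the following or a combination thereof:

(1) Today's alarm item statistics: the current fault state of the cluster is shown as a bar chart, indicating how many servers, services, and components are faulty;

(2) Cluster server state: cluster servers are divided into three states: normal, faulty, and high load;

(3) Unresolved alarm list: all unresolved alarms;

(4) Resource usage time-series charts with adjustable granularity, including CPU utilization and memory utilization.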

In the distributed monitoring and management method, preferably, the data storage module includes an RRD or MySQL; monitoring data are stored in the RRD and alarm data are stored in MySQL.

Brief description of the drawings

Fig. 1 is a schematic diagram of the distributed monitoring and management system for Hadoop clusters provided by the invention.

Embodiment

As shown in Fig. 1, the distributed monitoring and management system includes:

1. Performance monitoring module, for monitoring the performance of the distributed cluster system. The monitored data include server resources, Hadoop Metrics, Hadoop component logs, and other component logs. The performance monitoring module comprises a collection module and an aggregation module. The collection module reads the monitoring data of the monitored nodes, including server resources (basic server information such as CPU, memory, hard disk, network I/O, and processes), Hadoop Metrics (including HDFS information, MapReduce information, JVM information, and information on other Hadoop components such as HBase), Hadoop component logs, and other component logs. The collection module transmits the collected monitoring data to the aggregation module, which gathers the monitoring information centrally and stores the aggregate in the data storage module. Preferably, the data storage module includes a round-robin database (RRD), which stores these data.
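The split between a per-node collection step and a central step that gathers the samples can be sketched as follows. This is only an illustration under stated assumptions: the `Sample` tuple, the `collect` function, and the `Aggregator` class are hypothetical names, the readings are made-up numbers, and an in-memory dictionary stands in for the data storage module.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str, float]  # (node, metric, value)


def collect(node: str, readings: Dict[str, float]) -> List[Sample]:
    """Collection step: turn one node's raw readings into tagged samples."""
    return [(node, metric, value) for metric, value in sorted(readings.items())]


class Aggregator:
    """Central step: gather samples from all nodes into a single store."""

    def __init__(self) -> None:
        # Stand-in for the data storage module, keyed by (node, metric).
        self.store: Dict[Tuple[str, str], List[float]] = defaultdict(list)

    def receive(self, samples: List[Sample]) -> None:
        for node, metric, value in samples:
            self.store[(node, metric)].append(value)


agg = Aggregator()
agg.receive(collect("node1", {"cpu": 35.0, "mem": 60.0}))
agg.receive(collect("node2", {"cpu": 80.0}))
agg.receive(collect("node1", {"cpu": 45.0}))
```

Keeping collection on the monitored nodes and storage on one central node means the monitored nodes need no knowledge of how or where the data are persisted.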

2. Fault alarm module, for raising fault alarms, comprising alarm information collection modules and an alarm information core component. The fault alarm module can raise fault alarms in two ways. In the first, an alarm information collection module scans the data stored in the RRD, determines the level and kind of the alarm information according to user requirements, and selects the relevant administrators to alert by SMS or email. In the second, the alarm information core component receives the alarm information sent by the alarm information collection modules of the monitoring node and the monitored nodes and stores it in a database, for example a MySQL database, for use by the web display; it then selects the relevant administrators according to the level and kind of the alarm information and alerts them by SMS or email.
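The first alarm path, in which stored samples are scanned, an alarm level and kind are determined, and recipients are chosen by level, can be sketched as below. The thresholds, the two level names, the routing table, and the e-mail addresses are all hypothetical values chosen for illustration; the patent does not specify them, and actual delivery by SMS or email is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Alarm:
    node: str
    kind: str    # e.g. "cpu" or "memory"
    level: str   # "warning" or "critical"
    value: float


# Hypothetical thresholds in percent: (warning, critical) per alarm kind.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "cpu": (80.0, 95.0),
    "memory": (85.0, 95.0),
}

# Hypothetical routing table: who is notified for each alarm level.
RECIPIENTS: Dict[str, List[str]] = {
    "warning": ["operator@example.com"],
    "critical": ["operator@example.com", "lead@example.com"],
}


def evaluate(node: str, kind: str, value: float) -> Optional[Alarm]:
    """Return an Alarm if the value crosses a threshold, otherwise None."""
    warning, critical = THRESHOLDS[kind]
    if value >= critical:
        return Alarm(node, kind, "critical", value)
    if value >= warning:
        return Alarm(node, kind, "warning", value)
    return None


def route(alarm: Alarm) -> List[str]:
    """Select the administrators to notify for this alarm's level."""
    return RECIPIENTS[alarm.level]


samples = [("node1", "cpu", 97.0), ("node2", "cpu", 40.0), ("node3", "memory", 90.0)]
alarms = [a for a in (evaluate(*s) for s in samples) if a is not None]
```

Separating `evaluate` from `route` keeps the threshold policy independent of the notification policy, so either table can be changed without touching the other.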

The items covered by monitoring and alarming are likewise component information, Hadoop cluster state information, and server information. The alarm content of each kind of alarm item is shown in the following table:

3. Comprehensive analysis and query module, for providing calculation and query services. The raw data collected are usually not suitable for showing to users directly; what users generally care about are values calculated from the raw data. Part of the data presented to users is therefore obtained by calculation on the raw data. The module reads the monitoring and alarm data from the RRD and the MySQL database, performs the relevant calculations, stores the calculated monitoring and alarm indicators in the MySQL database, and provides a query interface for each kind of indicator.

The calculation methods for two important indicators, CPU usage and memory usage, are given below. The values for hard disk, load, and network I/O can be fetched directly, and percentage values can be obtained by simple division. Likewise, alarm item statistics can be obtained by simple summation, and cluster server state percentages by simple division; these are not described further here. Other monitoring and alarm data can be read directly from the database.

(1) CPU usage

The CPU-related data are extracted from the basic monitoring data (that is, the monitoring data collected from the monitored nodes): CPU user time (CPU time in user mode, denoted user[i]), CPU nice time (CPU time consumed by niced processes, denoted nice[i]), CPU system time (kernel time, denoted system[i]), CPU idle time (waiting time other than hard disk I/O waiting, denoted idle[i]), CPU iowait time (hard disk I/O waiting time, denoted iow[i]), CPU irq time (hardware interrupt time, denoted irq[i]), and CPU softirq time (software interrupt time, denoted sirq[i]). CPU snapshots are taken at two sampling instants separated by a short interval (for example 1 second), denoted t1 and t2.

All CPU time values of the first snapshot are summed to obtain S1:

S1 = user[1] + nice[1] + system[1] + idle[1] + iow[1] + irq[1] + sirq[1]

All CPU time values of the second snapshot are summed to obtain S2:

S2 = user[2] + nice[2] + system[2] + idle[2] + iow[2] + irq[2] + sirq[2]

The CPU usage CPU_usage, as a percentage, is then calculated:

CPU_usage = 100 * (1 - (idle[2] - idle[1]) / (S2 - S1))
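The two-snapshot calculation can be sketched in code as follows, assuming the result is meant as a percentage of non-idle time over the interval, that is 100 * (1 - idle delta / total delta). The `CpuSnapshot` class, the function name, and the sample counter values are illustrative assumptions; the field names mirror the seven quantities listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSnapshot:
    """Cumulative CPU time counters at one sampling instant."""
    user: int
    nice: int
    system: int
    idle: int
    iowait: int
    irq: int
    softirq: int

    def total(self) -> int:
        # S = user + nice + system + idle + iowait + irq + softirq
        return (self.user + self.nice + self.system + self.idle
                + self.iowait + self.irq + self.softirq)


def cpu_usage_percent(first: CpuSnapshot, second: CpuSnapshot) -> float:
    """Return CPU usage in percent between two snapshots."""
    total_delta = second.total() - first.total()
    if total_delta <= 0:
        # Assumption: report 0 % for an empty or inconsistent interval.
        return 0.0
    idle_delta = second.idle - first.idle
    return 100.0 * (1.0 - idle_delta / total_delta)


# Made-up counters: S1 = 1000, S2 = 1200, idle grows by 100.
t1 = CpuSnapshot(user=100, nice=0, system=50, idle=800, iowait=30, irq=10, softirq=10)
t2 = CpuSnapshot(user=160, nice=0, system=70, idle=900, iowait=40, irq=15, softirq=15)
usage = cpu_usage_percent(t1, t2)  # 100 * (1 - 100 / 200)
```

Note that with this definition I/O wait time counts as busy time, because only the idle counter is subtracted.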

(2) Memory usage

The memory-related data are extracted from the basic monitoring data: mem_total (total physical memory), mem_free (free physical memory), mem_buffers (physical memory used for file buffers), and mem_cached (physical memory used for the cache).

The memory usage mem_usage is calculated as:

mem_usage = 100% * (mem_total - mem_free - mem_buffers - mem_cached) / mem_total
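The memory usage formula translates directly into code. The sketch below assumes all four inputs use the same unit; the function name and the sample values are illustrative.

```python
def mem_usage_percent(mem_total: float, mem_free: float,
                      mem_buffers: float, mem_cached: float) -> float:
    """Return memory usage in percent, excluding buffers and cache."""
    if mem_total <= 0:
        raise ValueError("mem_total must be positive")
    used = mem_total - mem_free - mem_buffers - mem_cached
    return 100.0 * used / mem_total


# Made-up values in MB: 16000 total, 4000 free, 1000 buffers, 3000 cached.
mem_usage = mem_usage_percent(16000, 4000, 1000, 3000)
```

Subtracting buffers and cache reflects that this memory can be reclaimed, so it is not counted as used.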

4. Data storage module, comprising an RRD and MySQL, for storing data. The RRD (Round-Robin Database) is used to store monitoring data. It stores data in a space of fixed size, and the data in the database are kept in files with the suffix .rrd for use by the comprehensive analysis and query module. The MySQL database is used to store alarm data for use by the comprehensive analysis and query module. The MySQL database also stores information for the user management part, such as the user detail table, the permission table, and the role table.
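The fixed-size storage idea behind a round-robin database can be illustrated with a bounded buffer in which the oldest sample is overwritten once capacity is reached. This is a conceptual sketch only: the `RingSeries` class is a hypothetical in-memory stand-in and does not reproduce the .rrd file format or the consolidation features of RRDtool.

```python
from collections import deque
from typing import Deque, List, Tuple


class RingSeries:
    """Fixed-capacity time series: the oldest sample is dropped first."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # deque(maxlen=...) discards the oldest entry when full.
        self._samples: Deque[Tuple[int, float]] = deque(maxlen=capacity)

    def append(self, timestamp: int, value: float) -> None:
        self._samples.append((timestamp, value))

    def values(self) -> List[float]:
        return [value for _, value in self._samples]


series = RingSeries(capacity=3)
for ts, v in enumerate([10.0, 20.0, 30.0, 40.0]):
    series.append(ts, v)
```

Because the capacity never grows, storage use stays constant no matter how long monitoring runs.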

5. System management module, for providing user management and permission management functions and for configuring the distributed cluster. To improve system security, the configuration management function of the distributed cluster (the Hadoop platform) is open only to system administrators; ordinary users have only the monitoring function for the platform.
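The permission rule, under which only administrators may change configuration while ordinary users may only monitor, can be sketched as a small role table. The role names, action names, and the `is_allowed` function are hypothetical; the patent states only that such tables are kept in MySQL.

```python
from enum import Enum


class Role(Enum):
    ADMIN = "admin"
    USER = "user"


# Assumed permission table: administrators may monitor and configure,
# ordinary users may only monitor.
PERMISSIONS = {
    Role.ADMIN: {"monitor", "configure"},
    Role.USER: {"monitor"},
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the given role may perform the action."""
    return action in PERMISSIONS.get(role, set())
```

For example, `is_allowed(Role.USER, "configure")` is false, so a configuration request from an ordinary user would be rejected before it reaches the configuration management module.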

6. Overview display module, for calling the comprehensive analysis and query module to obtain the various indicator data and visualize the analysis results. The indicators that can be displayed are as follows:

(1) Today's alarm item statistics: the current fault state of the cluster is shown intuitively as a bar chart, indicating how many servers, services, and components are faulty. Clicking "all alarms" opens the alarm list page. Today's alarm items preferably cover the fault alarms from 00:00 of the current day to the present, ensuring that the latest fault alarm information is provided.

(2) Cluster server state: cluster servers are divided into three states: normal, faulty, and high load. The display shows the proportion of machines in each state among all machines in the cluster. If a server is both faulty and under high load, it is classified as faulty.

(3) Unresolved alarm list: all unresolved alarms. Clicking a server name shows the resource usage of that server in detail.

(4) Hadoop cluster state: shows at a glance whether the Hadoop cluster is busy. A bar chart shows the number of Map and Reduce tasks currently running and the number of Map and Reduce tasks waiting to run.

(5) HDFS capacity: shows HDFS capacity usage at a glance, including DFS used capacity, non-DFS used capacity, and unused capacity.

(6) Running MapReduce jobs: lists basic information, input data volume, and the completion percentage of the Map and Reduce tasks.

(7) Resource usage time-series charts with adjustable granularity, including CPU utilization and memory utilization. The granularity and the monitored interval can be changed by clicking the granularity buttons at the upper left of the chart or by dragging the granularity bar below the chart. A list shows the basic machine information and current resource usage of each server.
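The cluster server state indicator in item (2) can be sketched as below: each server falls into exactly one of three states, with a fault taking precedence over high load, and the display shows the share of servers in each state. The function names and the boolean inputs are assumptions made for illustration; how a server is judged faulty or highly loaded is not specified here.

```python
from collections import Counter
from typing import Dict, List, Tuple


def classify(has_fault: bool, high_load: bool) -> str:
    """Assign one of three states; a faulty server under load counts as faulty."""
    if has_fault:
        return "fault"
    if high_load:
        return "high_load"
    return "normal"


def state_ratios(servers: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Return the fraction of servers in each state."""
    counts = Counter(classify(fault, load) for fault, load in servers)
    total = len(servers)
    return {state: (counts[state] / total if total else 0.0)
            for state in ("normal", "fault", "high_load")}


# Four servers as (has_fault, high_load) pairs; the second is both.
ratios = state_ratios([(False, False), (True, True), (False, True), (False, False)])
```

Checking the fault flag first is what guarantees that the three ratios always sum to one, since no server can be counted twice.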

The present invention thus provides a monitoring and management system that better fits practical needs. With this system, operators can see in a timely manner how server resources are allocated, track the running state of the distributed cluster, raise alarms on abnormal conditions, and simplify distributed cluster configuration; on that basis they can locate system resource bottlenecks and optimize performance.

Claims

1. A distributed monitoring and management system for a Hadoop cluster system, characterized by comprising: a performance monitoring module, a fault alarm module, a comprehensive analysis and query module, an overview display module, a data storage module, a configuration management module, and a system management module, wherein:

The performance monitoring module is configured to monitor the performance of each monitored node in the distributed cluster system and to store the collected monitoring data in the data storage module, the monitoring data including server resources, Hadoop Metrics, Hadoop component logs, and other component logs;

The fault alarm module is configured to raise fault alarms according to the monitoring data stored in the data storage module, or to receive alarm data transmitted by the monitoring node and the monitored nodes, which are within the distributed cluster system or independent of it, store the received alarm data in the data storage module, and raise fault alarms according to that data; the fault alarming includes monitoring the platform state, discovering faulty nodes, crashed processes, and failed services, recording fault and handling information, and notifying administrators of different levels to handle faults of different grades; the alarm data include component information, Hadoop cluster state information, and server information;

The comprehensive analysis and query module is configured to read the monitoring data or alarm data from the data storage module, perform calculation and analysis, and store the analysis results in the data storage module;

The data storage module is configured to store monitoring data or alarm data;

The overview display module is configured to display the analysis results of the comprehensive analysis and query module: it calls the comprehensive analysis and query module, obtains the various indicator data, and visualizes the analysis results;

The system management module is configured to perform user management and permission management: the configuration management function of the distributed cluster (the Hadoop platform) is open only to system administrators, and ordinary users have only the monitoring function for the platform;

The configuration management module is configured to perform unified configuration of the distributed cluster system: a distributed unified configuration service is implemented on the basis of ZooKeeper.
2. The distributed monitoring and management system according to claim 1, characterized in that:

The performance monitoring module comprises a collection module and an aggregation module;

The collection module is configured to read the monitoring data of the monitored nodes and to transmit the collected monitoring data to the aggregation module;

The aggregation module gathers the monitoring data and stores the aggregate in the data storage module.
3. The distributed monitoring and management system according to claim 1, characterized in that:

The fault alarm module is configured to scan the data in the data storage module, determine the level and kind of the alarm information, and send an SMS or email alarm; or to receive alarm data transmitted by the alarm information collection modules on the monitoring node and the monitored nodes, store the received alarm data in the data storage module, and send an SMS or email alarm according to the level and kind of the alarm data.
4. The distributed monitoring and management system according to claim 1, characterized in that the overview display module displays one of the following or a combination thereof:

(1) Today's alarm item statistics: the current fault state of the cluster is shown as a bar chart, indicating how many servers, services, and components are faulty;

(2) Cluster server state: cluster servers are divided into three states: normal, faulty, and high load;

(3) Unresolved alarm list: all unresolved alarms;

(4) Resource usage time-series charts with adjustable granularity, including CPU utilization and memory utilization.
5. The distributed monitoring and management system according to claim 1, characterized in that the data storage module includes an RRD and MySQL; monitoring data are stored in the RRD and alarm data are stored in MySQL.
6. A distributed monitoring and management method for a Hadoop cluster system, the method being implemented by the distributed monitoring and management system according to any one of claims 1 to 5, characterized by comprising the following steps:

Step 1. Monitor the monitored nodes in the distributed cluster system and store the monitoring data in the data storage module, the monitoring data including server resources, Hadoop Metrics, Hadoop component logs, and other component logs;

Step 2. Raise fault alarms according to the stored monitoring data, or receive alarm data transmitted by the alarm information collection modules on the monitoring node and the monitored nodes, which are within the distributed cluster system or independent of it, store the received alarm data in the data storage module, and raise fault alarms according to the alarm data; the fault alarming includes monitoring the platform state, discovering faulty nodes, crashed processes, and failed services, recording fault and handling information, and notifying administrators of different levels to handle faults of different grades; the alarm data include component information, Hadoop cluster state information, and server information;

Step 3. Read the monitoring data or alarm data from the data storage module, perform calculation and analysis, and save the analysis results;

Step 4. Display the analysis results of the comprehensive analysis and query module: call the comprehensive analysis and query module, obtain the various indicator data, and visualize the analysis results;

Step 5. Perform user management and permission management: the configuration management function of the distributed cluster (the Hadoop platform) is open only to system administrators, and ordinary users have only the monitoring function for the platform;

Step 6. Perform unified configuration of the distributed cluster system: a distributed unified configuration service is implemented on the basis of ZooKeeper.
7. The distributed monitoring and management method according to claim 6, characterized in that:

The monitoring in step 1 includes reading the monitoring data of the monitored nodes, and aggregating and storing the collected monitoring data.
8. The distributed monitoring and management method according to claim 6, characterized in that the fault alarming in step 2 specifically consists of scanning the data in the data storage module, determining the level and kind of the alarm information, and sending an SMS or email alarm; or receiving alarm data transmitted by the alarm information collection modules on the monitoring node and the monitored nodes, storing the received alarm data in the data storage module, and sending an SMS or email alarm according to the level and kind of the alarm data.
9. The distributed monitoring and management method according to claim 6, characterized in that the overview display in step 4 includes one of the following or a combination thereof:

(1) Today's alarm item statistics: the current fault state of the cluster is shown as a bar chart, indicating how many servers, services, and components are faulty;

(2) Cluster server state: cluster servers are divided into three states: normal, faulty, and high load;

(3) Unresolved alarm list: all unresolved alarms;

(4) Resource usage time-series charts with adjustable granularity, including CPU utilization and memory utilization.
10. The distributed monitoring and management method according to claim 6, characterized in that the data storage module includes an RRD and MySQL; monitoring data are stored in the RRD and alarm data are stored in MySQL.