CN113419925A

CN113419925A - Monitoring method and system for monitoring and alarming multiple distributed MPP clusters

Info

Publication number: CN113419925A
Application number: CN202110978147.8A
Authority: CN
Inventors: 赵伟; 范树磊
Original assignee: Tianjin Nankai University General Data Technologies Co ltd
Current assignee: Tianjin Nankai University General Data Technologies Co ltd
Priority date: 2021-08-25
Filing date: 2021-08-25
Publication date: 2021-09-21

Abstract

The invention provides a monitoring method and a system for monitoring and alarming a plurality of distributed MPP clusters. The method comprises: setting a monitoring strategy for each distributed MPP cluster through a distributed monitoring system; the acquisition center module sending an acquisition signal to an acquisition agent module; the acquisition agent module acquiring monitoring indexes according to the received acquisition signal and returning the acquired monitoring indexes to the acquisition center module; the acquisition center module performing an alarm operation on the index values of the monitoring indexes and judging whether to alarm; storing the collected information and the alarm information in a resource library module; and the monitoring website module obtaining information from the resource library module so that alarm information and index trend information can be viewed. The method and the system allow an independent monitoring strategy to be set for each of a plurality of distributed MPP clusters while monitoring them centrally. Compared with a monitoring system that can monitor only one cluster, they remove the need to switch between monitoring platforms when switching between distributed MPP clusters.

Description

Monitoring method and system for monitoring and alarming multiple distributed MPP clusters

Technical Field

The invention belongs to the field of distributed MPP cluster monitoring, and particularly relates to a monitoring method and a monitoring system for monitoring and alarming a plurality of distributed MPP clusters.

Background

With the wide application of distributed MPP clusters, several distributed MPP clusters may exist in one environment at the same time. It is therefore necessary to grasp the state and performance of every distributed MPP cluster simultaneously, evaluate the acquired index values against a preset monitoring strategy, and alarm on any index that breaks through its threshold value, so that the database cluster administrator knows the operating state of each database cluster, can adjust the database in real time, and can keep the database cluster running normally.

Disclosure of Invention

In view of this, the present invention is directed to a monitoring method and a monitoring system for monitoring and alarming a plurality of distributed MPP clusters, so as to monitor a plurality of distributed MPP clusters in real time, make it easier for a database cluster administrator to know the operating state of each database cluster, and allow the database to be adjusted in real time.

In order to achieve the purpose, the technical scheme of the invention is realized as follows:

A monitoring method for monitoring and alarming a plurality of distributed MPP clusters comprises the following steps:

S1, setting a monitoring strategy for each distributed MPP cluster through the distributed monitoring system;

S2, the acquisition center module sends an acquisition signal to an acquisition agent module of the distributed MPP cluster;

S3, the acquisition agent module of the distributed MPP cluster acquires the index values of the monitoring indexes according to the received acquisition signal and returns the acquired index values to the acquisition center module;

S4, the acquisition center module performs an alarm operation on the index values of the monitoring indexes and judges whether to alarm;

S5, the acquisition center module stores the acquired index values of the monitoring indexes and the alarm information in the resource library module;

S6, the monitoring website module obtains information from the resource library module to provide visual viewing of alarm information and of index trend information.

Further, the monitoring strategy in step S1 includes: setting an acquisition period for the index values of the monitoring indexes, setting the alarm strategy judgment conditions of each monitoring index, and setting a unified alarm mode for all the monitoring indexes;

the acquisition period of the monitoring indexes is the time interval between two adjacent acquisitions of the state indexes and performance indexes;

the alarm strategy judgment conditions of each monitoring index are set by first setting the single index judgment condition and then setting the summary index judgment condition;

the setting of the single index judgment condition includes, for each monitoring index: the alarm threshold value; the page display state (if yes, the monitoring index is shown on the display interface, otherwise it is not shown); the alarm state (if yes, the monitoring index gives alarms, otherwise it does not); the recovery alarm notification state; the continuous alarm state; the timeout ignore duration; and the continuous breakthrough alarm count.

The setting of the summary index judgment condition includes, for each monitoring index: the summary alarm state, the cancel-single-index-alarm state, the summary mode, the summary judgment condition and the summary judgment threshold value;

the summary modes include index summation, average value, alarm summation and maximum value;

the summary judgment conditions include: greater than, less than, equal to, greater than or equal to, and less than or equal to;

the alarm modes include: mail, simple network management protocol (SNMP) transmission, message queue sending, and network application program.
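The settings listed above can be pictured as a small configuration structure. The following is a minimal sketch in Python, assuming hypothetical class and field names (`ClusterStrategy`, `SingleIndexRule`, `SummaryRule` and so on); the source names the settings but does not define a storage format, so the layout here is illustrative only.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Union


class Condition(Enum):
    GT = ">"
    LT = "<"
    EQ = "=="
    GE = ">="
    LE = "<="


class SummaryMode(Enum):
    SUM = "sum"
    AVERAGE = "average"
    ALARM_SUM = "alarm_sum"
    MAX = "max"


class AlarmMode(Enum):
    MAIL = "mail"
    SNMP = "snmp"
    MESSAGE_QUEUE = "message_queue"
    WEB_APP = "web_app"


@dataclass
class SingleIndexRule:
    # Field names are hypothetical; each maps to one setting named in the text.
    condition: Condition
    threshold: Union[float, str]
    show_on_page: bool = True
    alarm_enabled: bool = True
    notify_on_recovery: bool = False
    continuous_alarm: bool = False
    timeout_ignore_seconds: int = 0
    consecutive_breaches: int = 1  # the text gives 1 as the default


@dataclass
class SummaryRule:
    summary_alarm_enabled: bool
    cancel_single_index_alarm: bool
    mode: SummaryMode
    condition: Condition
    threshold: float


@dataclass
class IndexStrategy:
    key: str
    single: Optional[SingleIndexRule] = None
    summary: Optional[SummaryRule] = None


@dataclass
class ClusterStrategy:
    cluster_name: str
    acquisition_period_seconds: int
    alarm_modes: List[AlarmMode]  # one unified alarm mode set for all indexes
    indexes: List[IndexStrategy] = field(default_factory=list)


strategy = ClusterStrategy(
    cluster_name="cluster-1",
    acquisition_period_seconds=60,
    alarm_modes=[AlarmMode.MAIL, AlarmMode.SNMP],
    indexes=[
        IndexStrategy(
            key="cpu_usage",
            single=SingleIndexRule(Condition.GT, 90.0, consecutive_breaches=3),
        )
    ],
)
```

Keeping the single index rule and the summary rule as separate optional parts mirrors the order in which the text says they are configured.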

Further, the monitoring indexes in step S3 include: availability class, operating system class, process state class, cluster state class, database state class, and execution state class.

Further, the alarm operation of the index value of the monitoring index in step S4 includes a single index alarm operation and a summary index alarm operation.

Further, the single index alarm operation process comprises the following steps:

S401, starting the acquisition center module, and loading the monitoring strategies of all monitoring indexes of the distributed MPP cluster from the resource library module into a monitoring strategy cache;

S402, the acquisition center module acquires the name and the index value of each monitoring index of each server under all monitored distributed MPP clusters;

S403, the acquisition center module obtains the judgment condition and the threshold value of the monitoring strategy from the cache and, for each monitoring index name acquired from the server, compares the index value with the threshold value according to the judgment condition;

S404, if the index value of a monitoring index acquired from the server meets the judgment condition and breaks through the threshold value, the monitoring index of the server is determined to be abnormal and an abnormal alarm is generated; otherwise, the monitoring index of the server is determined to be normal and a recovery alarm is generated;

S405, if an abnormal alarm is generated, the configuration of the monitoring index is obtained from the monitoring strategy according to the name of the monitoring index, in order to judge whether alarming is enabled, whether continuous alarm is set, and whether the continuous breakthrough alarm count has been reached, and the alarm is then given;

S406, if a recovery alarm is generated, the configuration of the monitoring index is obtained from the monitoring strategy according to the name of the monitoring index, and whether the alarm needs to be recovered is judged.
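Steps S402 to S404 amount to a lookup followed by a comparison. A minimal sketch, assuming the monitoring strategy cache is a plain dictionary keyed by index name and that conditions are stored as operator strings (both are assumptions of this example, not details given in the source):

```python
import operator
from typing import Callable, Dict, List, Tuple, Union

Value = Union[float, str]

# Maps each judgment condition to the comparison it stands for.
COMPARATORS: Dict[str, Callable[[Value, Value], bool]] = {
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
}


def single_index_check(
    strategy_cache: Dict[str, dict],
    collected: Dict[str, Dict[str, Value]],
) -> List[Tuple[str, str, str]]:
    """Classify every collected index value as 'abnormal' or 'recovery'."""
    results = []
    for server, indexes in collected.items():      # S402: per server, per index
        for name, value in indexes.items():
            rule = strategy_cache.get(name)         # S403: condition and threshold
            if rule is None:
                continue                            # no strategy set for this index
            breached = COMPARATORS[rule["condition"]](value, rule["threshold"])
            results.append((server, name, "abnormal" if breached else "recovery"))  # S404
    return results


strategy_cache = {
    "cpu_usage": {"condition": ">", "threshold": 90.0},
    "db_state": {"condition": "==", "threshold": "down"},
}
collected = {
    "node-1": {"cpu_usage": 95.0, "db_state": "up"},
    "node-2": {"cpu_usage": 10.0, "db_state": "down"},
}
results = single_index_check(strategy_cache, collected)
```

Whether an abnormal or recovery result actually leads to a notification is decided afterwards from the per-index configuration, as steps S405 and S406 describe.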

The summary index alarm operation process comprises the following steps:

S411, starting the acquisition center module, and loading the monitoring strategies of all monitoring indexes of the distributed MPP cluster from the resource library module into a monitoring strategy cache;

S412, the acquisition center module obtains, from the monitoring index cache, the names and index values of the monitoring indexes that were alarmed in the single index alarm operation, obtains the summary mode from the monitoring strategy cache, and then performs the summary operation according to the summary mode to obtain a summary value;

S413, the acquisition center module obtains the summary judgment condition and the summary threshold value of the corresponding monitoring strategy from the monitoring strategy cache according to the monitoring index name, and compares the summary value with the summary threshold value according to the summary judgment condition;

S414, if the summary value meets the summary judgment condition and breaks through the summary threshold value, an abnormal alarm is generated;

S415, if the summary value does not meet the summary judgment condition and does not break through the summary threshold value, a recovery alarm is generated.
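The summary operation in S412 to S415 can be sketched as an aggregation followed by a single comparison. The source does not define "alarm summation"; the sketch below ASSUMES it means the number of nodes whose single index check alarmed, and the function and key names are hypothetical.

```python
import operator
from statistics import mean
from typing import Dict, List

COMPARATORS = {
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
}


def summarize(mode: str, node_values: Dict[str, float], alarmed_nodes: List[str]) -> float:
    """Reduce the per-node values of one index to a single summary value."""
    values = list(node_values.values())
    if mode == "sum":
        return float(sum(values))
    if mode == "average":
        return float(mean(values))
    if mode == "max":
        return float(max(values))
    if mode == "alarm_sum":
        # ASSUMPTION: count of nodes that raised a single index alarm.
        return float(len(alarmed_nodes))
    raise ValueError(f"unknown summary mode: {mode}")


def summary_check(rule: dict, node_values: Dict[str, float], alarmed_nodes: List[str]) -> str:
    """S413 to S415: compare the summary value with the summary threshold."""
    summary_value = summarize(rule["mode"], node_values, alarmed_nodes)
    breached = COMPARATORS[rule["condition"]](summary_value, rule["threshold"])
    return "abnormal" if breached else "recovery"


node_values = {"node-1": 70.0, "node-2": 80.0, "node-3": 96.0}
alarmed_nodes = ["node-3"]
average_result = summary_check(
    {"mode": "average", "condition": ">", "threshold": 80.0}, node_values, alarmed_nodes
)
```

Here the average across the three nodes is 82.0, which exceeds the threshold of 80.0, so the cluster-level result is abnormal even though only one node is individually high.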

A monitoring system for monitoring and alarming a plurality of distributed MPP clusters comprises: a resource library module, a WEB module, an acquisition center module and an acquisition agent module;

the acquisition center module is used for sending data acquisition signals to the acquisition agent module;

the acquisition agent module is used for acquiring the index value of the monitoring index of the server according to the received acquisition signal and transmitting the index value of the monitoring index to the acquisition center module for alarm operation;

the WEB module is used for receiving and displaying data after alarm operation of the acquisition center module;

and the resource library module is used for storing the data transmitted to the acquisition center module by the acquisition agent module and the data after the alarm operation of the acquisition center module.

Further, the resource library module is used for storing system configuration data and system acquisition data;

the system configuration data includes: user information, role information, module authority information, summary information of the target distributed MPP cluster, monitoring strategies of the target distributed MPP cluster, node information of the target distributed MPP cluster and auxiliary information for system operation;

the system acquisition data includes: index acquisition data of the target distributed cluster and alarm information of the target distributed cluster.

Furthermore, the WEB module is used for providing a visual operation panel for a user to perform related configuration of the system, and simultaneously providing display of all target distributed MPP cluster monitoring indexes and alarm information viewing;

the acquisition center module is used for periodically acquiring a value of a monitoring index from an acquisition agent module on the target distributed MPP cluster according to a monitoring strategy set by a user;

each target distributed MPP cluster corresponds to one acquisition center module, and the acquisition center module interacts with the resource library module, the WEB module and the acquisition agent module.

Further, the acquisition agent module is used for receiving the data acquisition request sent by the acquisition center module;

each target distributed MPP cluster server corresponds to one acquisition agent module, the acquisition agent modules must be deployed on the corresponding target distributed MPP cluster servers, and the acquisition agent modules interact with the acquisition center module.

Compared with the prior art, the monitoring method and the system for monitoring and alarming a plurality of distributed MPP clusters have the following beneficial effects:

(1) the method and the system allow an independent monitoring strategy to be set for each of a plurality of distributed MPP clusters while monitoring them centrally; compared with a monitoring system that can monitor only one cluster, they remove the need to switch between monitoring platforms when switching between distributed MPP clusters;

(2) the method and the system provide more complete and flexible monitoring strategy settings, better meet users' requirements for monitoring distributed MPP clusters in different scenarios, and provide more accurate alarms;

(3) the method and the system are modular: the acquisition center module serves as the acquisition core of one distributed MPP cluster, and the acquisition agent module serves as the index collector of one distributed MPP cluster server, so responsibilities are clearly divided and deployment is flexible. A dedicated acquisition center module per cluster is better isolated than a shared one: when the acquisition center module of one cluster has a problem, the acquisition center modules of other clusters keep running normally. A dedicated acquisition agent module is more stable than an acquisition scheme configured through commands or scripts, can handle more complex monitoring index acquisition flows, and is better suited to acquiring the internal state of a distributed MPP cluster database.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a framework diagram of a monitoring system for monitoring and alarming a plurality of distributed MPP clusters according to an embodiment of the present invention;

FIG. 2 is a flowchart of a single-indicator alarm operation process according to an embodiment of the present invention;

FIG. 3 is a flow chart of a single-indicator alarm sub-operation according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating a summary indicator alarm operation according to an embodiment of the present invention;

FIG. 5 is a flowchart of a summary indicator alarm operation subprocess according to an embodiment of the present invention;

FIG. 6 is a flowchart of the method according to an embodiment of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention. Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.

In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may be fixed, removable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or an internal communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific situation.

The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

As shown in FIG. 1, a monitoring system for monitoring and alarming a plurality of distributed MPP clusters includes: a resource library module, a WEB module, an acquisition center module and an acquisition agent module;

A user visually sets the monitoring strategies of the 1st to the nth distributed MPP cluster databases through the WEB module. The monitoring strategies are sent over Java Database Connectivity (JDBC) and stored in the resource library module. Each of the 1st to the nth distributed MPP cluster databases corresponds to one acquisition center module. On startup, the acquisition center module reads the configured monitoring strategies from the resource library module and stores them in a monitoring strategy cache. According to the acquisition period set in the monitoring strategy, the acquisition center module periodically sends data acquisition requests, over a remote procedure call (RPC) protocol, to the acquisition agent modules deployed on the 1st to the nth servers of the distributed MPP cluster database. On receiving a data acquisition request, each acquisition agent module acquires the index values of the monitoring indexes from the operating system of its server and from the distributed MPP cluster database, and returns all the index values to the acquisition center module once acquisition is finished. After receiving the data returned by the 1st to the nth acquisition agent modules, the acquisition center module evaluates the index values against the monitoring strategy to determine whether an alarm is needed, sends the index values and the alarm information over JDBC for storage in the resource library module, and stores the index values of all the monitoring indexes in a monitoring index cache. The WEB module can obtain alarm information from the resource library module over JDBC for visual display, and can also obtain the index values of all the monitoring indexes on the 1st to the nth servers of the distributed MPP cluster databases from the monitoring index cache of the acquisition center module over RPC for visual display.
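The data flow in the paragraph above can be sketched with in-memory stand-ins. In this illustration the resource library module is a plain dictionary instead of a database reached over JDBC, each acquisition agent is a Python callable instead of an RPC endpoint, and only a "greater than" condition is modelled; all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Stand-in for one RPC call to the acquisition agent on a server (assumption).
Agent = Callable[[], Dict[str, float]]


class AcquisitionCenter:
    def __init__(self, repository: dict, agents: Dict[str, Agent]) -> None:
        self.repository = repository
        self.agents = agents
        # Monitoring strategies are read once at startup and kept in a cache.
        self.strategy_cache: Dict[str, float] = dict(repository["strategies"])
        # Latest index values per server, readable by the WEB module.
        self.index_cache: Dict[str, Dict[str, float]] = {}

    def run_cycle(self) -> List[Tuple[str, str, float]]:
        """One acquisition period: poll every agent, evaluate, then store."""
        alarms = []
        for server, agent in self.agents.items():
            values = agent()
            self.index_cache[server] = values
            self.repository["index_values"].append((server, dict(values)))
            for key, value in values.items():
                threshold = self.strategy_cache.get(key)
                # Simplification: only the "greater than" condition is modelled.
                if threshold is not None and value > threshold:
                    alarms.append((server, key, value))
        self.repository["alarms"].extend(alarms)
        return alarms


repository = {"strategies": {"cpu_usage": 90.0}, "index_values": [], "alarms": []}
agents = {
    "server-1": lambda: {"cpu_usage": 42.0},
    "server-2": lambda: {"cpu_usage": 97.5},
}
center = AcquisitionCenter(repository, agents)
alarms = center.run_cycle()
```

A scheduler would call `run_cycle` once per acquisition period; running one such object per cluster matches the one-center-per-cluster arrangement described in the text.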

1. The resource library module is used for storing system configuration data and system acquisition data. The system configuration data includes: user information, role information, module authority information, summary information of the target distributed MPP cluster, monitoring strategies of the target distributed MPP cluster, node information of the target distributed MPP cluster and auxiliary information for system operation. The system acquisition data includes: index acquisition data of the target distributed cluster and alarm information of the target distributed cluster. There is only one resource library module in a system.

2. The WEB module is used for providing a visual operation panel for a user to perform related configuration of the system, and for displaying the monitoring indexes of all target distributed MPP clusters and allowing alarm information to be viewed. There is only one WEB module in a system, and it interacts with the resource library module and the acquisition center module.

3. The acquisition center module is used for periodically acquiring the values of the monitoring indexes from the acquisition agent modules on the target distributed MPP cluster according to the monitoring strategies set by a user. Each distributed MPP cluster corresponds to one set of monitoring strategies, and each set contains the monitoring strategies of all the monitoring indexes. The index value of a monitoring index is its actual value at the time of acquisition; it can be a number or a character string, and it is compared with the threshold value set for the monitoring index using the judgment condition set for that index. Monitoring indexes are divided into those for single-index alarm and those for summary alarm, and the monitoring strategy can set a given monitoring index as either.

As shown in FIG. 2 and FIG. 3, the single-index alarm operation process is as follows:

a1. The acquisition center module loads the monitoring strategies from the resource library module into its internal monitoring strategy cache.

a2. The acquisition center module obtains the acquisition period from the internal monitoring strategy cache, starts acquisition scheduling, and starts an alarm processing thread for each server whenever the acquisition period interval elapses.

a3. The alarm processing thread of each server obtains the keys, threshold values and judgment conditions of all single-index monitoring indexes from the monitoring strategy cache in the acquisition center module.

a4. The alarm processing thread of each server obtains the keys and index values of all monitoring indexes from the acquisition agent module.

a5. The alarm processing thread of each server loops over all single-index monitoring indexes obtained from the acquisition agent module and performs alarm judgment processing on each, as follows:

a51. Judge whether the index value breaks through the threshold value according to the judgment condition of the single-index monitoring index. The judgment condition can be greater than, greater than or equal to, or less than or equal to. Two situations can occur:

a511. In the first situation, the relationship between the index value and the threshold value meets the judgment condition. The monitoring index is judged to have broken through the threshold value, that is, the monitoring index of the server is considered abnormal, and an abnormal alarm is generated.

If an abnormal alarm is generated, the settings for whether the monitoring index alarms, whether it alarms continuously, and its continuous breakthrough alarm count are obtained from the monitoring strategy, and whether an alarm is needed is judged as follows:

a5111. Judge whether the index needs to alarm. If not, alarm processing stops; if so, go on to judge whether continuous alarm is set.

a5112. Judge whether continuous alarm is set. When non-continuous alarm is set, the current period does not alarm if the monitoring index already alarmed in the previous period, and does alarm if it did not. If an alarm is needed, go on to judge the continuous breakthrough alarm count.

a5113. Judge the continuous breakthrough alarm count. An alarm is generated when the number of times an alarm has been required reaches the configured count. The count must be a natural number and defaults to 1, in which case an alarm is generated as soon as the alarm condition is met and an alarm is required.

a512. In the second situation, the relationship between the index value and the threshold value does not meet the judgment condition, which indicates that the monitoring index is normal. Whether a recovery alarm is needed is then read from the monitoring strategy cache; if so, a recovery alarm is generated.

a52. When an abnormal alarm or a recovery alarm is generated, alarm information is generated and stored in the resource library module.

a53. The alarm information is also stored in an alarm buffer queue.

a54. The cache of the breakthrough alarm count of the monitoring index is updated, together with the cache recording whether the monitoring index alarmed in the previous period.

a6. The alarm processing thread of each server stores the monitoring index keys and index values of all servers in the monitoring index cache.

a7. The alarm processing thread of each server stores the monitoring index keys and index values of all servers in the resource library module.
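The gating in steps a5111 to a5113, a512 and a54 can be sketched as a small state machine per monitoring index. The source leaves two points open, so this sketch makes two labelled ASSUMPTIONS: the "alarmed in the previous period" flag stays set until the index returns to normal, and a recovery alarm is only produced when the index was previously in the alarmed state. Class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    alarm_enabled: bool = True
    continuous_alarm: bool = False
    consecutive_breaches: int = 1   # default of 1, as stated in a5113
    notify_on_recovery: bool = True


@dataclass
class IndexState:
    breach_count: int = 0   # breakthrough count cache (a54)
    alarmed: bool = False   # "alarmed in the previous period" cache (a54)


def decide(rule: Rule, state: IndexState, breached: bool) -> Optional[str]:
    """Return 'abnormal', 'recovery', or None when nothing should be sent."""
    if not breached:                                   # a512
        was_alarmed = state.alarmed
        state.breach_count = 0
        state.alarmed = False
        # ASSUMPTION: recover only if an alarm was actually outstanding.
        return "recovery" if (was_alarmed and rule.notify_on_recovery) else None
    if not rule.alarm_enabled:                         # a5111
        return None
    if state.alarmed and not rule.continuous_alarm:    # a5112
        return None
    state.breach_count += 1                            # a5113
    if state.breach_count < rule.consecutive_breaches:
        return None
    state.alarmed = True                               # a54
    return "abnormal"


rule = Rule(consecutive_breaches=2)
state = IndexState()
outcomes = [decide(rule, state, b) for b in (True, True, True, False, False)]
```

With a count of 2 and non-continuous alarming, the first breach is held back, the second raises the alarm, the third is suppressed, and the first normal period produces the recovery.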

After an abnormal alarm or a recovery alarm is generated for a monitoring strategy, the alarm modes configured for that strategy are obtained from the monitoring strategy cache. The alarm modes include snmp, mail, kafka and restful, and several can be selected at the same time. The alarm information is pushed to the designated downstream system interface according to the type of each selected alarm mode. After alarming is complete, the monitoring index data and the alarm data are stored in the resource library module, and the monitoring index data is stored in the monitoring index cache for use in the summary operation on monitoring indexes and for display of index values by the WEB module.
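Pushing one alarm through several selected alarm modes can be sketched as a lookup in a registry of sender functions. The senders below are in-memory stand-ins, not real snmp, mail, kafka or restful clients, and the choice to let one failing channel not block the others is a design assumption of this sketch rather than something the source states.

```python
from typing import Callable, Dict, List

Sender = Callable[[dict], None]


def dispatch(alarm: dict, modes: List[str], senders: Dict[str, Sender]) -> List[str]:
    """Send the alarm through every selected mode; return the modes that succeeded."""
    delivered = []
    for mode in modes:
        sender = senders.get(mode)
        if sender is None:
            continue                  # mode selected but no sender registered
        try:
            sender(alarm)
            delivered.append(mode)
        except Exception:
            # Assumed policy: a failing channel does not stop the remaining ones.
            continue
    return delivered


def failing_sender(alarm: dict) -> None:
    raise ConnectionError("downstream interface unreachable")


sent = {"mail": [], "kafka": []}
senders = {
    "mail": sent["mail"].append,      # stand-in for a mail client
    "kafka": sent["kafka"].append,    # stand-in for a message queue producer
    "snmp": failing_sender,           # simulates an unreachable receiver
}
alarm = {"server": "node-2", "index": "cpu_usage", "kind": "abnormal", "value": 97.5}
delivered = dispatch(alarm, ["snmp", "mail", "kafka", "restful"], senders)
```

Returning the list of modes that succeeded gives the caller something to record alongside the alarm data it stores.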

As shown in FIG. 4 and FIG. 5, the summary index alarm operation process is as follows:

For a monitoring index that requires the summary operation, the index values on every server under the target distributed MPP cluster are summarized together, and whether the result is abnormal is calculated according to the summary mode, the summary judgment condition and the summary judgment threshold value. The calculation process is as follows:

b1. The acquisition center module loads the monitoring strategies from the resource library module into its internal monitoring strategy cache.

b2. The acquisition center module obtains the acquisition period from the internal monitoring strategy cache, starts acquisition scheduling, and starts a summary alarm processing thread whenever the acquisition period interval elapses.

b3. The summary alarm processing thread obtains the keys, summary modes, summary judgment conditions and summary threshold values of all summary monitoring indexes from the monitoring strategy cache in the acquisition center module.

b4. The summary alarm processing thread obtains the keys and index values of all summary indexes on all nodes from the monitoring index cache.

b5. The summary alarm processing thread loops over all summary monitoring indexes and performs alarm judgment processing on each, as follows:

b51. Summarize the index values of all the nodes into a summary value according to the summary mode, then compare the summary value with the summary threshold value according to the summary judgment condition.

b52. Judge whether the summary value breaks through the summary threshold value according to the summary judgment condition. The judgment condition can be greater than, greater than or equal to, or less than or equal to. Two situations can occur:

b521. In the first situation, the relationship between the summary value and the summary threshold value meets the judgment condition. The monitoring index is judged to have broken through the threshold value, that is, the monitoring index is considered abnormal, and an abnormal alarm is generated.

b5211. Judge whether the index needs to alarm. If not, alarm processing stops; if so, go on to judge whether continuous alarm is set.

b5212. Judge whether continuous alarm is set. When non-continuous alarm is set, the current period does not alarm if the monitoring index already alarmed in the previous period, and does alarm if it did not. If an alarm is needed, go on to judge the continuous breakthrough alarm count.

b5213. Judge the continuous breakthrough alarm count. An alarm is generated when the number of times an alarm has been required reaches the configured count. The count must be a natural number and defaults to 1, in which case an alarm is generated as soon as the alarm condition is met and an alarm is required.

b522. In the second situation, the relationship between the summary value and the summary threshold value does not meet the judgment condition, which indicates that the monitoring index is normal. Whether a recovery alarm is needed is then read from the monitoring strategy cache; if so, a recovery alarm is generated.

b53. When an abnormal alarm or a recovery alarm is generated, alarm information is generated and stored in the resource library module.

b54. The alarm information is also stored in the alarm buffer queue.

b55. The cache of the breakthrough alarm count of the monitoring index is updated, together with the cache recording whether the monitoring index alarmed in the previous period.

b6. The summary alarm processing thread stores the monitoring index keys and index values of all servers in the monitoring index cache.

b7. The summary alarm processing thread stores the monitoring index keys and index values of all servers in the resource library module.

After an abnormal alarm or a recovery alarm is generated, the alarm modes configured for the monitoring strategy are obtained from the monitoring strategy cache. The alarm modes include snmp, mail, kafka and restful, and several can be selected at the same time. The alarm information is pushed to the designated downstream system interface according to the type of each selected alarm mode. After alarming is complete, the monitoring index data and the alarm data are stored in the resource library module, and the monitoring index data is stored in the monitoring index cache for use in the summary operation on monitoring indexes and for display of index values by the WEB module.

The acquisition center module interacts with the resource library module, the WEB module and the acquisition agent module.

4. The acquisition agent module is used for receiving a data acquisition request sent by the acquisition center module, acquiring index values on the target distributed MPP cluster server where it is located according to the request, and returning the index values to the acquisition center module. The acquisition methods include operating system commands, scripts, a third-party class library (sigar) and SQL. Each target distributed MPP cluster server corresponds to one acquisition agent module, and the acquisition agent module must be deployed on the corresponding target distributed MPP cluster server. The acquisition agent module interacts with the acquisition center module.
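An acquisition agent of this kind can be sketched as a table of collector functions, one per monitoring index key. The example below uses only the Python standard library as a stand-in: `shutil.disk_usage` in place of the sigar library, the interpreter's own `--version` command in place of an operating system command, and an in-memory SQLite database in place of the MPP database. The index keys are hypothetical.

```python
import shutil
import sqlite3
import subprocess
import sys
from typing import Callable, Dict


def disk_used_percent(path: str = ".") -> float:
    """Library-based collection (stand-in for a class library such as sigar)."""
    usage = shutil.disk_usage(path)
    return round(100.0 * usage.used / usage.total, 2) if usage.total else 0.0


def command_output(argv: list) -> str:
    """Command-based collection: run a command and return its trimmed output."""
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    return done.stdout.strip()


def sql_probe() -> int:
    """SQL-based collection; SQLite in memory replaces the MPP database here."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute("SELECT 1").fetchone()[0]
    finally:
        conn.close()


COLLECTORS: Dict[str, Callable[[], object]] = {
    "disk_used_percent": disk_used_percent,
    "runtime_version": lambda: command_output([sys.executable, "--version"]),
    "db_alive": sql_probe,
}


def handle_acquisition_request() -> Dict[str, object]:
    """Collect every index; one failing collector does not abort the others."""
    values: Dict[str, object] = {}
    for key, collect in COLLECTORS.items():
        try:
            values[key] = collect()
        except Exception as exc:
            values[key] = f"error: {exc}"
    return values


values = handle_acquisition_request()
```

Returning numbers and strings side by side is consistent with the text's statement that an index value can be either.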

The monitoring policy in step S1 includes: setting an acquisition cycle of index values of the monitoring indexes, setting an alarm strategy judgment condition of each monitoring index, and setting a unified alarm mode of all the monitoring indexes;

the setting of the single index judgment condition comprises: the acquisition state of each monitoring index (if yes, the monitoring index is acquired; if no, it is not acquired); the alarm threshold of each monitoring index; the page display state of each monitoring index (if yes, the display interface shows the state of the monitoring index; if no, it is not shown); the alarm state of each monitoring index (if yes, the monitoring index raises alarms; if no, it does not); the alarm recovery notification state of each monitoring index (if yes, a notification is issued when the alarm recovers; if no, it is not); the continuous alarm state of each monitoring index (if yes, the alarm continues to be issued while the abnormal state persists); the timeout-ignore state of each monitoring index (if yes, timeouts are ignored; if no, the timeout duration is judged); the timeout-ignore duration of each monitoring index; and the number of consecutive threshold breaches that triggers an alarm for each monitoring index.

the summarizing modes comprise summation, average value, alarm summation and maximum value;

the alarm mode comprises the following steps: mail, snmp, kafka, restful;

mail: the alarm is sent by mail; the alarm thread sends the alarm information from the outbox to the inbox through a mail server using a mail protocol.

snmp: the alarm is sent by snmp (the Simple Network Management Protocol, a standard protocol designed for managing network nodes in an IP network); the alarm thread sends the alarm information from a sending end to a receiving end using the snmp protocol.

kafka: the alarm is sent through a Kafka message queue (Kafka is a high-throughput distributed publish-subscribe messaging system); the alarm thread sends the alarm information to the Kafka queue by calling a Kafka producer command, and any third-party application that needs it can obtain the alarm information from the Kafka message queue by calling a Kafka consumer command.

restful: the alarm is sent through a restful interface (restful is a design style and development mode for network applications, based on http, in which payloads can be defined in xml or json format); the alarm thread sends the alarm information from a sending end to a receiving end through the restful interface.
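The monitoring policy described above (an acquisition cycle, per-index judgment conditions and a unified set of alarm modes) can be pictured as a small data structure. The following Python sketch is illustrative only: the class names `IndexRule` and `MonitoringPolicy`, all field names, the default values and the index name `cpu_usage` are assumptions, and the source does not specify how the policy is actually stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List

ALLOWED_ALARM_MODES = {"mail", "snmp", "kafka", "restful"}


@dataclass
class IndexRule:
    """Per-index judgment settings (field names and defaults are hypothetical)."""
    collect: bool = True
    alarm_threshold: float = 0.0
    show_on_page: bool = True
    alarm_enabled: bool = True
    notify_on_recovery: bool = True
    continuous_alarm: bool = False
    ignore_timeout: bool = False
    timeout_ignore_seconds: int = 0
    consecutive_breaches: int = 1


@dataclass
class MonitoringPolicy:
    """One policy per monitored cluster: cycle, alarm modes, per-index rules."""
    acquisition_cycle_seconds: int
    alarm_modes: List[str]
    rules: Dict[str, IndexRule] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject alarm modes other than the four named in the text.
        unknown = set(self.alarm_modes) - ALLOWED_ALARM_MODES
        if unknown:
            raise ValueError(f"unsupported alarm modes: {sorted(unknown)}")


policy = MonitoringPolicy(
    acquisition_cycle_seconds=60,
    alarm_modes=["mail", "kafka"],
    rules={"cpu_usage": IndexRule(alarm_threshold=90.0, consecutive_breaches=3)},
)
```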

The monitoring indexes in step S3 are classified into: available class, operating system class, process state class, cluster state class, database state class, execution state class.

A total of 47 monitoring indexes are collected, and users cannot add monitoring indexes through configuration. According to their acquisition type, the monitoring indexes are divided into the following classes:

Available class: monitoring indexes for whether the monitoring system is available. It comprises two monitoring indexes, host_availability and agent_availability, which indicate whether the server host is reachable and whether the acquisition agent is reachable, respectively. The monitoring system is available, and the other monitoring indexes can be acquired normally, only when both indexes are normal; otherwise the other monitoring indexes cannot be acquired.

Operating system class: monitoring indexes for four aspects of the server operating system: cpu, memory, disk and network. Linux operating systems such as Red Hat, CentOS, SUSE and Kylin are currently supported, as are CPU architectures such as x86, Loongson and Phytium. The acquisition indexes cover cpu, memory, disk and network.

Process state class: monitoring indexes for the states of the main processes of the distributed MPP cluster, including the state and the memory usage of each process. The main monitored processes are gcclusterd, gbased, syncserver and gcware.

Cluster state class: the monitoring indexes for monitoring the state of the distributed MPP cluster comprise a cluster state, a cluster mode, a gcluster data state and a gnode data state.

Database state class: monitoring indexes for the databases of the distributed MPP cluster, including the disk space occupied by table spaces, the disk space occupied by databases, and the hole ratio of database tables.

Execution state class: monitoring indexes for the task execution status of the distributed MPP cluster, including the number of SQL statements whose execution has timed out and the number of server sessions.
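The rule stated for the available class, that other monitoring indexes can be acquired only when both host_availability and agent_availability are normal, can be sketched as follows. The function name `indexes_to_collect` and the index name `cpu_usage` are hypothetical, and the sketch assumes that the two availability indexes themselves are always checked.

```python
from typing import Iterable, List

AVAILABILITY_INDEXES = ["host_availability", "agent_availability"]


def indexes_to_collect(host_ok: bool, agent_ok: bool,
                       other_indexes: Iterable[str]) -> List[str]:
    """Return the indexes that can be acquired in the current round."""
    if host_ok and agent_ok:
        # Both availability indexes are normal: everything can be acquired.
        return AVAILABILITY_INDEXES + list(other_indexes)
    # Host or agent unreachable: only the availability indexes are reported.
    return list(AVAILABILITY_INDEXES)


normal = indexes_to_collect(True, True, ["cpu_usage"])
degraded = indexes_to_collect(True, False, ["cpu_usage"])
```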

The alarm operation of the index value of the monitoring index in step S4 includes a single index alarm operation and a summary index alarm operation.

The single index alarm operation process comprises the following steps: s401, the acquisition center module is started and loads the monitoring strategies of all monitoring indexes of the distributed MPP cluster from the resource library module into the monitoring strategy cache;
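The single index alarm operation compares each acquired index value with the threshold in the monitoring strategy. The following Python sketch illustrates one way such a check could work, under several assumptions that are not confirmed by the source: the class name `SingleIndexChecker` is hypothetical, the judgment conditions are mapped to Python comparison operators, the consecutive breach count is interpreted as the number of consecutive abnormal values required before an alarm is generated, and a recovery event is emitted on the first normal value after an alarm.

```python
import operator
from typing import Optional

# Judgment conditions mapped to comparison functions (mapping is an assumption).
CONDITIONS = {
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
}


class SingleIndexChecker:
    """Tracks one monitoring index on one server across acquisition rounds."""

    def __init__(self, condition: str, threshold: float,
                 required_breaches: int = 1) -> None:
        self.compare = CONDITIONS[condition]
        self.threshold = threshold
        self.required_breaches = required_breaches
        self.streak = 0        # consecutive abnormal values seen so far
        self.alarming = False  # True while an alarm is outstanding

    def update(self, value: float) -> Optional[str]:
        """Return 'alarm', 'recovery', or None for one new index value."""
        if self.compare(value, self.threshold):
            self.streak += 1
            if self.streak >= self.required_breaches and not self.alarming:
                self.alarming = True
                return "alarm"
            return None
        self.streak = 0
        if self.alarming:
            self.alarming = False
            return "recovery"
        return None


checker = SingleIndexChecker(">", 90.0, required_breaches=2)
events = [checker.update(v) for v in [95.0, 96.0, 97.0, 50.0, 95.0]]
```

With two required breaches, the first abnormal value produces no event, the second produces an alarm, and the later normal value produces a recovery.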

The summary index alarm operation process comprises the following contents:

s412, the acquisition center module obtains, from the monitoring index cache (which stores the actual monitoring index values obtained from each server node), the names and index values of the monitoring indexes that have been through the single index alarm operation, obtains the summarizing mode from the monitoring strategy cache, and then performs the summarizing operation according to that mode to obtain a summary value;
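The summarizing operation in s412 can be sketched in Python as below. This is an illustration under assumptions rather than the described implementation: the function name `summarize`, the mode keys and the server names are hypothetical, and "alarm summation" is interpreted here as the number of servers whose single index check alarmed, which the source does not define precisely.

```python
from statistics import mean
from typing import Dict, Iterable


def summarize(values: Dict[str, float], mode: str,
              alarmed_servers: Iterable[str] = ()) -> float:
    """Combine the per-server values of one index into one summary value."""
    if not values:
        raise ValueError("no index values to summarize")
    if mode == "sum":
        return sum(values.values())
    if mode == "average":
        return mean(values.values())
    if mode == "max":
        return max(values.values())
    if mode == "alarm_sum":
        # Assumption: count the servers that raised a single index alarm.
        alarmed = set(alarmed_servers)
        return float(sum(1 for server in values if server in alarmed))
    raise ValueError(f"unknown summarizing mode: {mode}")


per_server = {"node1": 10.0, "node2": 30.0, "node3": 20.0}
total = summarize(per_server, "sum")
average = summarize(per_server, "average")
peak = summarize(per_server, "max")
alarm_count = summarize(per_server, "alarm_sum", alarmed_servers=["node2"])
```

The summary value would then be compared with the summary judgment threshold using the summary judgment condition from the monitoring strategy.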

The specific embodiment is as follows:

As shown in Fig. 6:

c1. A distributed monitoring system is built that can simultaneously set monitoring strategies, acquire monitoring indexes, and monitor and alarm for a plurality of distributed MPP clusters.

c2. Through a distributed monitoring system, monitoring strategy setting is carried out on each monitoring target distributed MPP cluster, and the method specifically comprises the following steps:

Step one, when a target distributed MPP cluster is added, a new monitoring strategy can be created for it. The monitoring strategy comprises: a unified acquisition period for all monitoring indexes; for each monitoring index, the alarm threshold, whether to acquire, whether to display on the page, whether to alarm, whether to send a recovery notification, whether to alarm continuously, whether to ignore timeouts, the timeout-ignore duration, the number of consecutive threshold breaches that triggers an alarm, whether to cancel the single index alarm, whether to perform a summary alarm, the summarizing mode (summation, average, alarm summation and maximum value), the summary judgment condition (greater than, less than, equal to, greater than or equal to, less than or equal to) and the summary judgment threshold; and the interface modes (mail, snmp, kafka and restful) through which all alarms, once generated, can be pushed to a downstream system. The rules of a newly created monitoring strategy are generated according to default rules. Alternatively, an existing monitoring strategy can be selected for the distributed MPP cluster.

Step two, after the target distributed MPP cluster has been added, the monitoring strategy bound to it can be customized according to actual service requirements. If no customization is made, monitoring uses the strategy generated according to the default rules. Customization allows different distributed MPP clusters to be monitored according to their needs: the settings can follow the resource conditions of the cluster's deployment environment, the characteristics of the services it runs, and the required monitoring frequency, so that distributed MPP clusters in different environments can be monitored. The settable contents of the strategy include:

The acquisition period of the monitoring indexes, which controls the time interval at which state and performance indexes are acquired from the target distributed MPP cluster.

Monitoring index setting: the settable monitoring indexes are divided into the available class, operating system class, process state class, cluster state class, database state class and execution state class. The settable contents for each monitoring index include: alarm threshold, whether to acquire, whether to display on the page, whether to alarm, whether to send a recovery notification, whether to alarm continuously, whether to ignore timeouts, timeout-ignore duration, number of consecutive threshold breaches that triggers an alarm, whether to perform a summary alarm, whether to cancel the single index alarm, summarizing mode (summation, average, alarm summation and maximum value), summary judgment condition (greater than, less than, equal to, greater than or equal to, less than or equal to) and summary judgment threshold. The monitoring index settings are used for the alarm operation on monitoring data acquired from the target distributed MPP cluster.

Alarm mode setting: after an alarm mode is enabled and configured, if a state or performance index acquired from the target distributed MPP cluster breaches its threshold and an alarm is raised, the alarm data is pushed to the target system through that mode.

c3. Monitoring of the target distributed MPP cluster is started through the distributed monitoring system. During startup, the acquisition center module obtains the monitoring strategy and the acquisition agent information of the target distributed MPP cluster from the resource library module. The monitoring strategy is used for the alarm operation on monitoring indexes, and the acquisition agent information is used by the acquisition center module to connect periodically to all acquisition agent modules and thereby obtain monitoring index values from all servers of the target distributed MPP cluster. After obtaining this information, the acquisition center module starts the information acquisition task; each time the interval set by the acquisition period elapses, it connects to all acquisition agent modules and obtains the index values of the monitoring indexes from them.

c4. The acquisition center module performs the alarm operation on the monitoring index values according to the monitoring strategy. For an index subject to single index alarm, whether its index value is abnormal is calculated according to the index type, judgment condition and threshold set in the monitoring strategy, and an alarm is generated if it is abnormal. For an index requiring summary operation, the index values on all servers of the target distributed MPP cluster are summarized together, whether the result is abnormal is calculated according to the summarizing mode, summary judgment condition and summary judgment threshold, and an alarm is generated if it is abnormal.

c5. The acquisition center module judges whether to alarm according to the alarm mode and alarm level settings in the monitoring strategy. If so, it pushes the alarm information to the designated downstream system interface (mail, snmp, kafka or restful) according to the alarm push interface setting, and then saves the alarm data to the resource library module.

c6. The acquisition center module stores the index information acquired from all target distributed MPP cluster servers in the resource library module.

c7. The monitoring website module obtains the alarm information and index information of all monitored target distributed MPP clusters through the resource library module, providing users with visual alarm information viewing and index trend viewing for a plurality of distributed MPP clusters.
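The periodic acquisition described in step c3 can be sketched as a simple loop that waits for the acquisition period and then queries every acquisition agent. The following Python fragment is a minimal illustration under stated assumptions: the function name `run_collection_rounds`, the agent callables, the server names and the index name `cpu_usage` are hypothetical, and the wait function is injected so that the sketch can run without real delays. A real deployment would pass `time.sleep` and contact the agents over the network.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]


def run_collection_rounds(agents: Dict[str, Callable[[], Metrics]],
                          period_seconds: float,
                          rounds: int,
                          sleep: Callable[[float], None]) -> List[Dict[str, Metrics]]:
    """Wait one acquisition period, query every agent, and repeat."""
    history = []
    for _ in range(rounds):
        sleep(period_seconds)  # wait until the acquisition period elapses
        snapshot = {server: fetch() for server, fetch in agents.items()}
        history.append(snapshot)
    return history


# Stub agents returning fixed values; the waits are recorded instead of slept.
waits: List[float] = []
agents = {
    "node1": lambda: {"cpu_usage": 35.0},
    "node2": lambda: {"cpu_usage": 80.0},
}
history = run_collection_rounds(agents, 60, 2, waits.append)
```

Each snapshot in `history` corresponds to the index information that step c6 stores in the resource library module.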

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A monitoring method for monitoring and alarming a plurality of distributed MPP clusters is characterized by comprising the following steps:

s6, the monitoring website module acquires the information of the resource library module to realize visual alarm information viewing and index trend information viewing functions;

the alarm operation of the index value of the monitoring index in the step S4 includes a single index alarm operation and a summary index alarm operation;

the single index alarm operation process is to judge whether to generate an alarm or not by comparing the monitoring index value with the design threshold of the monitoring strategy;

the operation process of the summarizing index alarm is to summarize index values of single indexes of all collection agents under each distributed MPP cluster according to a design summarizing mode, compare the obtained values with a design threshold value of a monitoring strategy and judge whether an alarm is generated.

2. The monitoring method for monitoring and alarming a plurality of distributed MPP clusters according to claim 1, wherein: monitoring an acquisition period of the index, wherein the acquisition period is a time interval between two adjacent times of acquiring the state index and the performance index;

the setting of the single index judgment condition comprises the following steps: acquiring an index state of each monitoring index, setting an alarm threshold value of each monitoring index, displaying a page state of each monitoring index, an alarm state of each monitoring index, a recovery notification state of each monitoring index alarm, a continuous alarm state of each monitoring index, an overtime neglected duration of each monitoring index, and the number of continuous breakthrough alarms of each monitoring index;

the summarizing mode comprises the following steps: summing, averaging, alarm summing and maximum values of the index values;

the alarm mode comprises: an alarm sent by mail; an alarm sent via the simple network management protocol; an alarm sent via a message queue; and an alarm sent via a network application program interface.

3. The monitoring method for monitoring and alarming a plurality of distributed MPP clusters according to claim 1, wherein: the classification of the monitoring index in step S3 includes: available class, operating system class, process state class, cluster state class, database state class, execution state class.

4. The monitoring method for monitoring and alarming multiple distributed MPP clusters as claimed in claim 1, wherein the single index alarm calculation process comprises the following steps:

5. The monitoring method for monitoring and alarming multiple distributed MPP clusters according to claim 1, wherein the summary index alarm calculation process comprises the following steps:

6. The monitoring system used in the monitoring method for monitoring and alarming a plurality of distributed MPP clusters according to any one of claims 1 to 5, is characterized by comprising: the system comprises a resource library module, a WEB module, an acquisition center module and an acquisition agent module;

7. The monitoring system for monitoring and alarming a plurality of distributed MPP clusters as set forth in claim 6, wherein: the resource library module is used for storing system configuration data and system acquisition data;

8. The monitoring system for monitoring and alarming a plurality of distributed MPP clusters as set forth in claim 7, wherein: the WEB module is used for providing a visual operation panel for a user to perform related configuration of the system, and simultaneously providing display of all target distributed MPP cluster monitoring indexes and alarm information viewing;

9. The monitoring system for monitoring and alerting a plurality of distributed MPP clusters of claim 8, wherein: the acquisition agent module is used for receiving the data acquisition request sent by the acquisition center module;