
CN105024880A - Elastic monitoring method for key task computer cluster

Info

Publication number: CN105024880A
Application number: CN201510419779.5A
Authority: CN
Inventors: 王慧强; 戴秀豪; 冯光升; 吕宏武; 林俊宇
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2015-07-17
Filing date: 2015-07-17
Publication date: 2015-11-04

Abstract

The invention discloses an elastic monitoring method for a key task computer cluster. The method includes the following steps that: a management system is set for a main node in the cluster, and a monitoring agent is allocated for each sub node, and the monitoring agents of the sub nodes are responsible for acquiring monitoring data of the sub nodes and transmitting the monitoring data to the management system of the main node; when each monitoring period begins, the maximum load of each node in the cluster is calculated, if the maximum load of the sub nodes is larger than 70%, a high-load monitoring method is adopted to monitor the sub nodes; if the maximum load of the sub nodes is smaller than or equal to 70% and is larger than 30%, a normal load monitoring method is adopted to monitor the sub nodes; and if the maximum load of the sub nodes is smaller than or equal to 30%, a low-load monitoring method is adopted to monitor the sub nodes. When the load of the nodes of the cluster is high, unnecessary resource occupancy can be decreased, and the influence of the monitoring system on the performance of the cluster can be reduced; when the load of the nodes of the cluster is low, fine-grained monitoring can be adopted for the nodes.

Description

An elastic monitoring method for mission-critical computer clusters

Technical field

The invention relates to computer cluster monitoring methods, and in particular to an elastic monitoring method for mission-critical computer clusters.

Background technology

Computer systems today are exposed not only to external attacks; the system itself may also crash or fail because of factors such as its operating environment, running time, and workload. A monitoring system is therefore needed to monitor the computers in real time, detect potential threats promptly, and handle them before a failure occurs. This applies especially to mission-critical computers, which are deployed everywhere from deep-space exploration to seabed mapping, in varied and harsh environments where technical personnel cannot monitor and maintain the system on site. Ensuring their availability is therefore essential.

Most current monitoring software continuously occupies a considerable share of system resources. When a mission-critical computer is under high load, the monitoring software competes for computing resources and can cause the system to fail. When the load on a mission-critical computer cluster is low, by contrast, the monitoring system can use the cluster's idle computing resources to monitor at a finer granularity and analyze the cluster's operating state more comprehensively, detecting potential hazards promptly so that the cluster can keep working for long periods. To ensure the high availability of mission-critical computers, a monitoring method that is effective yet does not impair system availability is therefore very important.

Cluster monitoring methods are developing rapidly, but the resource usage of the monitoring system under high cluster load, which can seriously affect the stability of the cluster itself, has not yet been properly addressed. The patent "A construction method for a cluster management and monitoring system with an elastic architecture" (CN100366001C) proposes an extensible management framework that uses a two-layer architecture for small clusters and a three-layer architecture for large clusters. Its goal resembles that of the present invention, but the method it adopts and the problem it targets are different.

In summary, current monitoring methods for mission-critical computer clusters have the following problems:

(1) They seldom consider the load of the cluster system during monitoring; when the cluster load is high, the monitoring system can seriously degrade cluster performance and may even cause faults or crashes;

(2) When the load on the mission-critical computer cluster is low, they do not make full use of idle computing resources for finer-grained monitoring and management.

Summary of the invention

The object of the invention is to provide an elastic monitoring method for mission-critical computer clusters that improves system availability.

An elastic monitoring method for mission-critical computer clusters comprises the following steps:

Step 1: a management system is set up on the main node of the cluster and a monitoring agent is assigned to each child node. The monitoring agent of each child node collects the monitoring data of its own node and sends it to the management system of the main node. The monitoring data comprise: CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, network bandwidth utilization, CPU voltage, I/O state, disk state, and critical ports;

Step 2: at the start of each monitoring period, the management system of the main node uses the monitoring data received from the monitoring agents to calculate the maximum load of each node in the cluster, where the maximum load of a node is the mean of its CPU utilization, memory utilization, and network bandwidth utilization. If the maximum load of a child node is greater than 70%, go to step 3; if it is less than or equal to 70% and greater than 30%, go to step 4; if it is less than or equal to 30%, go to step 5;

Step 3: the management system of the main node monitors the child node using the high-load monitoring method, then goes to step 6;

Step 4: the management system of the main node monitors the child node using the normal-load monitoring method, then goes to step 6;

Step 5: the management system of the main node monitors the child node using the low-load monitoring method, then goes to step 6;

Step 6: return to step 1, until the task ends.
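The load evaluation in step 2 can be illustrated with a minimal Python sketch. The names `LoadSample`, `max_load`, and `select_scheme` are hypothetical and do not appear in the patent; only the averaging rule and the 70% and 30% boundaries are taken from the text.

```python
from dataclasses import dataclass

HIGH_LOAD_THRESHOLD = 70.0  # percent, from step 2
LOW_LOAD_THRESHOLD = 30.0   # percent, from step 2


@dataclass
class LoadSample:
    """Utilization figures for one child node, each in percent (0-100)."""
    cpu: float
    memory: float
    network: float


def max_load(sample: LoadSample) -> float:
    """The 'maximum load' as defined in step 2: the mean of the three rates."""
    return (sample.cpu + sample.memory + sample.network) / 3.0


def select_scheme(sample: LoadSample) -> str:
    """Return 'high', 'normal', or 'low' for the coming monitoring period."""
    load = max_load(sample)
    if load > HIGH_LOAD_THRESHOLD:
        return "high"      # step 3
    if load > LOW_LOAD_THRESHOLD:
        return "normal"    # step 4
    return "low"           # step 5
```

Note that the boundaries are asymmetric: a load of exactly 70% selects the normal-load method and a load of exactly 30% selects the low-load method, as the step wording requires.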

The elastic monitoring method for mission-critical computer clusters of the invention may further comprise the following:

1. The high-load monitoring method is:

(1) The monitoring agent of the child node monitors the node. The monitoring data are CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization, and they are stored temporarily in the local storage of the node;

(2) The monitoring agent of the child node sends a heartbeat packet to the management system of the main node every t seconds to inform the main node that the child node is operating normally;

(3) If the monitoring agent of the child node detects that one or more of CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization exceed their thresholds, it immediately sends the abnormal-state information to the main node, and the management system of the main node raises an alarm to the administrators based on the abnormal information received.
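The high-load behaviour of the agent (buffer locally, send only heartbeats, report immediately on a threshold breach) might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the metric key names, the JSON-lines buffer file, the `send_alert` callback, and the helper names are all hypothetical. The example threshold values are the ones the description gives later for its embodiment.

```python
import json
import time
from typing import Callable, Dict, List

# The six quantities monitored under high load (key names are hypothetical).
HIGH_LOAD_METRICS = ("cpu_util", "mem_util", "cache_miss_rate",
                     "cpu_temp", "virtual_memory", "net_util")

# Example thresholds W1..W6 as listed in the description's embodiment.
THRESHOLDS = {"cpu_util": 90.0, "mem_util": 90.0, "cache_miss_rate": 10.0,
              "cpu_temp": 80.0, "virtual_memory": 4000.0, "net_util": 90.0}


def exceeded(sample: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Names of monitored metrics whose value is above its threshold."""
    return [m for m in HIGH_LOAD_METRICS if sample[m] > thresholds[m]]


def heartbeat_due(last_sent: float, now: float, interval_s: float) -> bool:
    """True when at least t seconds have passed since the last heartbeat."""
    return now - last_sent >= interval_s


def high_load_step(sample: Dict[str, float],
                   thresholds: Dict[str, float],
                   buffer_path: str,
                   send_alert: Callable[[List[str]], None]) -> List[str]:
    """One sampling step: buffer locally, contact the main node only on a breach."""
    record = {m: sample[m] for m in HIGH_LOAD_METRICS}
    record["timestamp"] = time.time()
    # Item (1): append to local temporary storage instead of reporting.
    with open(buffer_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    # Item (3): report abnormal state immediately.
    breaches = exceeded(sample, thresholds)
    if breaches:
        send_alert(breaches)
    return breaches
```

Appending one JSON object per line keeps each write small, which fits the stated goal of minimizing monitoring overhead while the node is busy.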

2. The normal-load monitoring method is:

(1) The monitoring agent of the child node checks whether a temporarily stored monitoring data file exists locally; if so, it packs the stored data, sends them to the management system of the main node, and deletes the temporary file that has been sent;

(2) The management system instructs the monitoring agent of the child node to monitor the node comprehensively. The monitored data comprise: CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, network bandwidth utilization, CPU voltage, I/O state, disk state, and critical ports;

(3) If the monitoring agent of the child node detects that one or more of CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization exceed their thresholds, it immediately sends the abnormal-state information to the main node, and the management system of the main node raises an alarm to the administrators based on the abnormal information received;

(4) Based on the monitoring data of the child node, the management system of the main node determines whether any of the following anomalies has occurred: I/O response timeout, disk read/write fault, CPU voltage exceeding its threshold, or a critical port being illegally occupied. If so, it raises an alarm to the administrators.
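Item (1) of the normal-load method, forwarding whatever was buffered during a high-load period, could look roughly like the sketch below. It assumes the buffer is a text file with one JSON record per line and that `send_batch` is some transport function supplied by the caller; both are illustrative assumptions, since the patent does not specify a file format or a transport.

```python
import json
import os
from typing import Callable, List


def flush_buffer(buffer_path: str,
                 send_batch: Callable[[List[dict]], None]) -> int:
    """Send locally buffered records to the main node, then delete the file.

    Returns the number of records sent (0 when no buffer file exists).
    """
    if not os.path.exists(buffer_path):
        return 0
    with open(buffer_path, "r", encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    send_batch(records)       # if this raises, the file is kept for a retry
    os.remove(buffer_path)    # delete only after a successful send
    return len(records)
```

Deleting the file only after `send_batch` returns means a failed transfer leaves the data in place for the next monitoring period.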

3. The low-load monitoring method is:

(2) The management system of the main node instructs the monitoring agent of the child node to monitor the child node comprehensively. The monitored data comprise: CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, network bandwidth utilization, CPU voltage, I/O state, disk state, and critical ports;

(4) Based on the monitoring data of the child node, the management system of the main node determines whether any of the following anomalies has occurred: I/O response timeout, disk read/write fault, CPU voltage exceeding its threshold, or a critical port being illegally occupied. If so, it raises an alarm to the administrators;

(5) The management system of the main node calculates the unhealthy degree P from the monitoring data of the child node:

P = \left( \lambda_{1} \frac{w_{1}}{W_{1}} + \lambda_{2} \frac{w_{2}}{W_{2}} + \lambda_{3} \frac{w_{3}}{W_{3}} + \lambda_{4} \frac{w_{4}}{W_{4}} + \lambda_{5} \frac{w_{5}}{W_{5}} + \lambda_{6} \frac{w_{6}}{W_{6}} \right) \times 100

where w₁ is the CPU utilization, w₂ the memory utilization, w₃ the cache miss rate, w₄ the CPU temperature, w₅ the virtual memory, and w₆ the network bandwidth utilization; W₁, W₂, W₃, W₄, W₅, W₆ are the predefined thresholds of CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization, respectively; and λ₁, λ₂, λ₃, λ₄, λ₅, λ₆ are the respective weights of these monitoring data;

If the unhealthy degree is greater than the threshold, the management system of the main node raises an alarm to the administrators.
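The weighted sum for P translates directly into code. The following is a minimal sketch; the function name `unhealthy_degree` and the argument order are assumptions, while the formula itself is the one stated above.

```python
from typing import Sequence


def unhealthy_degree(values: Sequence[float],
                     thresholds: Sequence[float],
                     weights: Sequence[float]) -> float:
    """Compute P = 100 * sum(lambda_i * w_i / W_i) over the monitored metrics.

    values     -- measured w_1..w_n
    thresholds -- predefined W_1..W_n, in the same units as the values
    weights    -- lambda_1..lambda_n
    """
    if not (len(values) == len(thresholds) == len(weights)):
        raise ValueError("values, thresholds and weights must have equal length")
    return 100.0 * sum(lam * w / W
                       for lam, w, W in zip(weights, values, thresholds))
```

Because each term is a ratio of a measurement to its own threshold, the units cancel. When the weights sum to 1, a node with every metric exactly at its threshold scores P = 100.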

Beneficial effects:

The invention proposes an elastic monitoring method for mission-critical computer clusters that adopts different monitoring schemes according to the load of the cluster. When a cluster node is under high load, unnecessary resource usage is reduced as far as possible, limiting the impact of the monitoring system on cluster performance. When a cluster node is under low load, the node is monitored at a finer granularity and the cluster's computing resources are fully used to analyze the cluster's health. The high availability of the system is thus maintained while the impact of the monitoring system on the availability of the mission-critical computer cluster is reduced.

The invention adopts different monitoring schemes according to the load of the mission-critical computer cluster and uses the cluster's resources flexibly, making the greatest possible use of them without affecting the cluster's resources;

When the mission-critical computer cluster is under high load, the invention reduces the resources occupied by monitoring as far as possible, lowering the probability that the cluster fails. When the cluster is under low load, it monitors the cluster at a finer granularity and analyzes system health, ensuring that the cluster can keep working for long periods.

Brief description of the drawings

Fig. 1 is a flow chart of the elastic monitoring method for mission-critical computer clusters.

Detailed description

The invention is described in further detail below with reference to the accompanying drawing.

The object of the invention is to provide an elastic monitoring method for mission-critical computer clusters that solves the cluster monitoring problem under different loads. The invention is characterized in that three monitoring schemes are adopted for the three cluster operating states of low load, normal load, and high load, so that cluster computing resources are fully used for finer-grained, comprehensive monitoring without affecting cluster availability.

In the elastic monitoring method for mission-critical computer clusters of the invention, a management system is first set up on the main node of the cluster and a monitoring agent is assigned to each child node. The monitoring agent of each child node collects the monitoring data of its own node and sends it to the management system of the main node. At the start of each monitoring period, the main node uses the monitoring data received from the monitoring agents to calculate the maximum load of each node in the cluster, expressed on a scale of 0 to 100%. The maximum load is the mean of CPU utilization, memory utilization, and network bandwidth utilization. One of three monitoring schemes is then adopted according to the maximum load.

(1) Low-load monitoring scheme: if the maximum load of a child node is less than or equal to 30%, the low-load monitoring scheme is adopted. Under this scheme, the monitoring flow of the child node is as follows:

1. The monitoring agent of the child node checks whether a temporarily stored monitoring data file exists locally; if so, it packs the stored data, sends them to the management system of the main node, and deletes the temporary file that has been sent;

2. The management system instructs the monitoring agent of the child node to monitor the node comprehensively. The monitored data comprise 10 items: CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, network bandwidth utilization, CPU voltage, I/O state, disk state, and critical ports (ports 21, 23, 25, and 80);

3. If the monitoring agent of the child node detects that CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, or network bandwidth utilization exceeds its threshold, that is, one or more of w₁ > W₁, w₂ > W₂, w₃ > W₃, w₄ > W₄, w₅ > W₅, w₆ > W₆ holds, it immediately sends the abnormal-state information to the main node, and the monitoring management system of the main node raises an alarm and notifies the administrator to perform maintenance;

4. Based on the monitoring data of the child node, the management system of the main node raises an alarm and notifies the administrator to perform timely maintenance when any of the following anomalies occurs: I/O response timeout, disk read/write fault, CPU voltage exceeding its threshold, or a critical port being illegally occupied;

5. For CPU utilization (w₁), memory utilization (w₂), cache miss rate (w₃), CPU temperature (w₄), virtual memory (w₅), and network bandwidth utilization (w₆), the unhealthy degree P of the system is calculated using the AHP (analytic hierarchy process) algorithm:

P = \left( \lambda_{1} \frac{w_{1}}{W_{1}} + \lambda_{2} \frac{w_{2}}{W_{2}} + \lambda_{3} \frac{w_{3}}{W_{3}} + \lambda_{4} \frac{w_{4}}{W_{4}} + \lambda_{5} \frac{w_{5}}{W_{5}} + \lambda_{6} \frac{w_{6}}{W_{6}} \right) \times 100 \qquad (1)

where W₁, W₂, W₃, W₄, W₅, W₆ are the predefined thresholds of CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization, respectively, and λ₁, λ₂, λ₃, λ₄, λ₅, λ₆ are the respective weights of these monitoring data. The unhealthy degree P is calculated according to formula (1); if P > 70, the node is in an unhealthy state and an alarm promptly notifies the administrators to perform maintenance.

(2) Normal-load monitoring scheme: if the maximum load of a child node is greater than 30% and less than or equal to 70%, the normal-load monitoring scheme is adopted. Under this scheme, the monitoring flow of the child node is as follows:

4. Based on the monitoring data of the child node, the management system of the main node raises an alarm and notifies the administrator to perform timely maintenance when any of the following anomalies occurs: I/O response timeout, disk read/write fault, CPU voltage exceeding its threshold, or a critical port being illegally occupied;

(3) High-load monitoring scheme: if the maximum load of a child node is greater than 70%, the high-load monitoring scheme is adopted. Under this scheme, the monitoring flow of the child node is as follows:

1. Under the high-load monitoring scheme, the monitoring agent of the child node no longer reports local monitoring data in real time. The agent monitors the node, the monitored data comprising 6 items: CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization, and stores the monitoring data temporarily in the local storage of the node;

2. The monitoring agent of the child node sends a heartbeat packet to the management system of the main node every t seconds, which maintains the most basic communication between the main node and the child node and informs the main node that the child node is operating normally;

As shown in Fig. 1, the flow of the elastic monitoring method for mission-critical computer clusters of the invention is as follows:

(1) The main node obtains the monitoring data of the child nodes through the monitoring agents, then goes to step (2);

(2) The main node evaluates the load from the monitoring data obtained: if the load of a child node is greater than 70%, go to step (3); if it is less than or equal to 70% and greater than 30%, go to step (4); if it is less than or equal to 30%, go to step (5);

(3) The main node monitors the child node using the high-load monitoring scheme, then goes to step (6);

(4) The main node monitors the child node using the normal-load monitoring scheme, then goes to step (6);

(5) The main node monitors the child node using the low-load monitoring scheme, then goes to step (6);

(6) The management system of the main node determines whether any node is abnormal; if an anomaly is found, go to step (7), otherwise return to step (1);

(7) The management system of the main node sends the abnormal information of the child node to the administrators as an alert by e-mail, SMS, or similar means, then returns to step (1).
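One pass through steps (1) to (7) can be pictured as a dispatch loop on the main node. The sketch below is only a schematic reading of the flow chart: `get_load`, the `monitors` table of per-scheme callables, and `notify` are hypothetical stand-ins for the data collection, the three monitoring schemes, and the e-mail or SMS alert channel.

```python
from typing import Callable, Dict, Iterable, List


def run_monitoring_period(
    nodes: Iterable[str],
    get_load: Callable[[str], float],
    monitors: Dict[str, Callable[[str], List[str]]],
    notify: Callable[[str, List[str]], None],
) -> Dict[str, str]:
    """Run one monitoring period and return the scheme chosen for each node."""
    chosen: Dict[str, str] = {}
    for node in nodes:
        load = get_load(node)                 # steps (1)-(2): load in percent
        if load > 70.0:
            scheme = "high"                   # step (3)
        elif load > 30.0:
            scheme = "normal"                 # step (4)
        else:
            scheme = "low"                    # step (5)
        chosen[node] = scheme
        anomalies = monitors[scheme](node)    # run the selected scheme
        if anomalies:                         # step (6)
            notify(node, anomalies)           # step (7)
    return chosen
```

Calling this function repeatedly, once per monitoring period, corresponds to the "return to step (1)" branches of the flow.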

In the elastic monitoring method for mission-critical computer clusters of the invention, a management system is first set up on the main node of the cluster and a monitoring agent is assigned to each child node. The monitoring agent of each child node collects the monitoring data of its own node and sends it to the management system of the main node. At the start of each monitoring period, the main node uses the monitoring data received from the monitoring agents to calculate the maximum load of each node in the cluster, expressed on a scale of 0 to 100%. The maximum load is the mean of CPU utilization, memory utilization, and network bandwidth utilization. One of three monitoring schemes is then adopted according to the maximum load.

(1) Low-load monitoring scheme: if the maximum load of a child node is less than or equal to 30%, the low-load monitoring scheme is adopted. Under this scheme, the monitoring flow of the node is as follows:

5. For CPU utilization (w₁), memory utilization (w₂), cache miss rate (w₃), CPU temperature (w₄), virtual memory (w₅), and network bandwidth utilization (w₆), formula (1) is used to calculate the unhealthy degree P of the child node, where W₁, W₂, W₃, W₄, W₅, W₆ are 90%, 90%, 10%, 80 °C, 4000M, and 90%, the thresholds of CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization, respectively, and λ₁, λ₂, λ₃, λ₄, λ₅, λ₆ are 0.25, 0.25, 0.1, 0.15, 0.1, and 0.15, respectively. If the P calculated by formula (1) is greater than 70, the node is in an unhealthy state and an alarm notifies the administrators to perform maintenance.

(2) Normal-load monitoring scheme: if the maximum load of a child node is greater than 30% and less than or equal to 70%, the normal-load monitoring scheme is adopted. Under this scheme, the monitoring flow of the node is as follows:

The elastic monitoring method for mission-critical computer clusters of the invention is described in detail below with reference to specific embodiments.

Embodiment 1:

As shown in Fig. 1, when the load of the mission-critical computer cluster is less than 70%, with CPU utilization w₁ = 68%, memory utilization w₂ = 65%, cache miss rate w₃ = 8%, CPU temperature w₄ = 70 °C, virtual memory w₅ = 3800M, network bandwidth utilization w₆ = 80%, and CPU voltage, I/O state, disk state, and critical ports all normal, the workflow is as follows:

(1) The main node obtains the monitoring data of the child node through the monitoring agents;

(2) The main node evaluates the load from the monitoring data of the child node; the load of the child node is less than 70%;

(3) At the start of each monitoring period of the child node, the monitoring agent checks whether a temporarily stored monitoring data file exists locally; if so, it packs the stored data, sends them to the management system of the main node, and deletes the temporary file that has been sent;

(4) According to the monitoring data of the child node received by the management system of the main node, CPU utilization w₁, memory utilization w₂, cache miss rate w₃, CPU temperature w₄, virtual memory, and network bandwidth utilization are all within their normal ranges, and CPU voltage, I/O state, disk state, and critical ports are all normal;

(5) For the monitoring data CPU utilization w₁ = 68%, memory utilization w₂ = 65%, cache miss rate w₃ = 8%, CPU temperature w₄ = 70 °C, virtual memory w₅ = 3800M, and network bandwidth utilization w₆ = 80%, formula (1) is used to calculate the unhealthy degree P of the child node, where W₁, W₂, W₃, W₄, W₅, W₆ are 90%, 90%, 10%, 80 °C, 4000M, and 90%, respectively, and λ₁, λ₂, λ₃, λ₄, λ₅, λ₆ are 0.25, 0.25, 0.1, 0.15, 0.1, and 0.15, respectively. The calculated unhealthy degree is P' = 81 > 70, so the node is in an unhealthy state and an alarm notifies the administrators to perform maintenance.
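For reference, substituting the values of this embodiment into formula (1) gives the stated result. The intermediate terms below are rounded to four decimal places:

```latex
\begin{aligned}
P' &= \left( 0.25 \cdot \tfrac{68}{90} + 0.25 \cdot \tfrac{65}{90} + 0.1 \cdot \tfrac{8}{10}
      + 0.15 \cdot \tfrac{70}{80} + 0.1 \cdot \tfrac{3800}{4000} + 0.15 \cdot \tfrac{80}{90} \right) \times 100 \\
   &\approx (0.1889 + 0.1806 + 0.0800 + 0.1313 + 0.0950 + 0.1333) \times 100 \\
   &\approx 80.9 \approx 81 > 70
\end{aligned}
```

No single metric exceeds its own threshold here, yet the weighted sum still crosses 70, which is why the low-load scheme raises an alarm on the aggregate score.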

Embodiment 2:

As shown in Fig. 1, when the load of the mission-critical computer cluster is greater than 70% and, among the monitoring data, the CPU temperature w₄ = 85 °C exceeds its threshold while the other monitoring data are within their normal ranges, the workflow is as follows:

(2) The main node evaluates the load of the child node from the monitoring data obtained; the load is greater than 70%;

(3) Under the high-load monitoring scheme, the monitoring agent of the child node no longer reports local monitoring data in real time, but stores the monitoring data temporarily in the local storage of the node;

(4) The monitoring agent of the child node sends a heartbeat packet to the management system of the main node every 5 seconds, which maintains the most basic communication between the main node and the child node and informs the main node that the child node is operating normally;

(5) The monitoring agent of the child node detects CPU temperature w₄ = 85 °C > W₄, which exceeds the threshold. It immediately sends the abnormal-state information to the main node, and the monitoring management system of the main node raises an alarm and notifies the administrator to perform maintenance.

Claims

1. An elastic monitoring method for mission-critical computer clusters, characterized in that it comprises the following steps:

Step 6: return to step 1, until the task ends.

2. The elastic monitoring method for mission-critical computer clusters according to claim 1, characterized in that the high-load monitoring method is:

3. The elastic monitoring method for mission-critical computer clusters according to claim 1, characterized in that the normal-load monitoring method is:

4. The elastic monitoring method for mission-critical computer clusters according to claim 1, characterized in that the low-load monitoring method is:

P = \left( \lambda_{1} \frac{w_{1}}{W_{1}} + \lambda_{2} \frac{w_{2}}{W_{2}} + \lambda_{3} \frac{w_{3}}{W_{3}} + \lambda_{4} \frac{w_{4}}{W_{4}} + \lambda_{5} \frac{w_{5}}{W_{5}} + \lambda_{6} \frac{w_{6}}{W_{6}} \right) \times 100