CN105024880A - Elastic monitoring method for key task computer cluster - Google Patents


Info

Publication number
CN105024880A
Authority
CN
China
Prior art keywords
node
monitoring
child node
management system
host node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510419779.5A
Other languages
Chinese (zh)
Inventor
王慧强
戴秀豪
冯光升
吕宏武
林俊宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN201510419779.5A
Publication of CN105024880A

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention discloses an elastic monitoring method for a mission-critical computer cluster. A management system is set up on the master node of the cluster and a monitoring agent is assigned to each child node; each agent collects the monitoring data of its own node and sends it to the management system on the master node. At the start of each monitoring period, the maximum load of each node in the cluster is calculated. If a child node's maximum load is greater than 70%, the high-load monitoring method is used for it; if it is less than or equal to 70% and greater than 30%, the normal-load monitoring method is used; if it is less than or equal to 30%, the low-load monitoring method is used. When cluster nodes are heavily loaded, unnecessary resource usage is reduced and the monitoring system's impact on cluster performance is lessened; when they are lightly loaded, finer-grained monitoring is applied to them.

Description

An elastic monitoring method for mission-critical computer clusters
Technical field
The invention relates to computer cluster monitoring methods, and in particular to an elastic monitoring method for mission-critical computer clusters.
Background
Today's computer systems are not only exposed to external attacks; the system itself may also crash or fail because of factors such as its operating environment, running time, and workload. A monitoring system is therefore needed to watch the computers in real time, discover potential threats promptly, and deal with them before a failure occurs. This matters especially for mission-critical computers, which are deployed everywhere from deep-space exploration to seabed mapping, in varied and harsh environments where technical staff cannot monitor and maintain the system on site, so ensuring their availability is essential.
Most current monitoring software continuously occupies a considerable share of system resources. When a mission-critical computer is under high load, the monitoring software competes for computing resources and can cause the system to malfunction. When the load on a mission-critical computer cluster is low, on the other hand, the monitoring system could use the cluster's idle computing resources to monitor at a finer granularity and analyze the cluster's operating state more comprehensively, discovering hidden problems in time so that the cluster can keep working over the long term. To guarantee the high availability of mission-critical computers, it is therefore important to adopt a monitoring method that is effective and does not impair system availability.
Cluster monitoring methods are developing rapidly, but the resource usage of the monitoring system under high cluster load, which can seriously affect the stability of the cluster itself, has not yet been properly addressed. The patent "A construction method for a cluster management and monitoring system with an elastic system architecture" (CN100366001C) proposes an extensible management framework that uses a two-tier architecture for small clusters and a three-tier architecture for large ones. Its goal has something in common with that of the present invention, but the method it uses and the problem it targets are different.
In summary, current monitoring methods for mission-critical computer clusters still have the following problems:
(1) The load of the cluster is seldom considered during monitoring; when the cluster load is high, the monitoring system can seriously degrade cluster performance and may even cause failures or crashes;
(2) When the load of the mission-critical computer cluster is low, idle computing resources are not fully used for finer-grained monitoring and management.
Summary of the invention
The object of the invention is to provide an elastic monitoring method for mission-critical computer clusters that improves system availability.
An elastic monitoring method for mission-critical computer clusters comprises the following steps:
Step 1: A management system is set up on the master node of the cluster and a monitoring agent is assigned to each child node. The monitoring agent of each child node collects the monitoring data of its own node and sends it to the management system on the master node. The monitoring data comprise: CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, network bandwidth utilization, CPU voltage, I/O status, disk status, and critical ports;
Step 2: Based on the monitoring data received from the monitoring agents, the management system on the master node calculates the maximum load of each node in the cluster at the start of each monitoring period. A node's maximum load is the average of its CPU utilization, memory utilization, and network bandwidth utilization. If the maximum load of a child node is greater than 70%, go to step 3; if it is less than or equal to 70% and greater than 30%, go to step 4; if it is less than or equal to 30%, go to step 5;
Step 3: The management system on the master node monitors the child node with the high-load monitoring method; go to step 6;
Step 4: The management system on the master node monitors the child node with the normal-load monitoring method; go to step 6;
Step 5: The management system on the master node monitors the child node with the low-load monitoring method; go to step 6;
Step 6: Return to step 1, until the task ends.
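The load calculation and branching of step 2 can be illustrated with a minimal Python sketch. The function names `node_load` and `select_scheme` and the string labels are illustrative assumptions; only the averaging rule and the 70% and 30% boundaries come from the text.

```python
from statistics import mean

HIGH_LOAD = 70.0  # percent; upper boundary stated in step 2
LOW_LOAD = 30.0   # percent; lower boundary stated in step 2


def node_load(cpu_util: float, mem_util: float, net_util: float) -> float:
    """Load of a node in percent: the average of the three utilizations."""
    return mean([cpu_util, mem_util, net_util])


def select_scheme(load: float) -> str:
    """Map a load percentage to one of the three monitoring methods."""
    if load > HIGH_LOAD:
        return "high"    # step 3: high-load monitoring method
    if load > LOW_LOAD:
        return "normal"  # step 4: normal-load monitoring method
    return "low"         # step 5: low-load monitoring method
```

Both boundaries are inclusive on the lower-load side, so a load of exactly 70% selects the normal-load method and a load of exactly 30% selects the low-load method.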
The elastic monitoring method for mission-critical computer clusters of the invention may further comprise the following:
1. The high-load monitoring method is:
(1) The monitoring agent of the child node monitors the node; the monitoring data are CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization, and they are stored temporarily in the local storage of the node;
(2) The monitoring agent of the child node sends a heartbeat packet to the management system on the master node every t seconds to inform the master node that the child node is operating normally;
(3) If the monitoring agent of the child node detects that one or more of CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization exceed their thresholds, it immediately sends this anomaly information to the master node, and the management system on the master node raises an alarm to the administrator based on the anomaly information received.
2. The normal-load monitoring method is:
(1) The monitoring agent of the child node checks whether a temporarily stored monitoring data file exists locally; if so, it packages the stored data, sends it to the management system on the master node, and deletes the temporary files that have been sent;
(2) The management system instructs the monitoring agent of the child node to monitor the node comprehensively; the monitored data comprise: CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, network bandwidth utilization, CPU voltage, I/O status, disk status, and critical ports;
(3) If the monitoring agent of the child node detects that one or more of CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization exceed their thresholds, it immediately sends this anomaly information to the master node, and the management system on the master node raises an alarm to the administrator based on the anomaly information received;
(4) Based on the child node's monitoring data, the management system on the master node determines whether any of the following anomalies has occurred: an I/O response timeout, a disk read/write fault, a CPU voltage above its threshold, or a critical port being illegally occupied. If so, the management system on the master node raises an alarm to the administrator.
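The interplay between temporary local storage under high load, the heartbeat every t seconds, and the later hand-over of stored data in step (1) of the normal-load method might look like the following sketch. The class name `BufferingAgent`, the JSON-lines file format, and the file path are assumptions for illustration; the text does not specify a storage format, and the network transfer itself is left out.

```python
import json
import os


class BufferingAgent:
    """Hypothetical child-node agent that defers reporting under high load."""

    def __init__(self, buffer_path: str, heartbeat_interval: float):
        self.buffer_path = buffer_path                # assumed local buffer file
        self.heartbeat_interval = heartbeat_interval  # the "t" seconds in the text
        self._last_heartbeat = None

    def store_locally(self, sample: dict) -> None:
        """High-load step (1): append one sample to the local buffer file."""
        with open(self.buffer_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(sample) + "\n")

    def heartbeat_due(self, now: float) -> bool:
        """High-load step (2): True once t seconds have passed since the last beat."""
        if (self._last_heartbeat is None
                or now - self._last_heartbeat >= self.heartbeat_interval):
            self._last_heartbeat = now
            return True
        return False

    def flush(self) -> list:
        """Normal-load step (1): return buffered samples and delete the file."""
        if not os.path.exists(self.buffer_path):
            return []
        with open(self.buffer_path, encoding="utf-8") as f:
            samples = [json.loads(line) for line in f if line.strip()]
        os.remove(self.buffer_path)
        return samples
```

A caller would pass the list returned by `flush()` to whatever transport carries data to the management system; deleting the file only after a confirmed send would be a safer variant than the one shown.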
3. The low-load monitoring method is:
(1) The monitoring agent of the child node checks whether a temporarily stored monitoring data file exists locally; if so, it packages the stored data, sends it to the management system on the master node, and deletes the temporary files that have been sent;
(2) The management system on the master node instructs the monitoring agent of the child node to monitor the child node comprehensively; the monitored data comprise: CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, network bandwidth utilization, CPU voltage, I/O status, disk status, and critical ports;
(3) If the monitoring agent of the child node detects that one or more of CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization exceed their thresholds, it immediately sends this anomaly information to the master node, and the management system on the master node raises an alarm to the administrator based on the anomaly information received;
(4) Based on the child node's monitoring data, the management system on the master node determines whether any of the following anomalies has occurred: an I/O response timeout, a disk read/write fault, a CPU voltage above its threshold, or a critical port being illegally occupied. If so, the management system on the master node raises an alarm to the administrator;
(5) The management system on the master node calculates the unhealthiness degree P from the child node's monitoring data:
$$P = \left( \lambda_1 \frac{w_1}{W_1} + \lambda_2 \frac{w_2}{W_2} + \lambda_3 \frac{w_3}{W_3} + \lambda_4 \frac{w_4}{W_4} + \lambda_5 \frac{w_5}{W_5} + \lambda_6 \frac{w_6}{W_6} \right) \times 100$$
where $w_1$, $w_2$, $w_3$, $w_4$, $w_5$, $w_6$ are the CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization; $W_1$, $W_2$, $W_3$, $W_4$, $W_5$, $W_6$ are the predefined thresholds of these six quantities, respectively; and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$, $\lambda_5$, $\lambda_6$ are the weights of the respective monitoring data;
If the unhealthiness degree exceeds its threshold, the management system on the master node raises an alarm to the administrator.
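The weighted sum for P translates directly into code. The following is a minimal sketch; the function name `unhealthiness` and the list-based argument layout are assumptions, and the readings, thresholds, and weights must be supplied in the same order.

```python
def unhealthiness(values, thresholds, weights):
    """Return P = 100 * sum(lambda_i * w_i / W_i) over all monitored quantities."""
    if not (len(values) == len(thresholds) == len(weights)):
        raise ValueError("values, thresholds and weights must have the same length")
    return 100.0 * sum(
        lam * w / W for lam, w, W in zip(weights, values, thresholds)
    )
```

When the weights sum to 1 and every reading sits exactly at its threshold, the result is 100, so P can be read as a weighted percentage of how close the node is to its limits.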
Beneficial effects:
The invention proposes an elastic monitoring method for mission-critical computer clusters that adopts different monitoring schemes according to the load of the cluster. When a cluster node is under high load, unnecessary resource usage is reduced as far as possible, lessening the monitoring system's impact on cluster performance. When a cluster node is under low load, it is monitored at a finer granularity and the cluster's computing resources are fully used to analyze its health, so that the high availability of the system is maintained while the monitoring system's impact on the availability of the mission-critical computer cluster is kept small.
The invention adopts different monitoring schemes according to the load of the mission-critical computer cluster and uses the cluster's resources flexibly, making full use of them without affecting the cluster;
When the mission-critical computer cluster is under high load, the invention reduces the resources consumed by monitoring as far as possible, lowering the probability that the cluster fails; when the cluster is under low load, it monitors the cluster at a finer granularity and analyzes system health, ensuring that the cluster can keep working over the long term.
Brief description of the drawings
Fig. 1 is a flow chart of the elastic monitoring method for mission-critical computer clusters.
Detailed description
The invention is described in further detail below with reference to the drawing.
The object of the invention is to provide an elastic monitoring method for mission-critical computer clusters that solves the problem of monitoring a cluster under different loads. Its distinguishing feature is that three monitoring schemes are adopted for three cluster operating states (low load, normal load, and high load), so that the cluster's computing resources are fully used for finer-grained, comprehensive monitoring without affecting cluster availability.
In the method, a management system is first set up on the master node of the cluster and a monitoring agent is assigned to each child node; the monitoring agent of each child node collects the monitoring data of its own node and sends it to the management system on the master node. Based on the monitoring data received from the monitoring agents, the master node calculates the maximum load of each node in the cluster at the start of each monitoring period; the load level is expressed as 0 to 100%. The maximum load is the average of CPU utilization, memory utilization, and network bandwidth utilization. One of three monitoring schemes is then adopted according to the maximum load.
(1) Low-load monitoring scheme: if the maximum load of a child node is less than or equal to 30%, the low-load monitoring scheme is adopted. Under this scheme, the child node is monitored as follows:
1. The monitoring agent of the child node checks whether a temporarily stored monitoring data file exists locally; if so, it packages the stored data, sends it to the management system on the master node, and deletes the temporary files that have been sent;
2. The management system instructs the monitoring agent of the child node to monitor the node comprehensively; the monitored data comprise 10 items: CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, network bandwidth utilization, CPU voltage, I/O status, disk status, and critical ports (ports 21, 23, 25, and 80);
3. If the monitoring agent of the child node detects that CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, or network bandwidth utilization exceeds its threshold, that is, one or more of $w_1 > W_1$, $w_2 > W_2$, $w_3 > W_3$, $w_4 > W_4$, $w_5 > W_5$, $w_6 > W_6$ holds, it immediately sends this anomaly information to the master node, and the monitoring management system on the master node raises an alarm and notifies the administrator to perform maintenance.
4. Based on the child node's monitoring data, the management system on the master node raises an alarm and notifies the administrator to perform timely maintenance when an I/O response times out, a disk read/write fault occurs, the CPU voltage exceeds its threshold, or a critical port is illegally occupied;
5. For CPU utilization ($w_1$), memory utilization ($w_2$), cache miss rate ($w_3$), CPU temperature ($w_4$), virtual memory ($w_5$), and network bandwidth utilization ($w_6$), the unhealthiness degree P of the system is calculated using the AHP (analytic hierarchy process) algorithm:
$$P = \left( \lambda_1 \frac{w_1}{W_1} + \lambda_2 \frac{w_2}{W_2} + \lambda_3 \frac{w_3}{W_3} + \lambda_4 \frac{w_4}{W_4} + \lambda_5 \frac{w_5}{W_5} + \lambda_6 \frac{w_6}{W_6} \right) \times 100 \qquad (1)$$
where $W_1$, $W_2$, $W_3$, $W_4$, $W_5$, $W_6$ are the predefined thresholds of CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization, respectively, and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$, $\lambda_5$, $\lambda_6$ are the weights of the respective monitoring data. The unhealthiness degree P is calculated with formula (1); if P > 70, the node is in an unhealthy state and an alarm is raised promptly to notify the administrator to perform maintenance.
(2) Normal-load monitoring scheme: if the maximum load of a child node is greater than 30% and less than or equal to 70%, the normal-load monitoring scheme is adopted. Under this scheme, the child node is monitored as follows:
1. The monitoring agent of the child node checks whether a temporarily stored monitoring data file exists locally; if so, it packages the stored data, sends it to the management system on the master node, and deletes the temporary files that have been sent;
2. The management system instructs the monitoring agent of the child node to monitor the node comprehensively; the monitored data comprise 10 items: CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, network bandwidth utilization, CPU voltage, I/O status, disk status, and critical ports (ports 21, 23, 25, and 80);
3. If the monitoring agent of the child node detects that CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, or network bandwidth utilization exceeds its threshold, that is, one or more of $w_1 > W_1$, $w_2 > W_2$, $w_3 > W_3$, $w_4 > W_4$, $w_5 > W_5$, $w_6 > W_6$ holds, it immediately sends this anomaly information to the master node, and the monitoring management system on the master node raises an alarm and notifies the administrator to perform maintenance.
4. Based on the child node's monitoring data, the management system on the master node raises an alarm and notifies the administrator to perform timely maintenance when an I/O response times out, a disk read/write fault occurs, the CPU voltage exceeds its threshold, or a critical port is illegally occupied;
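The four checks that the management system performs in item 4 could be sketched as follows. The record type `ExtendedSample`, its field names, the numeric limits passed in by the caller, and the idea of an `allowed_owners` table for deciding whether a port is illegally occupied are all assumptions; the text names only the four anomaly types and ports 21, 23, 25, and 80.

```python
from dataclasses import dataclass, field

CRITICAL_PORTS = {21, 23, 25, 80}  # the ports named in the text


@dataclass
class ExtendedSample:
    """Hypothetical record of the four additional items reported by an agent."""
    io_response_ms: float
    disk_rw_ok: bool
    cpu_voltage: float
    port_owners: dict = field(default_factory=dict)  # port -> process name


def extended_anomalies(s, io_timeout_ms, max_voltage, allowed_owners):
    """Return a description of every anomaly found in one child node's sample."""
    found = []
    if s.io_response_ms > io_timeout_ms:
        found.append("I/O response timeout")
    if not s.disk_rw_ok:
        found.append("disk read/write fault")
    if s.cpu_voltage > max_voltage:
        found.append("CPU voltage above threshold")
    for port in sorted(CRITICAL_PORTS):
        owner = s.port_owners.get(port)
        # A port counts as illegally occupied if it is held by an unexpected process.
        if owner is not None and owner != allowed_owners.get(port):
            found.append(f"critical port {port} occupied by {owner}")
    return found
```

An empty result means none of the four conditions was met; any non-empty result would trigger the alarm to the administrator.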
(3) High-load monitoring scheme: if the maximum load of a child node is greater than 70%, the high-load monitoring scheme is adopted. Under this scheme, the child node is monitored as follows:
1. Under the high-load monitoring scheme, the monitoring agent of the child node no longer reports local monitoring data in real time. It monitors the node, collecting 6 items: CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization, and stores the monitoring data temporarily in the local storage of the node;
2. The monitoring agent of the child node sends a heartbeat packet to the management system on the master node every t seconds, maintaining the most basic communication between the master node and the child node and informing the master node that the child node is operating normally;
3. If the monitoring agent of the child node detects that CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, or network bandwidth utilization exceeds its threshold, that is, one or more of $w_1 > W_1$, $w_2 > W_2$, $w_3 > W_3$, $w_4 > W_4$, $w_5 > W_5$, $w_6 > W_6$ holds, it immediately sends this anomaly information to the master node, and the monitoring management system on the master node raises an alarm and notifies the administrator to perform maintenance.
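The threshold comparison that all three schemes share (report immediately if any reading is above its threshold) is small enough to show in full. In this sketch the metric keys, the function names, and the `send_to_master` callback are illustrative assumptions standing in for whatever transport an implementation uses.

```python
METRICS = ("cpu_util", "mem_util", "cache_miss", "cpu_temp", "virtual_mem", "net_util")


def exceeded(sample: dict, thresholds: dict) -> list:
    """Return the names of the metrics whose reading is above its threshold."""
    return [name for name in METRICS if sample[name] > thresholds[name]]


def on_sample(sample: dict, thresholds: dict, send_to_master) -> list:
    """Report immediately if anything is out of range; otherwise send nothing."""
    over = exceeded(sample, thresholds)
    if over:
        send_to_master({"anomalies": over, "sample": sample})
    return over
```

The comparison is strict, so a reading exactly equal to its threshold is not reported, matching the "greater than" conditions in the text.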
As shown in Fig. 1, the flow of the elastic monitoring method for mission-critical computer clusters is as follows:
(1) The master node obtains the monitoring data of the child nodes through their monitoring agents; go to step (2);
(2) The master node evaluates the load from the monitoring data obtained. If the load of a child node is greater than 70%, go to step (3); if it is less than or equal to 70% and greater than 30%, go to step (4); if it is less than or equal to 30%, go to step (5);
(3) The master node monitors the child node with the high-load monitoring scheme; go to step (6);
(4) The master node monitors the child node with the normal-load monitoring scheme; go to step (6);
(5) The master node monitors the child node with the low-load monitoring scheme; go to step (6);
(6) The management system on the master node determines whether any node shows an anomaly; if an anomaly is found, go to step (7); otherwise return to step (1);
(7) The management system on the master node sends an alarm containing the child node's anomaly information to the administrator by e-mail, text message, or similar means, then returns to step (1).
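For the e-mail variant of the alarm in step (7), one possible approach is to compose the message with Python's standard `email.message.EmailMessage` class. The function name `build_alert`, the subject format, and the example addresses are assumptions; the text does not prescribe a message layout, and the text-message channel is not covered here.

```python
from email.message import EmailMessage


def build_alert(node: str, anomalies: list, sender: str, recipient: str) -> EmailMessage:
    """Compose (but do not send) an alarm e-mail for one child node."""
    msg = EmailMessage()
    msg["Subject"] = f"[cluster-monitor] anomaly on {node}"
    msg["From"] = sender
    msg["To"] = recipient
    # One bullet per anomaly so the administrator can scan the list quickly.
    msg.set_content("Anomalies reported:\n" + "\n".join(f"- {a}" for a in anomalies))
    return msg
```

The returned message could then be handed to an SMTP client such as `smtplib.SMTP(...).send_message(msg)`; server address and authentication depend on the deployment and are left out of the sketch.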
In the elastic monitoring method for mission-critical computer clusters of the invention, a management system is first set up on the master node of the cluster and a monitoring agent is assigned to each child node; the monitoring agent of each child node collects the monitoring data of its own node and sends it to the management system on the master node. Based on the monitoring data received from the monitoring agents, the master node calculates the maximum load of each node in the cluster at the start of each monitoring period; the load level is expressed as 0 to 100%. The maximum load is the average of CPU utilization, memory utilization, and network bandwidth utilization. One of three monitoring schemes is then adopted according to the maximum load.
(1) Low-load monitoring scheme: if the maximum load of a child node is less than or equal to 30%, the low-load monitoring scheme is adopted. Under this scheme, the node is monitored as follows:
1. The monitoring agent of the child node checks whether a temporarily stored monitoring data file exists locally; if so, it packages the stored data, sends it to the management system on the master node, and deletes the temporary files that have been sent;
2. The management system instructs the monitoring agent of the child node to monitor the node comprehensively; the monitored data comprise 10 items: CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, network bandwidth utilization, CPU voltage, I/O status, disk status, and critical ports (ports 21, 23, 25, and 80);
3. If the monitoring agent of the child node detects that CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, or network bandwidth utilization exceeds its threshold, that is, one or more of $w_1 > W_1$, $w_2 > W_2$, $w_3 > W_3$, $w_4 > W_4$, $w_5 > W_5$, $w_6 > W_6$ holds, it immediately sends this anomaly information to the master node, and the monitoring management system on the master node raises an alarm and notifies the administrator to perform maintenance.
4. Based on the child node's monitoring data, the management system on the master node raises an alarm and notifies the administrator to perform timely maintenance when an I/O response times out, a disk read/write fault occurs, the CPU voltage exceeds its threshold, or a critical port is illegally occupied;
5. For CPU utilization ($w_1$), memory utilization ($w_2$), cache miss rate ($w_3$), CPU temperature ($w_4$), virtual memory ($w_5$), and network bandwidth utilization ($w_6$), the unhealthiness degree P of the child node is calculated with formula (1), where $W_1$, $W_2$, $W_3$, $W_4$, $W_5$, $W_6$ are 90%, 90%, 10%, 80 °C, 4000 MB, and 90%, the thresholds of CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization, respectively, and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$, $\lambda_5$, $\lambda_6$ are 0.25, 0.25, 0.1, 0.15, 0.1, and 0.15, respectively. If the unhealthiness degree P calculated with formula (1) satisfies P > 70, the node is in an unhealthy state and an alarm is raised to notify the administrator to perform maintenance.
(2) Normal-load monitoring scheme: if the maximum load of a child node is greater than 30% and less than or equal to 70%, the normal-load monitoring scheme is adopted. Under this scheme, the node is monitored as follows:
1. The monitoring agent of the child node checks whether a temporarily stored monitoring data file exists locally; if so, it packages the stored data, sends it to the management system on the master node, and deletes the temporary files that have been sent;
2. The management system instructs the monitoring agent of the child node to monitor the node comprehensively; the monitored data comprise 10 items: CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, network bandwidth utilization, CPU voltage, I/O status, disk status, and critical ports (ports 21, 23, 25, and 80);
3. If the monitoring agent of the child node detects that CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, or network bandwidth utilization exceeds its threshold, that is, one or more of $w_1 > W_1$, $w_2 > W_2$, $w_3 > W_3$, $w_4 > W_4$, $w_5 > W_5$, $w_6 > W_6$ holds, it immediately sends this anomaly information to the master node, and the monitoring management system on the master node raises an alarm and notifies the administrator to perform maintenance.
4. Based on the child node's monitoring data, the management system on the master node raises an alarm and notifies the administrator to perform timely maintenance when an I/O response times out, a disk read/write fault occurs, the CPU voltage exceeds its threshold, or a critical port is illegally occupied;
(3) High-load monitoring scheme: if the maximum load of a child node is greater than 70%, the high-load monitoring scheme is adopted. Under this scheme, the child node is monitored as follows:
1. Under the high-load monitoring scheme, the monitoring agent of the child node no longer reports local monitoring data in real time. It monitors the node, collecting 6 items: CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization, and stores the monitoring data temporarily in the local storage of the node;
2. The monitoring agent of the child node sends a heartbeat packet to the management system on the master node every t seconds, maintaining the most basic communication between the master node and the child node and informing the master node that the child node is operating normally;
3. If the monitoring agent of the child node detects that CPU utilization, memory utilization, cache miss rate, CPU temperature, virtual memory, or network bandwidth utilization exceeds its threshold, that is, one or more of $w_1 > W_1$, $w_2 > W_2$, $w_3 > W_3$, $w_4 > W_4$, $w_5 > W_5$, $w_6 > W_6$ holds, it immediately sends this anomaly information to the master node, and the monitoring management system on the master node raises an alarm and notifies the administrator to perform maintenance.
The elastic monitoring method for mission-critical computer clusters of the invention is described in detail below with reference to specific embodiments.
Embodiment 1:
As shown in Fig. 1, when the load of the mission-critical computer cluster is less than 70%, with CPU utilization $w_1$ = 68%, memory utilization $w_2$ = 65%, cache miss rate $w_3$ = 8%, CPU temperature $w_4$ = 70 °C, virtual memory $w_5$ = 3800 MB, and network bandwidth utilization $w_6$ = 80%, and with CPU voltage, I/O status, disk status, and critical ports all normal, the workflow is as follows:
(1) The master node obtains the monitoring data of the child nodes through their monitoring agents;
(2) The master node evaluates the load from the child node's monitoring data; the child node's load is less than 70%;
(3) At the start of each monitoring period of the child node, the monitoring agent checks whether a temporarily stored monitoring data file exists locally; if so, it packages the stored data, sends it to the management system on the master node, and deletes the temporary files that have been sent;
(4) According to the child node's monitoring data held by the management system on the master node, CPU utilization $w_1$, memory utilization $w_2$, cache miss rate $w_3$, CPU temperature $w_4$, virtual memory, and network bandwidth utilization are all within their normal ranges, and CPU voltage, I/O status, disk status, and critical ports are all normal;
(5) From the monitoring data (CPU utilization $w_1$ = 68%, memory utilization $w_2$ = 65%, cache miss rate $w_3$ = 8%, CPU temperature $w_4$ = 70 °C, virtual memory $w_5$ = 3800 MB, network bandwidth utilization $w_6$ = 80%), the unhealthiness degree of the child node is calculated with formula (1), where $W_1$, $W_2$, $W_3$, $W_4$, $W_5$, $W_6$ are 90%, 90%, 10%, 80 °C, 4000 MB, and 90%, respectively, and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$, $\lambda_5$, $\lambda_6$ are 0.25, 0.25, 0.1, 0.15, 0.1, and 0.15, respectively. The result is P' = 81 > 70, so the node is in an unhealthy state and an alarm is raised to notify the administrator to perform maintenance.
Embodiment 2:
As shown in Fig. 1, when the load of the mission-critical computer cluster is greater than 70% and, among the monitoring data, the CPU temperature is $w_4$ = 85 °C, which exceeds its threshold, while the other monitoring data are within their normal ranges, the workflow is as follows:
(1) The master node obtains the monitoring data of the child nodes through their monitoring agents;
(2) The master node evaluates the load of the child node from the monitoring data obtained; the load is greater than 70%;
(3) Under the high-load monitoring scheme, the monitoring agent of the child node no longer reports local monitoring data in real time but stores the monitoring data temporarily in the local storage of the node;
(4) The monitoring agent of the child node sends a heartbeat packet to the management system on the master node every 5 seconds, maintaining the most basic communication between the master node and the child node and informing the master node that the child node is operating normally;
(5) The monitoring agent of the child node detects that the CPU temperature $w_4$ = 85 °C > $W_4$ exceeds its threshold; it immediately sends this anomaly information to the master node, and the monitoring management system on the master node raises an alarm and notifies the administrator to perform maintenance.

Claims (4)

1. An elastic monitoring method for a mission-critical computer cluster, characterized by comprising the following steps:
Step 1: A management system is set up for the host node in the cluster, and a monitoring agent is assigned to each child node. The monitoring agent of each child node collects the monitoring data of its own node and sends it to the management system of the host node. The monitoring data comprise: CPU utilization, memory usage, cache miss rate, CPU temperature, virtual memory, network bandwidth utilization, CPU voltage, I/O state, disk state, and critical ports;
Step 2: At the start of each monitoring period, the management system of the host node calculates the maximum load of each node in the cluster from the monitoring data received from the monitoring agents. The maximum load of a node is the mean of its CPU utilization, memory usage, and network bandwidth utilization. If the maximum load of a child node is greater than 70%, go to Step 3; if it is greater than 30% and less than or equal to 70%, go to Step 4; if it is less than or equal to 30%, go to Step 5;
Step 3: The management system of the host node monitors the child node with the high-load monitoring method, then goes to Step 6;
Step 4: The management system of the host node monitors the child node with the normal-load monitoring method, then goes to Step 6;
Step 5: The management system of the host node monitors the child node with the low-load monitoring method, then goes to Step 6;
Step 6: Return to Step 1 until the task ends.
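The load evaluation and branching in Step 2 can be sketched as two small functions. This is an illustrative sketch only: the function names and the returned scheme labels are hypothetical, and all inputs are assumed to be percentages.

```python
def max_load(cpu_util: float, mem_usage: float, net_util: float) -> float:
    """Mean of CPU utilization, memory usage and network bandwidth
    utilization, all given in percent (Step 2 of claim 1)."""
    return (cpu_util + mem_usage + net_util) / 3.0


def select_scheme(load: float) -> str:
    """Map a child node's maximum load (percent) to a monitoring scheme."""
    if load > 70.0:
        return "high"      # Step 3: high-load monitoring method
    if load > 30.0:
        return "normal"    # Step 4: normal-load monitoring method
    return "low"           # Step 5: low-load monitoring method
```

Note that the boundaries are inclusive on the lower scheme: a load of exactly 70% selects the normal-load method and a load of exactly 30% selects the low-load method.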
2. The elastic monitoring method for a mission-critical computer cluster according to claim 1, characterized in that the high-load monitoring method is:
(1) The monitoring agent of the child node monitors the node. The monitoring data are CPU utilization, memory usage, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization, and are stored temporarily in the storage space of the local node;
(2) The monitoring agent of the child node sends a heartbeat packet to the management system of the host node every t seconds to inform the host node that this child node is operating normally;
(3) If the monitoring agent of the child node detects that one or more of CPU utilization, memory usage, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization exceed their thresholds, it immediately sends this abnormal-state information to the host node, and the management system of the host node raises an alarm to the administrator based on the abnormal information received from the child node.
3. The elastic monitoring method for a mission-critical computer cluster according to claim 1, characterized in that the normal-load monitoring method is:
(1) The monitoring agent of the child node checks whether a temporarily stored monitoring data file exists locally. If temporarily stored monitoring data exist, the agent packages them, sends them to the management system of the host node, and deletes the temporary files that have been sent;
(2) The management system instructs the monitoring agent of this child node to monitor the node comprehensively. The monitored data comprise: CPU utilization, memory usage, cache miss rate, CPU temperature, virtual memory, network bandwidth utilization, CPU voltage, I/O state, disk state, and critical ports;
(3) If the monitoring agent of the child node detects that one or more of CPU utilization, memory usage, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization exceed their thresholds, it immediately sends this abnormal-state information to the host node, and the management system of the host node raises an alarm to the administrator based on the abnormal information received from the child node;
(4) Based on the monitoring data of the child node, the management system of the host node determines whether any of the following abnormal conditions exists: I/O response timeout, disk read/write failure, CPU voltage exceeding its threshold, or a critical port being illegally occupied. If an abnormality exists, the management system of the host node raises an alarm to the administrator.
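Step (1) of this claim, flushing data that was buffered while the node was under high load, might look like the following. This is a hedged sketch under stated assumptions: the patent does not specify a file format or transport, so the JSON-lines buffer file, the function name `flush_buffered_samples`, and the `send` callable are all hypothetical.

```python
import json
import os
from typing import Callable, List


def flush_buffered_samples(path: str,
                           send: Callable[[List[dict]], None]) -> bool:
    """Send samples buffered during high load, then delete the buffer file.

    Assumes (hypothetically) one JSON object per line in the file.
    Returns True if a buffer file existed and was sent, False otherwise.
    """
    if not os.path.exists(path):
        return False                      # nothing was stored temporarily
    with open(path, "r", encoding="utf-8") as f:
        samples = [json.loads(line) for line in f if line.strip()]
    send(samples)                         # package and send as one batch
    os.remove(path)                       # reached only if send() did not raise
    return True
```

Deleting the file only after `send` returns means a failed transmission leaves the buffered data in place for the next monitoring period, which is one reasonable reading of "delete the temporary files that have been sent".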
4. The elastic monitoring method for a mission-critical computer cluster according to claim 1, characterized in that the low-load monitoring method is:
(1) The monitoring agent of the child node checks whether a temporarily stored monitoring data file exists locally. If temporarily stored monitoring data exist, the agent packages them, sends them to the management system of the host node, and deletes the temporary files that have been sent;
(2) The management system of the host node instructs the monitoring agent of the child node to monitor the child node comprehensively. The monitored data comprise: CPU utilization, memory usage, cache miss rate, CPU temperature, virtual memory, network bandwidth utilization, CPU voltage, I/O state, disk state, and critical ports;
(3) If the monitoring agent of the child node detects that one or more of CPU utilization, memory usage, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization exceed their thresholds, it immediately sends this abnormal-state information to the host node, and the management system of the host node raises an alarm to the administrator based on the abnormal information received from the child node;
(4) Based on the monitoring data of the child node, the management system of the host node determines whether any of the following abnormal conditions exists: I/O response timeout, disk read/write failure, CPU voltage exceeding its threshold, or a critical port being illegally occupied. If an abnormality exists, the management system of the host node raises an alarm to the administrator;
(5) The management system of the host node calculates the unhealthy degree P from the monitoring data of the child node:
$$P = \left( \lambda_1 \frac{w_1}{W_1} + \lambda_2 \frac{w_2}{W_2} + \lambda_3 \frac{w_3}{W_3} + \lambda_4 \frac{w_4}{W_4} + \lambda_5 \frac{w_5}{W_5} + \lambda_6 \frac{w_6}{W_6} \right) \times 100$$
where $w_1$ is the CPU utilization, $w_2$ the memory usage, $w_3$ the cache miss rate, $w_4$ the CPU temperature, $w_5$ the virtual memory, and $w_6$ the network bandwidth utilization; $W_1$, $W_2$, $W_3$, $W_4$, $W_5$, $W_6$ are the predefined thresholds of CPU utilization, memory usage, cache miss rate, CPU temperature, virtual memory, and network bandwidth utilization, respectively; and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$, $\lambda_5$, $\lambda_6$ are the weights of the respective monitoring data;
If the unhealthy degree is greater than the threshold, the management system of the host node raises an alarm to the administrator.
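The weighted sum in step (5) can be written as a short function. The sketch below is illustrative only: the function name and argument order are hypothetical, and the sample values and the alarm threshold of 70 are those used in Embodiment 1 of the description.

```python
from typing import Sequence


def unhealthy_degree(w: Sequence[float], W: Sequence[float],
                     lam: Sequence[float]) -> float:
    """P = 100 * sum(lam_i * w_i / W_i) over the monitored metrics."""
    if not (len(w) == len(W) == len(lam)):
        raise ValueError("w, W and lam must have the same length")
    return 100.0 * sum(l * x / t for l, x, t in zip(lam, w, W))


# Values from Embodiment 1: measurements, thresholds, weights.
w = [68, 65, 8, 70, 3800, 80]
W = [90, 90, 10, 80, 4000, 90]
lam = [0.25, 0.25, 0.1, 0.15, 0.1, 0.15]

P = unhealthy_degree(w, W, lam)
ALARM_THRESHOLD = 70  # value used in Embodiment 1
print(round(P), P > ALARM_THRESHOLD)  # 81 True
```

Because each term divides a measurement by its own threshold, the metrics are dimensionless before weighting, so percentages, degrees, and memory sizes can be combined in one score.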
CN201510419779.5A 2015-07-17 2015-07-17 Elastic monitoring method for key task computer cluster Pending CN105024880A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510419779.5A CN105024880A (en) 2015-07-17 2015-07-17 Elastic monitoring method for key task computer cluster

Publications (1)

Publication Number Publication Date
CN105024880A true CN105024880A (en) 2015-11-04

Family

ID=54414605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510419779.5A Pending CN105024880A (en) 2015-07-17 2015-07-17 Elastic monitoring method for key task computer cluster

Country Status (1)

Country Link
CN (1) CN105024880A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1756257A (en) * 2004-09-30 2006-04-05 北京航空航天大学 Host performance collection proxy in large-scale network
CN101631048A (en) * 2008-07-14 2010-01-20 中国移动通信集团河南有限公司 Method, device and system for monitoring managed object
CN102497292A (en) * 2011-11-30 2012-06-13 中国科学院微电子研究所 Computer cluster monitoring method and system thereof
CN103139007A (en) * 2011-12-05 2013-06-05 阿里巴巴集团控股有限公司 Method and system for detecting application server performance
CN103259682A (en) * 2013-05-16 2013-08-21 浪潮通信信息系统有限公司 Communication network element security evaluation method based on multidimensional data aggregation

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105681103A (en) * 2016-03-03 2016-06-15 山东超越数控电子有限公司 Loongson-chip-based cluster resource monitoring realization method
CN106027328A (en) * 2016-05-13 2016-10-12 深圳市中润四方信息技术有限公司 Cluster monitoring method and system based on application container deployment
CN106161140B (en) * 2016-06-28 2019-07-02 中国联合网络通信集团有限公司 Determine method, monitoring node and the group system of monitored node working condition
CN106161140A (en) * 2016-06-28 2016-11-23 中国联合网络通信集团有限公司 Determine method, monitor node and the group system of monitored node duty
CN106713423B (en) * 2016-12-06 2019-11-29 上海斐讯数据通信技术有限公司 The processing method and processing device of distributed data in a kind of cloud access base site controller
CN106713423A (en) * 2016-12-06 2017-05-24 上海斐讯数据通信技术有限公司 Distributed data processing method and device for cloud access point controller
CN108038043A (en) * 2017-12-22 2018-05-15 郑州云海信息技术有限公司 A kind of distributed storage cluster alarm method, system and equipment
CN108038043B (en) * 2017-12-22 2021-04-23 郑州云海信息技术有限公司 Distributed storage cluster warning method, system and equipment
CN109032890A (en) * 2018-07-23 2018-12-18 国云科技股份有限公司 A kind of mixing cloud data center large-size screen monitors monitoring method
CN109376043A (en) * 2018-10-18 2019-02-22 郑州云海信息技术有限公司 A kind of method and apparatus of equipment monitoring
CN110187838A (en) * 2019-05-30 2019-08-30 北京百度网讯科技有限公司 Data IO information processing method, analysis method, device and relevant device
CN111124829A (en) * 2019-12-22 2020-05-08 北京浪潮数据技术有限公司 Method for monitoring states of kubernetes computing nodes
CN114500096A (en) * 2022-02-28 2022-05-13 浪潮电子信息产业股份有限公司 Alarm method, system, equipment and computer readable storage medium
CN114500096B (en) * 2022-02-28 2023-10-10 浪潮电子信息产业股份有限公司 Alarm method, system, equipment and computer readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20151104
