CN103607297B - Fault processing method of computer cluster system

Info

Publication number: CN103607297B
Application number: CN201310548737.2A
Authority: CN
Inventors: 陈浩; 赵亚萍
Original assignee: Shanghai Eisoo Information Technology Co Ltd
Current assignee: Shanghai Eisoo Information Technology Co Ltd
Priority date: 2013-11-07
Filing date: 2013-11-07
Publication date: 2017-02-08
Anticipated expiration: 2033-11-07
Also published as: CN103607297A

Abstract

The invention discloses a fault handling method for a computer cluster system. The method comprises the following steps: (A) at least two nodes in the computer cluster system are selected and set as management nodes responsible for fault handling and for managing the computer cluster system, one of the management nodes serving as the master node and the others as standby nodes; (B) a bottom-layer monitoring service module on each node in the computer cluster system monitors the running state and the software and hardware load of that node and judges whether a fault has occurred, and if so, the bottom-layer monitoring service module notifies a message middleware service module to send a fault message to a management center service module of the master node; and (C) the management center service module of the master node handles the fault according to the fault message. With this technical scheme, faults in the computer cluster system can be handled automatically without human intervention.

Description

A fault handling method for a computer cluster system

Technical field

The present application relates to computer technology, in particular to computer cluster systems, and more particularly to a fault handling method for a computer cluster system.

Background art

With the advance of information technology, enterprises and other organizations alike depend increasingly on computer systems. As data volumes expand sharply, a single computer can no longer meet their needs, while adopting a supercomputer would greatly increase the cost of computing. Computer cluster technology emerged in response to this situation.

A computer cluster system is a group of loosely integrated computers, connected through software or hardware, that cooperate closely to complete computing work. The multiple computers that make up a computer cluster system can logically be regarded as a single computer. A single computer in a computer cluster system is usually called a node. The nodes can be connected through a local area network, and other connection modes are also supported. Computer cluster systems are commonly used to improve on the computing speed of a single computer and to balance data-flow load. Because of their high computing speed and low price, computer cluster systems are widely favored and have spread rapidly.

The number of nodes in a computer cluster system ranges from a few to hundreds or even thousands. When one or more nodes fail, the computing speed of the computer cluster system is usually affected, and in some cases none of the nodes can be used normally. For users, ensuring that the computer cluster system as a whole remains usable when any one node fails, without loss of computing speed, is therefore key to improving work efficiency and creating value.

The usual way of handling faults in a computer cluster system is for maintenance personnel to enter the machine room, locate the failed machine among the many nodes of the computer cluster system, determine the cause of the failure, and then carry out repairs. As the number of nodes grows, more maintenance personnel and more work may be needed, which is both costly and inefficient.

Summary of the invention

The present application provides a fault handling method for a computer cluster system that can handle computer cluster system faults automatically without manual intervention.

A fault handling method for a computer cluster system provided by an embodiment of the present application comprises:

A. Selecting at least two nodes in the computer cluster system and setting them as management nodes responsible for fault handling and for managing the computer cluster system, one of the management nodes serving as the master node and the rest as standby nodes;

B. The bottom-layer monitoring service module of each node in the computer cluster system monitors the running state and the software and hardware load of that node and judges whether a fault has occurred; if so, the bottom-layer monitoring service module notifies the message middleware service module of that node to send a fault message to the management center service module of the master node;

C. The management center service module of the master node handles the fault according to the fault message.

Preferably, the fault is that the memory, CPU, or system disk utilization of a node exceeds a predefined threshold;

Step C is: the management center service module of the master node reports the fault content to maintenance personnel.

Preferably, the fault is a hardware fault;

Step C is: the management center service module of the master node notifies the administrator of the identifier of the failed hardware and removes the faulty device from the computer cluster system.

Preferably, the failed node is an ordinary node and the fault is a software fault;

Step C is: the management center service module of the master node marks the state of that node with a predefined state value and notifies maintenance personnel of the specific fault information.

Preferably, the failed node is the master node and the fault is a software fault;

Step C is: a new master node is elected from the standby nodes to take over the work of the former master node.

Preferably, the method further comprises:

The computer cluster system detects through a heartbeat mechanism that a node is offline. If that node is the master node, a new master node is elected from the standby nodes to take over the work of the former master node, after which the former master node enters aging; if that node is an ordinary node, it enters aging directly;

After the aging period, all information about that node is deleted from the computer cluster system.

Preferably, every node of the computer cluster system sends its heartbeat messages to the message middleware service module on the master node, and the master node and the standby nodes collect and manage the heartbeat messages. If the time elapsed since the timestamp in the last heartbeat message received from a node exceeds a preset threshold and no new heartbeat message has arrived, the node that sent that heartbeat message is considered offline.

As can be seen from the above technical scheme, the message middleware and the single-node monitoring programs form a monitoring network that covers every node of the computer cluster system and monitors the service state and network state of each node in real time. When a node fault is found, the monitoring program on that node reports the fault information to the management center for unified handling. Faults in the computer cluster system are thus handled automatically without manual intervention, the computer cluster system remains usable after a node fails, the workload of maintenance personnel is reduced, and the fault tolerance of the computer cluster system is improved.

Brief description of the drawings

Fig. 1 is a schematic flowchart of a fault handling method for a computer cluster system provided by an embodiment of the present application;

Fig. 2 is a schematic diagram of the deployment process of the fault handling method for a computer cluster system provided by an embodiment of the present application.

Detailed description of the embodiments

To address the problems in the prior art, the present application provides a fault handling method for a computer cluster system in which faults are reported through a message mechanism and handled by designated nodes. Faults in the computer cluster system are thus handled automatically without manual intervention, the computer cluster system remains usable after a node fails, the workload of maintenance personnel is reduced, and the fault tolerance of the computer cluster system is improved.

The main design idea of the technical scheme is as follows. The message middleware and the single-node monitoring programs form a monitoring network that covers every node of the computer cluster system and monitors the service state and network state of each node in real time. When a node fault is found, the monitoring program on that node reports the fault information to the management center for unified handling. The node monitoring programs and the fault messages have standardized definitions, and the handling of each kind of fault follows a unified standard. The aim is to achieve high availability of the computer cluster system while saving cost, manpower, and material resources, and to keep the computer cluster system continuously available as long as no major accident occurs.

To make the technical principles, features, and technical effects of the technical scheme clearer, the technical scheme is described in detail below with reference to specific embodiments.

The flow of a fault handling method for a computer cluster system provided by an embodiment of the present application is shown in Fig. 1 and comprises:

Step 101: Select at least two nodes in the computer cluster system and set them as management nodes responsible for fault handling and for managing the computer cluster system, one of the management nodes serving as the master node and the rest as standby nodes;

Step 102: The bottom-layer monitoring service module of each node in the computer cluster system monitors the running state and the software and hardware load of that node and judges whether a fault has occurred; if so, the bottom-layer monitoring service module notifies the message middleware service module to send a fault message to the management center service module of the master node;

Step 103: The management center service module of the master node handles the fault according to the fault message.
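The flow of steps 102 and 103 can be illustrated with a minimal Python sketch. It is not the implementation described in this application: the class names `FaultMessage`, `MessageMiddleware`, `MonitorService`, and `ManagementCenter` are hypothetical, an in-process queue stands in for the real message middleware, and the `probe` callable is an assumed placeholder for whatever checks the bottom-layer monitoring service performs.

```python
import queue
import time
from dataclasses import dataclass, field


@dataclass
class FaultMessage:
    """Fault report; the text says it carries node information and fault time."""
    node_id: str
    fault_type: str
    detail: str
    timestamp: float = field(default_factory=time.time)


class MessageMiddleware:
    """In-process stand-in (assumption) for the message middleware service."""

    def __init__(self):
        self._master_inbox = queue.Queue()

    def send_to_master(self, message):
        self._master_inbox.put(message)

    def receive(self):
        # Return the next pending message, or None when the inbox is empty.
        try:
            return self._master_inbox.get_nowait()
        except queue.Empty:
            return None


class MonitorService:
    """Bottom-layer monitoring on one node (step 102)."""

    def __init__(self, node_id, middleware, probe):
        self.node_id = node_id
        self.middleware = middleware
        self.probe = probe  # hypothetical: returns (fault_type, detail) or None

    def check_once(self):
        fault = self.probe()
        if fault is None:
            return False
        fault_type, detail = fault
        self.middleware.send_to_master(
            FaultMessage(self.node_id, fault_type, detail))
        return True


class ManagementCenter:
    """Management center on the master node (step 103)."""

    def __init__(self, middleware):
        self.middleware = middleware
        self.handled = []

    def process_pending(self):
        # Drain the inbox; a real handler would act on each fault type here.
        count = 0
        while True:
            message = self.middleware.receive()
            if message is None:
                return count
            self.handled.append(message)
            count += 1
```

In this sketch each node would call `check_once()` periodically, while the master node calls `process_pending()` to collect and handle whatever has been reported.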

The scheme of this embodiment relies mainly on message middleware. A bottom-layer monitoring program monitors the condition of each node and reports any fault as soon as it is found, and designated nodes of the computer cluster system collect and handle the fault messages in a unified way. The invention requires the message middleware to be installed, together with the single-node monitoring service and the management center service that we developed for the computer cluster system; the operating system used is Linux. The fault handling system of this embodiment mainly involves four key parts: the message middleware service module, the bottom-layer monitoring service module, the management center service module, and the failover processing module.

The deployment process of the fault handling method for a computer cluster system provided by an embodiment of the present application is shown in Fig. 2 and comprises:

Step 201: Install and start the Linux system.

On each node of the computer cluster system, correctly install the required Linux system, configure it, and then start it.

Step 202: Install and start the message middleware service.

Correctly install and start the message middleware on each node of the computer cluster system, and make sure that it works properly and delivers messages accurately.

Step 203: Start the other services of the computer cluster system.

Correctly start the management center service module and the bottom-layer monitoring service module on all nodes of the computer cluster system. The bottom-layer monitoring service module is responsible for monitoring the running state and the software and hardware load of each node. The management center service module is responsible for processing messages, analyzing the type of each fault, and handling the fault according to its type.

Step 204: Configure the master and standby nodes.

Through an application programming interface (API) or the web interface of the operation and maintenance (O&M) software, select 2 or 3 nodes in the computer cluster system and set them as management nodes responsible for fault handling and for managing the computer cluster system, so that the computer cluster system works normally and supports failover. One of the selected management nodes is the master node and the rest are standby nodes. Correspondingly, the nodes of the computer cluster system other than the management nodes are called ordinary nodes.
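The role assignment of step 204 can be sketched as follows. This is only an illustration under assumptions: the function name `assign_roles` and the role labels are hypothetical, and the sketch enforces only the minimum of two management nodes, treating the figure of 2 or 3 as a deployment recommendation rather than a hard limit.

```python
def assign_roles(all_nodes, management_nodes, master):
    """Return a {node: role} mapping for the cluster.

    Roles are "master", "standby" (the remaining management nodes),
    and "ordinary" (every other node).
    """
    if len(set(management_nodes)) < 2:
        raise ValueError("at least two management nodes are required")
    unknown = set(management_nodes) - set(all_nodes)
    if unknown:
        raise ValueError(f"not cluster members: {sorted(unknown)}")
    if master not in management_nodes:
        raise ValueError("the master must be one of the management nodes")

    roles = {node: "ordinary" for node in all_nodes}
    for node in management_nodes:
        roles[node] = "standby"
    roles[master] = "master"
    return roles
```

Rejecting a configuration with fewer than two management nodes reflects the point of the design: without at least one standby node there is nothing to fail over to.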

After the above process, the computer cluster system is in its normal working state. If a fault occurs, the computer cluster system can respond quickly, handle the fault, and take over the failed node as needed, which ensures the high availability of the computer cluster system.

Several common fault types and the corresponding handling methods are given below:

System fault

System faults include, but are not limited to, excessive memory, CPU, or system disk utilization (the default threshold is 70% and can be configured to suit the actual situation). When the bottom-layer monitoring service module detects such a fault, it passes the fault information to the message middleware service module, which sends a fault message to the management center service module of the master node. The message contains information about the faulty node, the time of the fault, and so on.

Because such a fault does not affect the normal work of the master node, the management center service module of the master node informs maintenance personnel of the fault content by mail or other means, or the corresponding system indicators can be checked on the web page of the O&M software. The administrator does not need to enter the machine room to inspect the machine, which greatly simplifies the administrator's work.
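The threshold check described above can be sketched in Python. The function name `find_overloaded` and the metric keys are hypothetical; the 70% default comes from the text, and the sketch assumes the utilization figures (in percent) have already been collected by the monitoring service.

```python
DEFAULT_THRESHOLD = 70.0  # percent; the default stated in the text


def find_overloaded(usage, thresholds=None, default=DEFAULT_THRESHOLD):
    """Return the metrics whose utilization exceeds their threshold.

    usage:      {"memory": 82.5, "cpu": 40.0, ...} in percent
    thresholds: optional per-metric overrides of the default threshold
    """
    thresholds = thresholds or {}
    return {
        name: value
        for name, value in usage.items()
        if value > thresholds.get(name, default)
    }
```

A non-empty result would be the trigger for sending a fault message; a value exactly equal to the threshold is not treated as a fault here, since the text speaks of utilization exceeding the threshold.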

Device hardware fault

Device hardware faults include, but are not limited to, disk faults, RAID faults, and network card faults. When the bottom-layer monitoring service module detects such a fault, it passes the fault information to the message middleware service module, which sends a fault message to the management center service module of the master node. The management center service module handles the fault by notifying the administrator of the identifier of the failed hardware and removing the faulty device.

Ordinary node software fault

Software faults are faults in any of the software used by the computer cluster system, such as a message middleware fault, a management center service fault, or a bottom-layer monitoring service fault. This category mainly refers to a fault in a service that every node of the computer cluster system runs to serve that single node. In this case the node's state is marked with a predefined state value and maintenance personnel are informed of the specific fault information by mail or other means. Such a fault requires human intervention on the faulty node to repair it manually.

Management center software fault

Software faults are faults in any of the software used by the computer cluster system, such as a message middleware fault, a management center service fault, or a bottom-layer monitoring service fault. When the management center service module of the master node fails, the master node can no longer work normally, and a new master node must be elected from the standby nodes according to some rule (for example node load, or the smallest-IP rule) to take over the work of the former master node, that is, providing services externally and management internally. Similarly, a standby node that fails or goes offline is taken over by another standby node. This process is called automatic switching of management nodes.

An example of how automatic switching of management nodes can be implemented is as follows: a standby node learns through the message mechanism that the master node has failed or gone offline. The standby node starts the election mechanism, learns from the database that it is the node with the smallest IP address, takes over the work previously performed by the master node, and becomes the new master node.
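The smallest-IP election rule from this example can be sketched in Python. The function name `elect_master` is hypothetical, and the sketch assumes the standby node addresses have already been read from the database and that all of them belong to the same IP version.

```python
import ipaddress


def elect_master(standby_ips, offline=()):
    """Pick the standby node with the numerically smallest IP address.

    standby_ips: address strings of the standby management nodes
    offline:     addresses known to be failed or offline (skipped)
    """
    unavailable = set(offline)
    candidates = [ip for ip in standby_ips if ip not in unavailable]
    if not candidates:
        raise ValueError("no standby node is available for election")
    # Compare as addresses, not strings: as text, "192.168.1.100"
    # would sort before "192.168.1.3".
    return min(candidates, key=ipaddress.ip_address)
```

Because every standby node applies the same deterministic rule to the same list, each one can decide locally whether it is the winner, which matches the description of a node learning that it is the smallest-IP node.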

When the above fault occurs, a failover must be performed to ensure the high availability of the computer cluster system. The switching process requires no manual intervention and is monitored automatically throughout; the administrator can follow it on the web O&M page. The fault is discovered quickly, and the switching process is brief and does not affect normal use of the computer cluster system.
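The handling rules for the four fault categories described so far can be summarized as a lookup table. The following Python sketch is illustrative only: the `FaultType` enumeration and the action names are hypothetical labels for the steps the text describes, not identifiers from the application.

```python
from enum import Enum


class FaultType(Enum):
    SYSTEM = "system"                    # memory/CPU/system disk over threshold
    HARDWARE = "hardware"                # disk, RAID, network card, ...
    NODE_SOFTWARE = "node_software"      # software fault on an ordinary node
    MASTER_SOFTWARE = "master_software"  # management center fault on the master


# Action names are placeholders for the handling steps in the text.
HANDLING_PLAN = {
    FaultType.SYSTEM: ["notify_maintainer"],
    FaultType.HARDWARE: ["notify_hardware_id", "remove_device"],
    FaultType.NODE_SOFTWARE: ["mark_node_state", "notify_maintainer"],
    FaultType.MASTER_SOFTWARE: ["elect_new_master"],
}


def plan_actions(fault_type):
    """Return the ordered list of actions for a given fault type."""
    return list(HANDLING_PLAN[fault_type])
```

Keeping the mapping in one table mirrors the stated goal of a unified standard for handling each kind of fault; only the master software fault leads to a management node switch.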

Node offline

This kind of fault mainly covers situations such as a node losing power or network connectivity. The computer cluster system detects that the node is offline through the heartbeat mechanism implemented by the message middleware. If the node is the master node, it enters aging after the automatic master node switch; if it is an ordinary node, it enters aging directly. After the aging period, all information about the node is deleted from the whole computer cluster system. That is, the node is no longer a node of the computer cluster system and no longer undertakes any work of the computer cluster system. The heartbeat mechanism in this embodiment is as follows: every node of the computer cluster system sends its heartbeat messages to the message middleware module on the master node, and the master node and the standby nodes collect and manage the heartbeat messages. If the time elapsed since the timestamp in the last heartbeat message received from a node exceeds a preset threshold and no new heartbeat message has arrived, the node that sent that heartbeat message is considered offline.
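The heartbeat timeout and aging logic can be sketched in Python. The class `HeartbeatTracker` and its parameters are hypothetical, timestamps are plain numbers in seconds, and the sketch assumes that the aging period begins at the moment a node crosses the offline threshold, a detail the text does not specify.

```python
class HeartbeatTracker:
    """Tracks the last heartbeat timestamp of every node."""

    def __init__(self, offline_after, aging_period):
        self.offline_after = offline_after  # seconds of silence => offline
        self.aging_period = aging_period    # further seconds before deletion
        self.last_seen = {}

    def record(self, node_id, timestamp):
        # Store the timestamp carried by the node's latest heartbeat message.
        self.last_seen[node_id] = timestamp

    def offline_nodes(self, now):
        """Nodes whose last heartbeat is older than the offline threshold."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.offline_after
        )

    def purge_aged(self, now):
        """Delete and return nodes that have also outlived the aging period."""
        limit = self.offline_after + self.aging_period
        expired = sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > limit
        )
        for node in expired:
            del self.last_seen[node]
        return expired
```

A node that resumes sending heartbeats during the aging period simply refreshes its timestamp in this sketch and is never purged, which is one plausible reading of why an aging period exists at all.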

The invention achieves the following effects:

1. Because fault handling in the computer cluster system is implemented with a message mechanism, node faults are reported promptly and accurately and can be handled according to their type. Both hardware faults and software faults receive a rapid response, which considerably reduces the maintenance burden on the administrator;

2. The multiple nodes of the computer cluster system are managed in a unified way, and operations such as load balancing and data distribution are carried out centrally by the master node, which substantially improves the efficiency of the computer cluster system. The more nodes the computer cluster system has, the more pronounced this advantage becomes;

3. Fault handling in the computer cluster system is in most cases executed automatically by programs, without manual intervention, without affecting normal operation of the computer cluster system, and without complex configuration or extra tools. The scheme is therefore easy to operate and easy to maintain;

4. The invention is applicable not only to server platforms of different brands but also to various virtual machines, and therefore adapts well to different hardware platforms. Thanks to the message middleware, message delivery is highly reliable, which ensures accurate switching in the computer cluster system; the switching time is short and does not affect normal use of the computer cluster system; and the Linux system is highly stable, which reduces the impact on customer business when the computer cluster system is maintained.

The foregoing is only a preferred embodiment of the present application and is not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the technical scheme shall fall within the scope of protection of the present application.

Claims

1. A fault handling method for a computer cluster system, characterized by comprising:

A. Selecting at least two nodes in the computer cluster system and setting them as management nodes responsible for fault handling and for managing the computer cluster system, one of the management nodes serving as the master node and the rest as standby nodes;

B. The bottom-layer monitoring service module of each node in the computer cluster system monitors the running state and the software and hardware load of that node and judges whether a fault has occurred; if so, the bottom-layer monitoring service module notifies the message middleware service module on that node to send a fault message to the management center service module of the master node;

2. The method according to claim 1, characterized in that the fault is that the memory, CPU, or system disk utilization of a node exceeds a predefined threshold;

3. The method according to claim 1, characterized in that the fault is a hardware fault;

Step C is: the management center service module of the master node notifies the administrator of the identifier of the failed hardware and removes the faulty device from the computer cluster system.

4. The method according to claim 1, characterized in that the failed node is an ordinary node and the fault is a software fault;

Step C is: the management center service module of the master node marks the state of that node with a predefined state value and notifies maintenance personnel of the specific fault information.

5. The method according to claim 1, characterized in that the failed node is the master node and the fault is a software fault;

6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:

The computer cluster system detects through a heartbeat mechanism that a node is offline. If that node is the master node, a new master node is elected from the standby nodes to take over the work of the former master node, after which the former master node enters aging; if that node is an ordinary node, it enters aging directly;

After the aging period, all information about that node is deleted from the computer cluster system.

7. The method according to claim 6, characterized in that the heartbeat mechanism is: every node of the computer cluster system sends its heartbeat messages to the message middleware service module on the master node, and the master node and the standby nodes collect and manage the heartbeat messages; if the time elapsed since the timestamp in the last heartbeat message received from a node exceeds a preset threshold and no new heartbeat message has arrived, the node that sent that heartbeat message is considered offline.