CN107147540A - Fault handling method and fault handling cluster in a high-availability system - Google Patents

Fault handling method and fault handling cluster in a high-availability system

Info

Publication number
CN107147540A
Authority
CN
China
Prior art keywords
node
cluster
failure
fault handling
management object
Prior art date: 2017-07-19
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710589299.2A
Other languages
Chinese (zh)
Inventor
杨勇
亓开元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2017-07-19
Filing date: 2017-07-19
Publication date: 2017-09-08
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201710589299.2A
Publication of CN107147540A
Legal status: Pending (current)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06 Management of faults, events, alarms or notifications
    • H04L 41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L 41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00 Arrangements for monitoring or testing data switching networks
    • H04L 43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L 43/0823 Errors, e.g. transmission errors

Abstract

The invention discloses a fault handling method and a fault handling cluster in a high-availability system. Each node in the fault handling cluster includes: an acquisition module for obtaining the management objects of the working cluster in the high-availability system, where a management object is a node of the working cluster; a monitoring module for monitoring the running state of the management objects according to a preset monitoring policy; and a node management module for notifying a failed node to go offline when a node among the management objects can no longer process services because of a fault.

Description

Fault handling method and fault handling cluster in a high-availability system
Technical field
The present invention relates to the field of communication technology, and in particular to a fault handling method and a fault handling cluster in a high-availability system.
Background art
A high-availability (HA) cluster is an effective solution for ensuring business continuity. It typically has two or more nodes, divided into active nodes and standby nodes. The node that is executing services is called the active node, and a node that serves as its backup is called a standby node. When an active node fails and the running services (tasks) can no longer operate normally, a standby node detects this and immediately takes over execution of the services in place of the active node, so that services are not interrupted, or are interrupted only briefly.
However, in an HA system, when the connection between two nodes is broken, an HA system that originally acted as a coordinated whole splits into two independent parts. Having lost contact with each other, each side assumes that the other has failed. The HA software on the two nodes then competes for shared resources and fights over the application services like a "split brain", with serious consequences: the shared resources may be carved up so that the services on neither side can start, or the services on both nodes start and read and write the shared storage concurrently, corrupting data, for example corrupting HDFS file system metadata.
Therefore, in an HA system, when the connection between two nodes is broken, how to manage the nodes in the cluster so that services keep running normally is an urgent problem to be solved.
Summary of the invention
To solve the above technical problem, the invention provides a fault handling method and a fault handling cluster in a high-availability system that can prevent split-brain in a high-availability cluster.
To achieve the object of the invention, the invention provides a fault handling cluster in a high-availability system, in which each node includes:
an acquisition module, for obtaining the management objects of the working cluster in the high-availability system, where a management object is a node of the working cluster;
a monitoring module, for monitoring the running state of the management objects according to a preset monitoring policy;
a node management module, for notifying a failed node to go offline when a node among the management objects can no longer process services because of a fault.
The fault handling cluster has 2N+1 nodes, where one node is the master node, the remaining nodes are slave nodes, and N is a positive integer. The master node further includes:
a sending module, for notifying the slave nodes to select, from the working cluster, a node to continue the work in place of the failed node;
a determining module, for determining, according to the selection results of the slave nodes and the master node, the node that continues the work in place of the failed node.
Each node includes:
an election module, for selecting, from the working cluster, the node that continues the work in place of the failed node, and sending the selection result to the master node.
The node management module includes:
an acquiring unit, for obtaining the IP address of the baseboard management controller (BMC) on the failed node;
a transmitting unit, for sending a power-off instruction to the BMC of the failed node according to the IP address of the BMC on the failed node.
Each node may further include:
an alarm module, for outputting fault description information of the failed node.
Each node may further include:
a policy management module, for updating the monitoring policy according to a received monitoring policy update request and sending the updated monitoring policy to the monitoring module.
The invention also provides a fault handling method in a high-availability system, including:
a fault handling cluster obtaining the management objects corresponding to each node of the working cluster in the high-availability system, where a management object is a node of the working cluster;
monitoring the running state of the management objects according to a preset monitoring policy;
when a node among the management objects can no longer process services because of a fault, notifying the failed node to go offline.
After notifying the failed node to go offline, the method further includes:
notifying each node to select, from the working cluster, a node to continue the work in place of the failed node;
receiving the selection result sent by each node;
determining, according to the selection results, the node that continues the work in place of the failed node, where the fault handling cluster has 2N+1 nodes, one node is the master node, the remaining nodes are slave nodes, and N is a positive integer.
Notifying the failed node to go offline includes:
obtaining the IP address of the baseboard management controller (BMC) on the failed node;
sending a power-off instruction to the BMC of the failed node according to the IP address of the BMC on the failed node.
After notifying the failed node to go offline, the method further includes:
outputting fault description information of the failed node.
The method further includes:
after receiving a monitoring policy update request, updating the monitoring policy according to the update request and sending the updated monitoring policy to the monitoring module.
In the embodiments provided by the invention, fault diagnosis is performed on the cluster nodes; when the heartbeat of a node in the cluster is lost, a power-off operation is performed on the faulty node, ensuring that the faulty node is completely shut down and preventing split-brain in the high-availability cluster.
Other features and advantages of the invention will be set forth in the following description, and in part will become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims and the accompanying drawings.
Brief description of the drawings
The accompanying drawings provide a further understanding of the technical solution of the invention and constitute a part of the description. Together with the embodiments of the application they serve to explain the technical solution of the invention, and do not limit it.
Fig. 1 is a structural diagram of a node in the fault handling cluster of the high-availability system provided by the invention;
Fig. 2 is a structural diagram of the IPMI-based high-availability cluster system provided by an application example of the invention;
Fig. 3 is a flow chart of the fault handling method under the system of Fig. 2;
Fig. 4 is a flow chart of the fault handling method in the high-availability system provided by the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the invention clearer, the embodiments of the invention are described in detail below with reference to the accompanying drawings. It should be noted that, where there is no conflict, the embodiments in the application and the features in the embodiments may be combined with one another.
The steps illustrated in the flow charts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one given here.
Fig. 1 is a structural diagram of a node in the fault handling cluster of the high-availability system provided by the invention. Each node in the fault handling cluster shown in Fig. 1 includes:
an acquisition module 101, for obtaining the management objects of the working cluster in the high-availability system, where a management object is a node of the working cluster;
a monitoring module 102, for monitoring the running state of the management objects according to a preset monitoring policy;
a node management module 103, for notifying a failed node to go offline when a node among the management objects can no longer process services because of a fault.
In the fault handling cluster provided by the invention, fault diagnosis is performed on the cluster nodes; when the heartbeat of a node in the cluster is lost, a power-off operation is performed on the faulty node, ensuring that the faulty node is completely shut down and preventing split-brain in the high-availability cluster.
The fault handling cluster provided by the invention is explained below.
After a faulty node is detected in the working cluster, the services that the faulty node was processing still need to be executed, so the fault handling cluster must select a node from the working cluster to work in place of the failed node.
To ensure that the fault handling cluster can select a suitable node as quickly as possible, the number of nodes in the fault handling cluster is 2N+1, where one node is the master node, the remaining nodes are slave nodes, and N is a positive integer:
the master node notifies the slave nodes to select, from the working cluster, a node to continue the work in place of the failed node;
after receiving the notification, each slave node selects from the working cluster a node to continue the work of the failed node, and sends its selection result to the master node;
according to the selection results of the slave nodes and its own, the master node selects the node that continues the work in place of the failed node.
The master node may select the working-cluster node that received the most votes as the node that continues the work of the failed node.
Each node in the working cluster carries a baseboard management controller (BMC) chip and a management network, and the power-off operation of a node is implemented through the Intelligent Platform Management Interface (IPMI) protocol. The vast majority of current server motherboards carry a BMC chip and a BMC network port. The BMC chip works independently of the server's processor, BIOS and operating system; it is an extremely autonomous, agentless management subsystem running separately within the server, and it starts working as soon as the server has power. This autonomy of the BMC overcomes the limitations of operating-system-based management: operations such as power-on/power-off and information retrieval can still be performed when the operating system is unresponsive or not loaded. Because the management network is networked independently, its stability and reachability are generally better than those of the service network, so IPMI operations on a remote server are reliable under non-extreme conditions. This ensures that a node is completely shut down when a failure occurs, preventing split-brain.
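For illustration only, such an IPMI power-off can be issued with the standard ipmitool client from any machine that can reach the management network; the BMC address and credentials below are placeholders:

    # Check and then force off the chassis power of a faulty node via its BMC.
    # -I lanplus selects IPMI over LAN; -H/-U/-P give the BMC address and
    # credentials of the faulty node (placeholder values).
    ipmitool -I lanplus -H 192.168.100.21 -U admin -P secret chassis power status
    ipmitool -I lanplus -H 192.168.100.21 -U admin -P secret chassis power off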
Specifically, the node management module includes:
an acquiring unit, for obtaining the IP address of the BMC on the failed node;
a transmitting unit, for sending a power-off instruction to the BMC of the failed node according to the IP address of the BMC on the failed node.
Optionally, each node further includes:
an alarm module, for outputting fault description information of the failed node.
For example, an audible alarm prompt may be emitted, or alarm prompt information may be output on a display screen. The fault description information may describe the identification of the node and the anomaly that occurred on it.
Of course, the monitoring policy can be updated dynamically: users can define their own policies and modify them dynamically according to the needs of the business. Each node therefore further includes:
a policy management module, for updating the monitoring policy according to a received monitoring policy update request and sending the updated monitoring policy to the monitoring module.
In practical applications, node state monitoring scripts and node control scripts with rigorous logic make it possible to automatically shut down and isolate a faulty node after a node of the HA cluster fails, compensating for the problem that native HA software cannot guarantee the isolation of faulty nodes and therefore easily produces split-brain. By continuously accumulating monitoring methods, a comprehensive fault judgment over multiple conditions can be reached: performance information such as CPU utilization and memory utilization, whether the Java processes are normal, whether the application processes on a node are normal, and similar conditions can all be added to the watch list. A forced power-off operation is then performed on the faulty node through the IPMI protocol, avoiding the split-brain that often occurs in HA schemes and achieving flexible customization and highly reliable operation of the HA scheme.
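As a minimal sketch of such a multi-condition judgment (the thresholds and process name below are illustrative assumptions, not values given in the patent), a monitoring script might combine several probes:

    # Illustrative composite health check: probe CPU idle time, available
    # memory and a java process; any failed condition marks the node unhealthy.
    cpu_idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')        # % CPU idle, 2nd sample
    mem_avail=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)  # available memory in kB
    if [ "$cpu_idle" -lt 5 ] || [ "$mem_avail" -lt 102400 ] || ! pgrep -x java >/dev/null; then
        echo "node unhealthy"
        exit 1
    fi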
The invention is illustrated below with a concrete application example.
The invention uses pacemaker as the controller for fault detection and recovery. Pacemaker can manage a cluster by itself: pacemaker OCF resources are created on the cluster, and pacemaker itself guarantees the election, migration and high availability of the OCF resources within this cluster. The OpenStack monitor is therefore implemented as an OCF script and is itself highly available.
Pacemaker resources mainly fall into two classes, LSB (Linux Standard Base) and OCF (Open Cluster Framework). LSB resources are generally the scripts under the /etc/init.d directory, which pacemaker can use to start and stop services. OCF resources extend LSB services with high-availability functions such as failure monitoring and with richer meta-information, and serve here as the concrete means of fault detection. By implementing an OCF resource, pacemaker can provide a very good high-availability guarantee for a service.
The anti-split-brain implementation of the IPMI-based high-availability cluster provided by the invention is as follows.
Fig. 2 is a structural diagram of the IPMI-based high-availability cluster system provided by an application example of the invention. The system is deployed as follows:
select at least 3 nodes of the high-availability system and install the pacemaker cluster software and ipmitool on them, where ipmitool is used for remote node management; 3 nodes are chosen so that the voting during pacemaker resource election can produce a majority;
the nodes on which the pacemaker software has been installed authenticate one another and are configured into a whole, completing the creation of the pacemaker cluster;
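One common way to perform these two steps, assuming the pcs front end that usually accompanies pacemaker (node names, cluster name and password are placeholders):

    # Authenticate the three nodes to one another, assemble them into a single
    # pacemaker cluster and start it (pcs 0.9-style syntax).
    pcs cluster auth node1 node2 node3 -u hacluster -p <password>
    pcs cluster setup --name ha_cluster node1 node2 node3
    pcs cluster start --all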
create the OCF script nodeMonitor for automatically monitoring node state, and upload it to the /usr/lib/ocf/resource.d/openstack/ directory on each node of the pacemaker cluster, where the OCF script can start and stop a node, check its state, monitor resources, and so on; writing the node state monitoring script in a form that conforms to the OCF script specification makes flexible use of pacemaker's own resource high availability, timed scheduling and resource management, as well as of the large number of existing Linux OCF scripts, so that the monitoring program itself is flexible and highly available;
of course, a variety of monitoring conditions can be implemented in the custom monitor method inside the OCF script and customized flexibly to the business, so that once the configured conditions are met, evacuation of the virtual machines is triggered. The methods include, but are not limited to: checking the state of the service network card on the node; checking the state of specified application processes on the node; checking the node's performance data, such as CPU utilization and memory utilization; and checking the state of the storage space on the node.
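The patent names nodeMonitor but does not publish its contents; the skeleton below only sketches the general shape of an OCF-compliant agent with a custom monitor method (the probes, network interface and process name are illustrative assumptions):

    #!/bin/sh
    # nodeMonitor - illustrative OCF agent skeleton; the real probes would go
    # in monitor(). Sources the standard OCF helper functions and return codes.
    . ${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat/ocf-shellfuncs

    monitor() {
        # Example probes: service NIC is up, application process is alive.
        ip link show eth1 | grep -q "state UP" || return $OCF_ERR_GENERIC
        pgrep -f nova-compute >/dev/null       || return $OCF_ERR_GENERIC
        return $OCF_SUCCESS
    }

    case "$1" in
        start|monitor) monitor ;;
        stop)          exit $OCF_SUCCESS ;;
        meta-data)     echo '<resource-agent name="nodeMonitor"/>' ;;  # stub; a real agent emits full XML
        *)             exit $OCF_ERR_UNIMPLEMENTED ;;
    esac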
create the script nodeController, which performs the actual node power-off operation, and upload it to a directory in which the pacemaker user has permission to operate, such as /usr/lib/myScript/; this script takes the node's BMC IP address, account and password as input;
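The nodeController script itself is likewise not disclosed; a sketch matching the described interface (BMC IP address, account and password as inputs) could simply wrap ipmitool:

    #!/bin/sh
    # nodeController - illustrative sketch: force the faulty node off through
    # its BMC, then report the chassis power state for verification.
    BMC_IP="$1"; BMC_USER="$2"; BMC_PASS="$3"

    ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -P "$BMC_PASS" chassis power off
    ipmitool -I lanplus -H "$BMC_IP" -U "$BMC_USER" -P "$BMC_PASS" chassis power status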
create a pacemaker resource from the OCF script. A pacemaker resource is equivalent to a service instance whose execution and state are guaranteed and monitored by the pacemaker cluster; each resource may itself be started on any node of the pacemaker cluster after an election, and the pacemaker framework performs the corresponding actions according to the logic defined inside the resource, for example the execution interval and timeout of the monitor operation defined in the OCF meta tags.
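Assuming the agent was installed under the openstack provider directory as described above, such a resource could be created with pcs as follows (the interval and timeout values are illustrative):

    # Register nodeMonitor as an OCF resource; pacemaker will then run its
    # monitor action every 30 s with a 60 s timeout, per the op line.
    pcs resource create nodeMonitor ocf:openstack:nodeMonitor \
        op monitor interval=30s timeout=60s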
Fig. 3 is a flow chart of the fault handling method under the system of Fig. 2. The method shown in Fig. 3 includes:
after the system starts, pacemaker executes the monitor method of the script at a set time interval; this method can be customized by business personnel and decides whether a node has failed, for example by pinging an IP address, accessing a particular business interface, making an SSH connection, or testing a database connection;
when the nodeMonitor script judges that a node has failed, ipmitool is called to control the power of the node remotely over the existing BMC management network and power off the remote faulty node, with no need for additional equipment; at the same time, the faulty node is marked invalid through pacemaker's attrd_updater interface, triggering the pacemaker master node election process.
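A hedged sketch of the two actions in this step, reusing the nodeController sketch from above and pacemaker's attrd_updater (the attribute name node_valid and the node name are illustrative assumptions):

    # 1) Fence the faulty node through its BMC.
    /usr/lib/myScript/nodeController "$BMC_IP" "$BMC_USER" "$BMC_PASS"
    # 2) Mark the node invalid in pacemaker's attribute store so the cluster
    #    reacts to the change, e.g. by re-electing where resources run.
    attrd_updater --name node_valid --update 0 --node node2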
As can be seen from the above, the application example of the invention provides an IPMI-based anti-split-brain scheme for high-availability clusters. By using the open-source pacemaker as the fault detection center of the HA cluster, using OCF scripts as the monitoring method, probing node faults by multiple means, and isolating faulty nodes through the BMC management interface provided by the server motherboard itself, faulty nodes can be separated from the HA cluster quickly and reliably, the integrity of the data of running distributed services is well protected, and split-brain is prevented. This makes up for the incomplete isolation of faulty nodes in current HA solutions, monitors the running of the business more completely, and greatly protects the integrity of the business data.
Fig. 4 is a flow chart of the fault handling method in the high-availability system provided by the invention. The method shown in Fig. 4 includes:
step 401: obtain the management objects corresponding to each node of the working cluster in the high-availability system, where a management object is a node of the working cluster;
step 402: monitor the running state of the management objects according to a preset monitoring policy;
step 403: when a node among the management objects can no longer process services because of a fault, notify the failed node to go offline.
After notifying the failed node to go offline, the method further includes:
notifying each node to select, from the working cluster, a node to continue the work in place of the failed node;
receiving the selection result sent by each node;
determining, according to the selection results, the node that continues the work in place of the failed node, where the fault handling cluster has 2N+1 nodes, one node is the master node, the remaining nodes are slave nodes, and N is a positive integer.
Notifying the failed node to go offline includes:
obtaining the IP address of the baseboard management controller (BMC) on the failed node;
sending a power-off instruction to the BMC of the failed node according to the IP address of the BMC on the failed node.
After notifying the failed node to go offline, the method further includes:
outputting fault description information of the failed node.
Optionally, the method further includes:
after receiving a monitoring policy update request, updating the monitoring policy according to the update request and sending the updated monitoring policy to the monitoring module.
In the fault handling method provided by the invention, fault diagnosis is performed on the cluster nodes; when the heartbeat of a node in the cluster is lost, a power-off operation is performed on the faulty node, ensuring that the faulty node is completely shut down and preventing split-brain in the high-availability cluster.
Although the embodiments are disclosed above, the content described is only an embodiment adopted to facilitate understanding of the invention and is not intended to limit it. Any person skilled in the art to which the invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention shall still be subject to the scope defined by the appended claims.

Claims (10)

1. A fault handling cluster in a high-availability system, characterized in that each node in the fault handling cluster comprises:
an acquisition module, for obtaining the management objects of the working cluster in the high-availability system, wherein a management object is a node of the working cluster;
a monitoring module, for monitoring the running state of the management objects according to a preset monitoring policy;
a node management module, for notifying a failed node to go offline when a node among the management objects can no longer process services because of a fault.
2. The fault handling cluster according to claim 1, characterized in that the fault handling cluster has 2N+1 nodes, wherein one node is a master node, the remaining nodes are slave nodes, and N is a positive integer; wherein:
the master node further comprises:
a sending module, for notifying the slave nodes to select, from the working cluster, a node to continue the work in place of the failed node;
a determining module, for determining, according to the selection results of the slave nodes and the master node, the node that continues the work in place of the failed node;
wherein each node comprises:
an election module, for selecting, from the working cluster, the node that continues the work in place of the failed node, and sending the selection result to the master node.
3. The fault handling cluster according to claim 1, characterized in that the node management module comprises:
an acquiring unit, for obtaining the IP address of the baseboard management controller (BMC) on the failed node;
a transmitting unit, for sending a power-off instruction to the BMC of the failed node according to the IP address of the BMC on the failed node.
4. The fault handling cluster according to any one of claims 1 to 3, characterized in that each node further comprises:
an alarm module, for outputting fault description information of the failed node.
5. The fault handling cluster according to any one of claims 1 to 3, characterized in that each node further comprises:
a policy management module, for updating the monitoring policy according to a received monitoring policy update request and sending the updated monitoring policy to the monitoring module.
6. A fault handling method in a high-availability system, characterized by comprising:
a fault handling cluster obtaining the management objects corresponding to each node of the working cluster in the high-availability system, wherein a management object is a node of the working cluster;
monitoring the running state of the management objects according to a preset monitoring policy;
when a node among the management objects can no longer process services because of a fault, notifying the failed node to go offline.
7. The method according to claim 6, characterized in that after notifying the failed node to go offline, the method further comprises:
notifying each node to select, from the working cluster, a node to continue the work in place of the failed node;
receiving the selection result sent by each node;
determining, according to the selection results, the node that continues the work in place of the failed node, wherein the fault handling cluster has 2N+1 nodes, one node is a master node, the remaining nodes are slave nodes, and N is a positive integer.
8. The method according to claim 6, characterized in that notifying the failed node to go offline comprises:
obtaining the IP address of the baseboard management controller (BMC) on the failed node;
sending a power-off instruction to the BMC of the failed node according to the IP address of the BMC on the failed node.
9. The method according to any one of claims 6 to 8, characterized in that after notifying the failed node to go offline, the method further comprises:
outputting fault description information of the failed node.
10. The method according to any one of claims 6 to 8, characterized in that the method further comprises:
after receiving a monitoring policy update request, updating the monitoring policy according to the update request and sending the updated monitoring policy to the monitoring module.
CN201710589299.2A 2017-07-19 2017-07-19 Fault handling method and fault handling cluster in a high-availability system Pending CN107147540A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710589299.2A CN107147540A (en) 2017-07-19 2017-07-19 Fault handling method and fault handling cluster in a high-availability system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710589299.2A CN107147540A (en) 2017-07-19 2017-07-19 Fault handling method and fault handling cluster in a high-availability system

Publications (1)

Publication Number Publication Date
CN107147540A true CN107147540A (en) 2017-09-08

Family

ID=59776469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710589299.2A Pending CN107147540A (en) 2017-07-19 Fault handling method and fault handling cluster in a high-availability system

Country Status (1)

Country Link
CN (1) CN107147540A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541802A (en) * 2010-12-17 2012-07-04 伊姆西公司 Methods and equipment for identifying object based on quorum resource quantity of object
CN103457771A (en) * 2013-08-30 2013-12-18 杭州华三通信技术有限公司 Method and device for HA virtual machine cluster management
CN103905247A (en) * 2014-03-10 2014-07-02 北京交通大学 Two-unit standby method and system based on multi-client judgment
CN104077199A (en) * 2014-06-06 2014-10-01 中标软件有限公司 Shared disk based high availability cluster isolation method and system
CN106656624A (en) * 2017-01-04 2017-05-10 合肥康捷信息科技有限公司 Optimization method based on Gossip communication protocol and Raft election algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张育军, 黄红元: 《上海证券交易所联合研究报告 2011 证券信息前沿技术专集》 [Shanghai Stock Exchange Joint Research Report 2011: Collected Frontier Technologies for Securities Information], 31 December 2012
黄红元: 《上海证券交易所联合研究报告 2013 证券信息前沿技术专集》 [Shanghai Stock Exchange Joint Research Report 2013: Collected Frontier Technologies for Securities Information], 31 December 2014

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107741966A (en) * 2017-09-30 2018-02-27 郑州云海信息技术有限公司 A kind of node administration method and device
CN107612787A (en) * 2017-11-06 2018-01-19 南京易捷思达软件科技有限公司 A kind of cloud hostdown detection method for cloud platform of being increased income based on Openstack
CN107612787B (en) * 2017-11-06 2021-01-12 南京易捷思达软件科技有限公司 Cloud host fault detection method based on Openstack open source cloud platform
CN108011880A (en) * 2017-12-04 2018-05-08 郑州云海信息技术有限公司 The management method and computer-readable recording medium monitored in cloud data system
CN108089911A (en) * 2017-12-14 2018-05-29 郑州云海信息技术有限公司 The control method and device of calculate node in OpenStack environment
CN108449200A (en) * 2018-02-02 2018-08-24 云宏信息科技股份有限公司 A kind of mask information wiring method and device based on control node
CN109286529A (en) * 2018-10-31 2019-01-29 武汉烽火信息集成技术有限公司 A kind of method and system for restoring RabbitMQ network partition
CN109286529B (en) * 2018-10-31 2021-08-10 武汉烽火信息集成技术有限公司 Method and system for recovering RabbitMQ network partition
CN109981204A (en) * 2019-02-21 2019-07-05 福建星云电子股份有限公司 A kind of Multi-Machine Synchronous method of BMS analogue system
CN109981782A (en) * 2019-03-28 2019-07-05 山东浪潮云信息技术有限公司 Remote storage abnormality eliminating method and system for cluster fissure
CN109981782B (en) * 2019-03-28 2022-03-22 浪潮云信息技术股份公司 Remote storage exception handling method and system for cluster split brain
CN110377487A (en) * 2019-07-11 2019-10-25 无锡华云数据技术服务有限公司 A kind of method and device handling high-availability cluster fissure
CN112291288B (en) * 2019-07-24 2022-10-04 北京金山云网络技术有限公司 Container cluster expansion method and device, electronic equipment and readable storage medium
CN112291288A (en) * 2019-07-24 2021-01-29 北京金山云网络技术有限公司 Container cluster expansion method and device, electronic equipment and readable storage medium
CN111124765A (en) * 2019-12-06 2020-05-08 中盈优创资讯科技有限公司 Big data cluster task scheduling method and system based on node labels
CN111371599A (en) * 2020-02-26 2020-07-03 山东汇贸电子口岸有限公司 Cluster disaster recovery management system based on ETCD
CN113760610A (en) * 2020-06-01 2021-12-07 富泰华工业(深圳)有限公司 OpenStack-based bare computer high-availability realization method and device and electronic equipment
CN111475386A (en) * 2020-06-05 2020-07-31 中国银行股份有限公司 Fault early warning method and related device
CN111475386B (en) * 2020-06-05 2024-01-23 中国银行股份有限公司 Fault early warning method and related device
CN112838965A (en) * 2021-02-19 2021-05-25 浪潮云信息技术股份公司 Method for identifying and recovering strong synchronization role fault
CN113162797A (en) * 2021-03-03 2021-07-23 山东英信计算机技术有限公司 Method, system and medium for switching master node fault of distributed cluster
CN113162797B (en) * 2021-03-03 2023-03-21 山东英信计算机技术有限公司 Method, system and medium for switching master node fault of distributed cluster
CN114500327A (en) * 2022-04-13 2022-05-13 统信软件技术有限公司 Detection method and detection device for server cluster and computing equipment

Similar Documents

Publication Publication Date Title
CN107147540A (en) Fault handling method and fault handling cluster in a high-availability system
CN103607297B (en) Fault processing method of computer cluster system
CN107239383A (en) A kind of failure monitoring method and device of OpenStack virtual machines
US20140372805A1 (en) Self-healing managed customer premises equipment
CN105323113A (en) A visualization technology-based system fault emergency handling system and a system fault emergency handling method
CN112181660A (en) High-availability method based on server cluster
CN101996106A (en) Method for monitoring software running state
CN104579791A (en) Method for achieving automatic K-DB main and standby disaster recovery cluster switching
US20080313319A1 (en) System and method for providing multi-protocol access to remote computers
CN110134518A (en) A kind of method and system improving big data cluster multinode high application availability
CN106330523A (en) Cluster server disaster recovery system and method, and server node
CN103490919A (en) Fault management system and fault management method
CN114090184B (en) Method and equipment for realizing high availability of virtualization cluster
CN107947998A (en) A kind of real-time monitoring system based on application system
CN110138611A (en) Automate O&M method and system
JP2013130901A (en) Monitoring server and network device recovery system using the same
CN105849702A (en) Cluster system, server device, cluster system management method, and computer-readable recording medium
CN107071189B (en) Connection method of communication equipment physical interface
CN101854263B (en) Method, system and management server for analysis processing of network topology
CN101442437A (en) Method, system and equipment for implementing high availability
CN110677288A (en) Edge computing system and method generally used for multi-scene deployment
CN110399254A (en) A kind of server CMC dual-locomotive heat activating method, system, terminal and storage medium
CN104346233A (en) Fault recovery method and device for computer system
CN107133130A (en) Computer operational monitoring method and apparatus
TWI698741B (en) Method for remotely clearing abnormal status of racks applied in data center

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170908)