CN103684941A

CN103684941A - Arbitration-server-based cluster split-brain prevention method and device

Info

Publication number: CN103684941A
Application number: CN201310615821.1A
Authority: CN
Inventors: 蔡强; 董春青; 袁泉
Original assignee: GUANGDONG NEWSTART TECHNOLOGY SERVICE Ltd
Current assignee: GUANGDONG NEWSTART TECHNOLOGY SERVICE Ltd
Priority date: 2013-11-23
Filing date: 2013-11-23
Publication date: 2014-03-26
Anticipated expiration: 2033-11-23
Also published as: CN103684941B

Abstract

The invention discloses an arbitration-server-based method and device for preventing cluster split-brain, belonging to high-availability cluster split-brain prevention within the field of computer cluster technology. It addresses the problem that, when the cluster heartbeat network is interrupted, the states of other nodes and of the services running on them cannot be determined accurately, so that services either cannot be taken over or end up running on two nodes at the same time. In the scheme, when the heartbeat network is interrupted, a cluster node that is not running a service can take over that service only after acquiring the corresponding service lock from the arbitration server, which avoids split-brain. After a service stops, the arbitration server reclaims its service lock and allows other cluster nodes to compete for it again. When multiple nodes compete for one service lock, only one node succeeds and may start the service, which prevents split-brain.

Description

Arbitration-server-based cluster split-brain prevention method and device

(1) Technical field

The invention belongs to the field of computer cluster technology, is applicable to high-availability clusters, and relates in particular to split-brain prevention in high-availability clusters.

(2) Background

With the rapid development of communication network technology, key sectors such as telecommunications, finance, and e-government place ever higher demands on server availability. High-availability (HA) cluster technology can effectively reduce the service downtime that business systems suffer because of software or hardware faults.

Current high-availability cluster systems mainly use links such as network cables or serial cables as a private heartbeat network for communication between cluster nodes. This network exchanges and synchronizes information between nodes and monitors the running state of each node in the cluster. When the node running a service fails, the backup node receives no heartbeat message from it within a certain time, concludes that it has failed, and takes over the service. However, when all heartbeat links fail, the service-running node and the backup node may both start the business service at the same time, causing cluster split-brain and data corruption.

To ensure business continuity and data security for users, preventing cluster split-brain is essential. The usual approach at present is to fence the faulty node by restarting it, or to fence the shared storage using SCSI-3 persistent reservations. However, the inventors found that these methods have limitations. In real environments the hardware required for fencing is often unavailable, and the backup node is often running other important business services, so the customer does not allow the operating system to be restarted or the shared storage to be isolated. In addition, although disk-lock technology based on a shared disk array can partly solve the split-brain problem in local-area-network settings that have a shared disk array, it too has many limitations: the shared disk array must be repartitioned, and environments without a disk array, virtual machine environments, and geographically dispersed clusters over a wide area network are not supported.

(3) Summary of the invention

The object of the invention is to provide an arbitration-server-based cluster split-brain prevention method and device that overcome the deficiencies of the prior art: without fencing a server node by restarting it and without fencing the shared storage, cluster split-brain and data corruption can still be prevented when the cluster heartbeat network is interrupted or abnormal. The invention also overcomes the limitations of disk-array quorum disks, which require a shared disk array, require the array to be repartitioned, work only within a local area network, and do not support virtual machine environments. It is therefore suitable for high-availability cluster environments such as clusters with no shared disk array, clusters where the disk array should not be repartitioned, virtual machine clusters, and geographically dispersed wide-area-network clusters.

The invention is implemented by the following method and device:

When a node or the heartbeat network fails, a sub-cluster that is not running the service must first apply to the arbitration server for the service lock and obtain it before it can take over the service. If, for any reason, that sub-cluster cannot obtain the service lock, it must not start the service. This prevents two nodes from starting the service at the same time and so prevents cluster split-brain.

After a heartbeat interruption, the original service-running node may need up to a time t_giveup to stop the service. A sub-cluster that is attempting to take over the service may therefore keep sending service-lock applications during this time until it obtains the service lock.

The service-running sub-cluster periodically sends service-lock refresh messages to the arbitration server. The arbitration server updates the current service-lock timestamp and keeps the service-lock state unchanged. While this continues, nodes outside the service-running sub-cluster cannot obtain the corresponding service lock and cannot take over the service.

If, because of a network fault or a similar cause, the arbitration server receives no service-lock refresh message within the time t_timeout, it concludes that the service-running sub-cluster has crashed or has become an isolated cluster, and sets the service lock to the unknown state. To give the original service-running node enough time to stop the service, the arbitration server then waits for a further time t_giveup before setting the service lock to the unlocked state, at which point the service is considered stopped and other nodes are allowed to compete for the service lock. This avoids the brief split-brain that would arise if a standby node took over while the service on the original node had not yet fully stopped.
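The two timeouts in the preceding paragraph can be read as a simple function of how long the arbitration server has gone without a refresh. The following Python sketch is only an illustration of that reading: the function name, the enum, and the treatment of the exact boundary values are assumptions, and it ignores the granularity of the server's periodic check.

```python
from enum import Enum


class LockState(Enum):
    LOCKED = "locked"
    UNKNOWN = "unknown"
    UNLOCKED = "unlocked"


def state_after_silence(elapsed: float, t_timeout: float, t_giveup: float) -> LockState:
    """State of a previously locked service lock after `elapsed` seconds
    without any refresh message (hypothetical helper, not from the patent)."""
    if elapsed <= t_timeout:
        # Still inside the validity period: the holder keeps the lock.
        return LockState.LOCKED
    if elapsed <= t_timeout + t_giveup:
        # Holder presumed crashed or isolated, but it may still be stopping the service.
        return LockState.UNKNOWN
    # The holder has had t_giveup to stop (or reboot): others may compete.
    return LockState.UNLOCKED
```

Under this reading, a standby node can acquire the lock of a silent holder no earlier than t_timeout + t_giveup after the last refresh.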

At this point the service-running sub-cluster has lost its connection to the arbitration server and has become an isolated cluster. To keep the service running where possible, two cases are distinguished: (1) if the number of nodes in the service-running sub-cluster is greater than 1/2 of the number of nodes in the original cluster, it continues to provide service externally, so that a failure of the link to the arbitration server does not affect service availability; (2) if the number of nodes in the service-running sub-cluster is less than or equal to 1/2 of the number of nodes in the original cluster, it stops the service and releases the service lock. In case (2) the backup nodes of the service-running sub-cluster do not take over the service, and if the service cannot be stopped normally within t_giveup, the service-running node restarts its system so that another sub-cluster can take over. When the service-running sub-cluster holds more than 1/2 of the nodes, the sub-cluster not running the service necessarily holds fewer than 1/2, so it does not attempt to apply for the service lock or take over the service, and there is no risk of split-brain.

To improve service availability and maximize the ability to keep the service running, the 1/2-node rule can optionally be disabled. In that case, regardless of whether the sub-cluster not running the service holds more than 1/2 of the nodes, it tries to acquire the lock and take over the service whenever node state changes or the heartbeat fails. This mode improves service continuity but reduces data security and increases the risk of cluster split-brain.
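The 1/2-node rule and its optional disabling can be sketched as two small predicates. This is a minimal Python illustration, not the patented implementation: the function names and the `half_rule` flag are hypothetical, and the handling of a sub-cluster holding exactly half of the nodes in `may_apply_for_lock` is an assumption, since the text only says that a sub-cluster with fewer than 1/2 of the nodes does not apply.

```python
def keeps_service_when_isolated(sub_nodes: int, total_nodes: int) -> bool:
    """Service-running sub-cluster that has lost the arbitration server:
    keep serving only with more than 1/2 of the original cluster's nodes."""
    return 2 * sub_nodes > total_nodes


def may_apply_for_lock(sub_nodes: int, total_nodes: int, half_rule: bool = True) -> bool:
    """Sub-cluster not running the service: decide whether to try for the lock.

    With the rule disabled, always try (more availability, more split-brain risk).
    With the rule enabled, skip the attempt when below half of the original
    cluster; treating exactly half as allowed is an assumption.
    """
    if not half_rule:
        return True
    return 2 * sub_nodes >= total_nodes
```

For a five-node cluster split 3/2 with the arbitration server unreachable from the three-node side, `keeps_service_when_isolated(3, 5)` is true and `may_apply_for_lock(2, 5)` is false, so only one side runs the service.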

When a service fails, the cluster first performs the stop-service operation and actively releases the service lock to the arbitration server. The service must stop within its maximum stop time t_giveup. If it has not stopped within t_giveup, the server must be restarted immediately, which guarantees that the service has stopped before the arbitration server sets the service lock to unlocked and a backup node takes over the service.

In another aspect, the invention provides an arbitration-server-based split-brain prevention device, comprising:

Cluster-side agent module. The service-running sub-cluster elects a communication node that periodically sends refresh-service-lock messages to the arbitration server; a refresh-service-lock message mainly contains the service name, the node refreshing the lock, the refresh timestamp, and so on. A sub-cluster not running the service elects a communication node that, before attempting to take over the service, sends an apply-service-lock message to the arbitration server; this message contains the service name, the name of the node applying for the lock, and so on.

Arbitration server module. When a service lock is in the unlocked state, the arbitration server grants it to the first node whose lock application arrives, then sets the service lock to the locked state and updates the name of the lock-holding node. When a service lock is in the locked state, the arbitration server refreshes the timestamp of that service lock and returns a lock-acquisition failure message to service-lock applications from backup nodes of sub-clusters not running the service. The arbitration server maintains the information of each service lock, including the service-lock name, the state of the service lock, the service-lock refresh timestamp, and the node on which the service runs.

Based on a client/server network architecture, the invention implements a service-lock arbitration device. Because the holder of a service lock is unique, only the node that has obtained the service lock can start the service, which avoids the risk of the service being started on two nodes at the same time and thus avoids cluster split-brain. Compared with fencing by system restart or by isolating shared storage, the invention is based on the concept of a service lock and allows the active and standby servers each to run different services, improving the utilization of server resources. The invention is easy to deploy and needs no equipment such as a shared disk array: any machine that can run the arbitration program and that every cluster node can connect to can be configured as the arbitration server. In virtualized environments and in geographically dispersed clusters over a wide area network, conventional fencing and disk-array quorum-disk techniques are both inapplicable, whereas the invention can arbitrate effectively in these environments and provides service consistency and data security for virtual machine clusters and geographically dispersed clusters. In addition, the invention applies to high-availability clusters with two nodes as well as with multiple nodes.

(4) Brief description of the drawings

To illustrate the embodiments of the invention and the technical schemes of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below.

Fig. 1 is a structure diagram of the service lock provided by the invention;

Fig. 2a shows the content of the refresh-service-lock message provided by the invention;

Fig. 2b shows the content of the apply-service-lock message provided by the invention;

Fig. 3 is a flow chart of a sub-cluster not running the service applying for the service lock;

Fig. 4 is a flow chart of the service-running sub-cluster refreshing the service lock;

Fig. 5 is a flow chart of the arbitration server processing a received apply-service-lock message;

Fig. 6 is a flow chart of the arbitration server processing a received refresh-service-lock message;

Fig. 7 is a flow chart of the arbitration server periodically checking the service-lock refresh timestamp;

Fig. 8 is a schematic diagram of the arbitration-server-based cluster split-brain prevention device provided by the invention.

(5) Detailed description of the embodiments

The invention is described clearly and completely below with reference to the drawings and embodiments. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

When the cluster heartbeat network is interrupted, a node cannot reliably judge the state of other nodes or of the services they run, so it easily makes a wrong decision that leads to split-brain and destroys data consistency. To solve this problem, the embodiments of the invention provide an arbitration-server-based method and device that prevent split-brain effectively while avoiding, as far as possible, system restarts and storage isolation.

A two-node cluster configured with an arbitration server is a typical high-availability cluster system. The system has two server nodes, node A and node B. The business service is provided externally over the public network, while the nodes exchange node information and monitor the running state of each other's services over a private network. To make the heartbeat robust, the heartbeat network generally consists of two or more direct network cables or serial cables. The arbitration server grants the service lock to node A. After node A has started the service successfully, and while the heartbeat between nodes A and B is normal, the cluster is in the normal state: nodes A and B both send cluster-state-normal messages to the arbitration server, and the arbitration server keeps the service-lock state unchanged.

If nodes A and B receive no heartbeat from each other for a fixed time t_heartbeat, each concludes that the heartbeat is interrupted, and the cluster splits into two sub-clusters of one node each. The service-running node A then periodically sends refresh-service-lock messages to the arbitration server to refresh the service lock. Node B, which does not hold the service lock, likewise periodically sends apply-service-lock messages to the arbitration server. The arbitration server receives the messages from nodes A and B and arbitrates. The arbitration server maintains one service lock for each service in the cluster. A node must acquire the service lock before starting a service, and it releases the lock after stopping the service.

A multi-node high-availability cluster configured with an arbitration server behaves similarly to the two-node system. While the cluster heartbeat network is normal, every node sends refresh-service-lock messages to the arbitration server, and the arbitration server keeps the service-lock state unchanged. When the cluster heartbeat network is abnormal, the cluster may split into two or more sub-clusters. The sub-cluster in which the service runs is called the service-running sub-cluster, and a sub-cluster in which it does not run is called a non-service-running sub-cluster. The nodes of the service-running sub-cluster all periodically send refresh-service-lock messages to the arbitration server, the nodes of the non-service-running sub-clusters periodically send apply-service-lock messages, and the arbitration server receives each node's messages and arbitrates accordingly. When the service lock is in the unlocked state, every node may compete for it, but only one node can succeed. The successful node may start the service; it becomes the service-running node, and its sub-cluster becomes the service-running sub-cluster.

Fig. 1 shows the structure of the service lock maintained by the arbitration server. Its content includes the service-lock name, the state of the service lock, the service-lock refresh timestamp, the node on which the service runs, the member nodes of the service-running sub-cluster, and so on.

Figs. 2a and 2b show the content of the refresh-service-lock message and of the apply-service-lock message, respectively.

The embodiments described below cover high-availability cluster systems with two nodes as well as with multiple nodes.
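As an illustration of the records in Figs. 1, 2a and 2b, the fields named in the text can be written down as Python data classes. This is a sketch only: the class and field names are hypothetical, the text lists each record's content with "and so on", so the real records may carry more fields, and the member-node count on the apply message is taken from the later description of how the arbitration server evaluates applications.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ServiceLock:
    """One record per service, kept by the arbitration server (Fig. 1)."""
    name: str
    state: str = "unlocked"              # "unlocked", "locked" or "unknown"
    refresh_ts: float = 0.0              # time of the latest refresh
    service_node: Optional[str] = None   # node on which the service runs
    members: List[str] = field(default_factory=list)  # service-running sub-cluster


@dataclass
class RefreshMsg:
    """Refresh-service-lock message (Fig. 2a)."""
    service: str
    node: str
    refresh_ts: float


@dataclass
class ApplyMsg:
    """Apply-service-lock message (Fig. 2b)."""
    service: str
    node: str
    member_count: int  # size of the applying sub-cluster
```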

As shown in Fig. 3, the embodiment includes a method by which a non-service-running sub-cluster on the cluster side applies to the arbitration server for the service lock:

In step 301, when the heartbeat network is interrupted, the cluster-side agent module detects that the cluster node state has changed and that this node is in a non-service-running sub-cluster. In step 302, the sub-cluster therefore sends an apply-service-lock message to the arbitration server and waits for the arbitration result.

In step 303, the non-service-running sub-cluster receives the arbitration result returned by the arbitration server.

If the lock is acquired, the service corresponding to the service lock is started in step 304; this node becomes the service-running node and its sub-cluster becomes the service-running sub-cluster. At the same time, the cluster-side agent modules of the other member nodes of that sub-cluster detect that they are now in the service-running sub-cluster and have become its backup nodes. From then on, these backup nodes no longer send apply-service-lock messages to the arbitration server. They send refresh-service-lock messages instead, and each forwards the time of its refresh-success message to the service-running node, which uses it as its own refresh-success time.

If, in step 303, the node receives a lock-acquisition failure message from the arbitration server, it may keep sending apply-service-lock messages to the arbitration server for the time t_giveup. If it receives a lock-acquisition success message within t_giveup, it starts to take over the service. If it never receives a success message, this shows that the service is still running in another sub-cluster or has been started by another sub-cluster, so this node does not take over the service and split-brain is avoided.
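The retry behaviour of steps 302 to 304 can be sketched as a bounded loop. The following Python example is a minimal sketch under stated assumptions: `apply_once` stands in for one request/response exchange with the arbitration server, and the retry interval and the injectable `clock` and `sleep` parameters are hypothetical conveniences, none of which are specified in the patent.

```python
import time
from typing import Callable


def try_takeover(apply_once: Callable[[], bool],
                 t_giveup: float,
                 interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Keep applying for the service lock for at most t_giveup seconds.

    Returns True as soon as one application succeeds (the caller may then
    start the service) and False if the window closes without success
    (the caller must not start the service).
    """
    deadline = clock() + t_giveup
    while True:
        if apply_once():
            return True
        if clock() + interval > deadline:
            # No room for another attempt inside the t_giveup window.
            return False
        sleep(interval)
```

A caller would start the service only when `try_takeover(...)` returns `True`; a `False` result means the lock is still held elsewhere.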

In this embodiment, consider the alternative of judging a node's state by pinging, over the public network, the IP address of the node whose heartbeat was lost. When the system does not allow ping (that is, it does not respond to ICMP echo requests), when a switch port is damaged, during a network storm, and in similar situations, this cannot judge the node's state reliably, still less the state of the service, and it carries a considerable safety risk. By having the arbitration server receive refresh-service-lock and apply-service-lock messages, the inventors detect node state more accurately and can also detect the state of the service accurately.

Fig. 4 shows the flow by which the service-running sub-cluster refreshes the service lock, comprising:

In step 401, the service-running node sends a refresh-service-lock message to the arbitration server. When the arbitration server receives this message, it updates the service-lock information according to the content of the message and returns a refresh-success message to the node.

In step 402, the service-running node determines whether the refresh operation succeeded. If it succeeded, the procedure ends. If not, step 403 determines whether the difference between the current time and the time of the last successful refresh exceeds t_timeout. If it does, the service-running node concludes that the service-running sub-cluster has lost its connection to the arbitration server and has become an isolated cluster, so that it no longer has the arbitration function of the arbitration server.

After the service-running sub-cluster has become an isolated cluster, step 404 distinguishes two cases in order to preserve the continuity and reliability of the service: (1) if the number of nodes in the service-running sub-cluster is greater than or equal to 1/2 of the number of nodes in the original cluster, the service-running node does not need to stop the service and continues to serve externally; (2) if the number of nodes in the service-running sub-cluster is less than 1/2 of the number of nodes in the original cluster, the service-running node performs the stop-service operation, and if the service cannot be stopped normally within t_giveup, the service-running node restarts the server so that the service is finally stopped.

The refresh-success time maintained by the service-running node is in fact the latest refresh-success time of any node in the service-running sub-cluster. Every node of the service-running sub-cluster periodically sends refresh-service-lock messages to the arbitration server, and the arbitration server returns refresh-success messages. When a backup node of the service-running sub-cluster receives its own refresh-success message, it reports that time to the service-running node, which compares it with the previously recorded refresh-success time and keeps the later of the two as its service-lock refresh time.
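The bookkeeping described above amounts to keeping the maximum of the reported times and comparing it against t_timeout, as in step 403. A minimal Python sketch, with hypothetical function names and with all times assumed to be expressed on one comparable clock:

```python
def merge_refresh_time(recorded: float, reported: float) -> float:
    """Keep the later of the recorded refresh-success time and a time
    reported by a backup node of the same sub-cluster."""
    return max(recorded, reported)


def arbiter_unreachable(latest_success: float, now: float, t_timeout: float) -> bool:
    """Step 403: the sub-cluster is treated as isolated once no node in it
    has refreshed the lock successfully for longer than t_timeout."""
    return now - latest_success > t_timeout
```

Because any member's successful refresh counts, the service-running node is not treated as isolated merely because its own link to the arbitration server is down while a backup node can still reach it.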

In this embodiment, for a two-node cluster in which service A runs on node 1, node 1 is regarded as the service-running node of service A and the other node, node 2, is its backup node. If service B runs on node 2, then node 2 is the service-running node of service B and node 1 is its backup node. The same holds for a multi-node cluster: a node can be the service-running node of one service and at the same time the backup node of another.

Fig. 5 shows the service-lock management procedure on the arbitration server side in this embodiment, comprising:

In this embodiment, while the cluster heartbeat is normal the cluster does not split into two or more sub-clusters, and the whole cluster can be regarded as the largest possible service-running sub-cluster, containing all nodes. All nodes of the cluster therefore send refresh-service-lock messages to the arbitration server, and the arbitration server maintains the corresponding service-lock state and keeps updating the refresh timestamp of the service lock.

In this embodiment, when the cluster heartbeat network is interrupted, it is vital for the arbitration server to determine whether the service has stopped. In step 501, when the arbitration server receives an apply-service-lock message, it knows that the cluster heartbeat network has been interrupted and that the cluster has split into at least two sub-clusters.

In step 502, if the service lock is in the unlocked state, the arbitration server performs the lock-acquisition operation: it sets the service-lock state to locked, sets the service-running node of the lock to the applying node, and returns a lock-acquisition success message. In step 503, if the service lock is in the locked state, the arbitration server returns a lock-acquisition failure message to the applying node. If the service lock is in the unknown state, then once step 504 determines that the t_giveup waiting period for the service lock has elapsed, the arbitration server performs the lock-acquisition operation: it sets the service-lock state to locked, sets the service-running node of the lock to the applying node, and returns a lock-acquisition success message.
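Steps 502 to 504 can be sketched as a single handler on the arbitration server. The Python below is an illustrative sketch under assumptions, not the patented implementation: the `unknown_since` field, which records when the lock entered the unknown state, is hypothetical, as is the choice to measure the t_giveup wait from that moment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceLock:
    state: str = "unlocked"                 # "unlocked", "locked" or "unknown"
    service_node: Optional[str] = None
    refresh_ts: float = 0.0
    unknown_since: Optional[float] = None   # hypothetical bookkeeping field


def handle_apply(lock: ServiceLock, node: str, now: float, t_giveup: float) -> bool:
    """Return True if `node` is granted the lock, False otherwise."""
    if lock.state == "locked":
        return False                        # step 503: someone else holds it
    if lock.state == "unknown":
        # Step 504 (assumed reading): grant only after the old holder
        # has had t_giveup to stop the service.
        if lock.unknown_since is None or now - lock.unknown_since <= t_giveup:
            return False
    # Step 502, or step 504 after the wait: the applicant becomes the holder.
    lock.state = "locked"
    lock.service_node = node
    lock.refresh_ts = now
    lock.unknown_since = None
    return True
```

Because the handler mutates the lock before returning, a second applicant arriving afterwards sees the locked state and fails, which is what makes the holder unique, provided the server processes applications one at a time.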

Fig. 6 shows the flow by which the arbitration server processes a received refresh-service-lock message, comprising:

The arbitration server maintains one service lock for each service, and each service lock has a validity period of t_timeout. In step 601 the arbitration server receives a refresh-service-lock message, and in step 602 it updates the service-lock information according to the content of the message. If the service-running node has changed or the node membership has changed, the arbitration server learns this from the refresh-service-lock messages and keeps the service-lock information up to date. The one field of the service lock that must be updated on every refresh-service-lock message is the refresh timestamp: each time the arbitration server receives such a message it updates the refresh timestamp, so that the service lock always carries its latest refresh time.
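Steps 601 and 602 can be sketched as follows. This Python example is an assumption-laden illustration: the field and parameter names are hypothetical, and stamping the lock with the server's own receive time (rather than the timestamp carried in the message) and never moving the timestamp backwards are design choices of the sketch, not statements from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ServiceLock:
    service_node: Optional[str] = None
    members: List[str] = field(default_factory=list)
    refresh_ts: float = 0.0


def handle_refresh(lock: ServiceLock, service_node: str,
                   members: List[str], received_at: float) -> None:
    """Step 602: update the lock record from a refresh-service-lock message."""
    lock.service_node = service_node   # picks up a changed service-running node
    lock.members = list(members)       # picks up membership changes
    # The timestamp is the one field updated on every refresh; max() guards
    # against a late-arriving message rewinding it.
    lock.refresh_ts = max(lock.refresh_ts, received_at)
```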

Fig. 7 shows the procedure by which the arbitration server periodically checks the service-lock refresh timestamp, comprising:

The arbitration server periodically checks the refresh timestamp of each service lock. In step 701, if the arbitration server finds that the difference between the current time and the refresh timestamp of the service lock exceeds t_timeout, it concludes that the service-running sub-cluster has become an isolated cluster. Because the arbitration server cannot then determine the state of the service, in step 702 it sets the service lock to the unknown state.

If, for longer than t_timeout, the arbitration server receives neither a refresh-service-lock message nor an apply-service-lock message from any node, or if the arbitration server has restarted after a fault, it considers itself disconnected from all nodes of the cluster. At this point it cannot determine the state of the service: the cluster heartbeat may be uninterrupted, with the whole cluster unsplit and the service running normally; the cluster may just have split; or the service may already have stopped. In this situation the arbitration server sets all service locks to the unknown state. Once the arbitration server is connected to cluster nodes again, if it receives a refresh-service-lock message from a node, it sets the service lock to the locked state and updates the service-lock information according to the content of the message. If, after reconnecting, it instead receives an apply-service-lock message in which the number of member nodes is greater than 1/2 of the number of nodes in the original cluster, it sets the service lock to the unlocked state after the time t_giveup.

It is worth noting that the service lock maintained by the arbitration server in this embodiment cannot become "deadlocked". Deadlock here means that the service-running node holding the service lock cannot release it because of a fault or a crash, so that other nodes can never acquire the lock. In the invention, the arbitration server gives the service lock a validity period (t_timeout): the service-running sub-cluster that holds the lock must periodically refresh the lock timestamp at the arbitration server within the validity period. If the arbitration server receives no refresh-service-lock message from the service-running sub-cluster within the validity period, and the number of member nodes in a received apply-service-lock message is greater than 1/2 of the number of nodes in the original cluster, the service lock is reclaimed, set to the unlocked state, and opened to competition again. Likewise, if the service-running sub-cluster receives no refresh-success message within the validity period (t_timeout) and its number of nodes is less than 1/2 of the number of nodes in the original cluster, it stops the service.

The embodiment provides an arbitration-server-based method for preventing split-brain in a high-availability cluster. After the heartbeat network is interrupted and the cluster splits into several sub-clusters, each node can still monitor the state of each service lock and can still take over and start services to provide service externally, which maximizes service continuity. The technical scheme of the embodiment allows the cluster to keep providing service externally, reliably and with as little interruption as possible, after a cluster heartbeat network interruption, while avoiding system restarts and shared-storage isolation as far as possible.

As shown in Fig. 8, the invention provides an arbitration-server-based device for preventing split-brain in a high-availability cluster, comprising:

A cluster-side agent module 801, for detecting which other member nodes belong to the sub-cluster containing this node and for detecting whether the service is running in that sub-cluster.

In this embodiment, the cluster-side agent module 801 is a module within the cluster. It receives cluster-membership change events in real time. When the node membership changes or the heartbeat is interrupted, the cluster-side agent module determines whether the service is running on some node of its own sub-cluster. If it is, the module continues to refresh the service-lock information. If it is not, the module must send an apply-service-lock message and attempt to take over the service.
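The decision made by the cluster-side agent module on a membership change can be sketched as a single function. This Python illustration uses hypothetical names and assumes the agent knows which node was last running the service and which nodes it can still reach:

```python
from typing import Optional, Set


def on_membership_change(service_node: Optional[str], subcluster: Set[str]) -> str:
    """Choose the message type to send to the arbitration server.

    service_node: node known to be running the service, or None if unknown.
    subcluster:   nodes still reachable over the heartbeat network,
                  including this node.
    """
    if service_node is not None and service_node in subcluster:
        return "refresh"   # service runs in our sub-cluster: keep the lock alive
    return "apply"         # service is elsewhere or unknown: try to take over
```

For example, in a three-node cluster where node A runs the service and the heartbeat splits off node B, the agents on A and C choose "refresh" while the agent on B chooses "apply".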

An arbitration-server-side service-lock module 802, for processing the service-lock query and application messages sent by nodes of non-service-running sub-clusters, processing the refresh-service-lock messages sent by nodes of the service-running sub-cluster, granting the service lock to the winning node, and maintaining and updating the service-lock information.

In this embodiment, the service-lock maintenance module is the main module of the arbitration service program. It processes the messages received by the arbitration server and produces the arbitration result. When the cluster heartbeat network is interrupted, it is vital for the service-lock maintenance module to determine whether the service has stopped. The module may process either refresh-service-lock messages or apply-service-lock messages.

In this embodiment, when processing an apply-service-lock message, the service-lock maintenance module checks the state of the service lock. If the service lock is in the locked state, it returns a lock-acquisition failure message to the applying node. If the service lock is in the unknown state and the number of member nodes in the apply-service-lock message is greater than 1/2 of the number of nodes in the original cluster, the arbitration server sets the service lock to the locked state after the time t_giveup, updates the lock-holder node name and the lock timestamp, and reports that the lock was acquired.

When processing a refresh-service-lock message, the service-lock maintenance module updates the service-lock information according to the content of the message. Each time it processes a refresh-service-lock message it updates the refresh timestamp of the corresponding service lock, so that the service lock always carries its latest refresh timestamp. The service-lock maintenance module also periodically checks the refresh timestamp of each service lock. If the difference between the current time and the refresh timestamp exceeds t_timeout, it concludes that the service-running sub-cluster has become an isolated cluster; since the arbitration server cannot then determine the state of the service, the service lock is set to the unknown state.

The embodiment of the present invention provides a device for preventing split-brain in a high-availability cluster based on an arbitration server. When the heartbeat network is interrupted, the arbitration server receives the messages that each node sends to it and, based on these messages, accurately maintains the service lock information. If the service lock is in the unlocked state, the arbitration server allows nodes to compete for the service lock, and the winning node starts the service and provides it externally. If the service lock is in the locked state, the service has already been started on some node and is being provided externally, so other nodes do not take over the service and split-brain is avoided. With the technical solution provided by the embodiment, after the cluster heartbeat network is interrupted the cluster continues to provide service reliably and with as little interruption as possible, while avoiding as far as possible the need to restart systems or fence shared storage.

The embodiment of the present invention provides a method and device for preventing split-brain in a high-availability cluster based on an arbitration server, which can be applied directly to high-availability clusters.

The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two.

The above is only a specific embodiment of the present invention, and the protection scope of the present invention is not limited to it. A person skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. Therefore, the content of this description should not be construed as limiting the present invention.

Claims

1. A method and device for preventing split-brain in a high-availability cluster based on an arbitration server, characterized in that:

In the cluster, a server node must apply to the arbitration server for the service lock before starting a service, and a cluster node that has not obtained the service lock must not start the service; when a node dies or the heartbeat fails, the sub-cluster that is not running the service decides whether to take over the service by periodically applying to the arbitration server for the service lock; if it obtains the service lock it takes over the service, and if it does not obtain the service lock it does not take over; the service is thereby prevented from running in several sub-clusters at the same time;

Note: split-brain is the state in which a cluster splits into several sub-clusters that lose contact with one another, each believing that the other nodes are dead and attempting to take over resources from the "dead nodes"; as a result the service runs on several nodes at once, causing a series of serious problems such as corruption of data on shared storage;

1.1 The service lock must be obtained before the service is started, characterized in that:

From the moment it begins attempting to take over the service, the node not running the service periodically applies to the arbitration server for the service lock within the t_giveup time; when the corresponding service lock on the arbitration server is in the unlocked state, the node not running the service preempts the service lock and takes over the service;
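The periodic application within the t_giveup window may be pictured with the following Python sketch. It is illustrative only: `try_take_over`, the `interval` parameter, and the injected `apply_for_lock` callable (standing in for the request to the arbitration server) are hypothetical names, and the claim does not specify the retry interval.

```python
import time
from typing import Callable


def try_take_over(apply_for_lock: Callable[[], bool], t_giveup: float,
                  interval: float,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Apply for the service lock every `interval` seconds for up to t_giveup.

    Returns True as soon as one application succeeds (the caller then starts
    the service) and False if the window closes without obtaining the lock.
    """
    deadline = clock() + t_giveup
    while True:
        if apply_for_lock():
            return True
        if clock() + interval > deadline:
            return False  # the next attempt would fall outside the window
        sleep(interval)
```

The clock and sleep functions are passed in so that the loop can be exercised without real waiting; a node that receives False simply does not take over the service.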

1.2 The node running the service periodically refreshes the service lock, characterized in that:

The sub-cluster containing the node running the service elects a communication node to communicate with the arbitration server; this node periodically sends service lock refresh messages to the arbitration server to refresh the service lock timestamp and related information;

1.3 On a service fault, the service is stopped and the service lock is released, characterized in that:

When the service stops on the running node because of a fault, the service lock is released back to the arbitration server. A backup node may then be selected within the sub-cluster to apply for the service lock and take over the service; after its application to the arbitration server succeeds, the backup node takes over the service and becomes the new node running the service; if the backup node fails to start the service, it stops the service and releases the service lock again.

If all backup nodes in the sub-cluster running the service have in turn applied for the service lock and failed to take over the service, the sub-cluster makes no further attempt to apply for the service lock and take over the service unless a new node state change event occurs;

1.4 Handling of loss of contact between the cluster and the arbitration server, characterized in that:

The sub-cluster running the service elects one node as the arbitration server communication node; when this communication node detects that the difference between the current time and the time of the last successful service lock refresh exceeds the predetermined t_timeout, it considers the connection to the arbitration server lost, and the sub-cluster attempts to elect other nodes to communicate with the arbitration server; when no node can communicate with the arbitration server, the sub-cluster running the service becomes an isolated cluster and loses the arbitration function of the arbitration server;

When the sub-cluster running the service becomes an isolated cluster and its number of nodes is less than or equal to 1/2 of the number of nodes in the original cluster, the node running the service must stop the service; if the service cannot be stopped normally within the t_giveup time, the node running the service restarts its system; the backup nodes in the sub-cluster must not take over the service and no longer send service lock refresh messages to the arbitration server;

When the sub-cluster running the service becomes an isolated cluster and its number of nodes is greater than 1/2 of the number of nodes in the original cluster, the node running the service need not stop the service and continues to provide it externally;
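The two isolated-cluster cases above reduce to a strict-majority test, sketched below in Python. The function names and the returned strings are hypothetical labels chosen for illustration; only the 1/2 threshold and the t_giveup escalation come from the text.

```python
def isolated_cluster_action(sub_cluster_nodes: int, cluster_nodes: int) -> str:
    """Decide what the node running the service does once its sub-cluster is isolated."""
    # Strictly more than 1/2 of the original cluster: keep serving.
    if 2 * sub_cluster_nodes > cluster_nodes:
        return "keep_running"
    # Less than or equal to 1/2: the service must be stopped.
    return "stop_service"


def must_reboot(service_stopped: bool, elapsed: float, t_giveup: float) -> bool:
    """True if the service failed to stop normally within t_giveup, forcing a system restart."""
    return (not service_stopped) and elapsed >= t_giveup
```

For a four-node cluster split two and two, `isolated_cluster_action(2, 4)` yields `"stop_service"` on the isolated half, since exactly 1/2 is not a majority.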

1.5 Service lock handling on the arbitration server side, characterized in that:

When the arbitration server detects that the difference between the current time and the service lock refresh timestamp exceeds the predetermined time t_timeout, it considers the sub-cluster running the service disconnected and sets the service lock to the unknown state;

Once the t_giveup time has elapsed after the service lock was set to the unknown state, the arbitration server sets the service lock to the unlocked state;

After the service lock has been set to the unlocked state, if the arbitration server receives a new service lock application, it grants the service lock to the applying node.
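The sequence of lock states on the arbitration server side can be summarised as a function of the time elapsed since the last refresh, as in the Python sketch below. It assumes, for simplicity, that the arbitration server notices the timeout at the instant it occurs and that no new application arrives in the meantime; `lock_state_after_silence` is a hypothetical name.

```python
def lock_state_after_silence(elapsed: float, t_timeout: float, t_giveup: float) -> str:
    """State of a previously locked service lock after `elapsed` seconds without a refresh."""
    if elapsed <= t_timeout:
        return "locked"    # refreshes are still considered current
    if elapsed < t_timeout + t_giveup:
        return "unknown"   # holder presumed disconnected, lock not yet released
    return "unlocked"      # t_giveup has passed since the lock became unknown
```

With t_timeout = 10 and t_giveup = 20, a lock that stops being refreshed stays locked for 10 seconds, is unknown for the next 20, and can be granted to a new applicant from second 30 onwards. The t_giveup interval is the same one that claim 1.4 gives an isolated minority node to stop its service or restart.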

2. The method and device according to claim 1, further comprising the following functional modules and limitations, characterized in that:

A cluster-server-side arbitration server proxy module, for periodically refreshing the service lock with the arbitration server or applying to the arbitration server for the service lock;

An arbitration-server-side module, for handling service lock applications, service lock releases, service lock refreshes, service lock expiry, and the like;

2.1 The cluster-side proxy module, characterized in that:

The proxy module runs on every node of the cluster; the communication node elected by the sub-cluster running the service periodically sends service lock refresh messages to the arbitration server, each of which mainly comprises the service name, the name of the refreshing node, the refresh timestamp, and the like;

The communication node elected by a sub-cluster not running the service sends a service lock application message to the arbitration server before attempting to take over the service; the content of the application message comprises the service name, the name of the applying node, and the like;
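The two message types sent by the proxy module might be represented as in the Python sketch below. The claim lists only the message fields; the class names, the `kind` discriminator, the `member_count` field (taken from the description of the application message in the embodiment), and the JSON encoding are all assumptions made for illustration, since the wire format is not specified.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RefreshMessage:
    service_name: str
    node_name: str            # node refreshing the lock
    refresh_timestamp: float
    kind: str = "refresh"     # assumed discriminator field


@dataclass
class ApplyMessage:
    service_name: str
    node_name: str            # node applying for the lock
    member_count: int         # nodes in the applicant's sub-cluster
    kind: str = "apply"       # assumed discriminator field


def encode(message) -> bytes:
    """Serialise a message to UTF-8 JSON (assumed wire format)."""
    return json.dumps(asdict(message), sort_keys=True).encode("utf-8")


def decode(payload: bytes):
    """Rebuild the matching message object from its JSON form."""
    fields = json.loads(payload.decode("utf-8"))
    cls = {"refresh": RefreshMessage, "apply": ApplyMessage}[fields["kind"]]
    return cls(**fields)
```

A round trip such as `decode(encode(ApplyMessage("db", "node-b", 2)))` returns an equal `ApplyMessage`, which is what the arbitration-server-side module would receive.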

2.2 The arbitration-server-side module, characterized in that:

This module runs on the arbitration server;

When the service lock is in the unlocked state, the arbitration server grants the service lock to the first node to apply for it, then sets the service lock to the locked state and updates the lock-holder node name;

When the service lock is in the unknown state, the arbitration server checks whether the time elapsed since the update timestamp exceeds the t_timeout time; if it does, the application is deemed successful, the service lock is granted to the first node to apply for it, the service lock is then set to the locked state, and the lock-holder node name is updated;

When the service lock is in the locked state, the arbitration server refreshes the timestamp of that service lock, and returns a lock preemption failure message to service lock applications from backup nodes in sub-clusters not running the service;

The arbitration server maintains the information of each service lock, comprising: the service lock name, the service lock state, the service lock refresh timestamp, and the node on which the service resides;

2.3 Network-based server architecture and media implementation of the device, characterized in that:

The device is based on a networked client/server architecture: each cluster node acts as the client and sends requests to the arbitration server, and the arbitration server acts as the server and responds to those requests. It is implemented with the aid of equipment and media such as shared disk arrays.

3. The method and device according to claim 1, further comprising a high-availability redundancy mechanism for the arbitration server, characterized in that:

To prevent the arbitration server from becoming a single point of failure, an odd number of arbitration servers may be deployed in the cluster; when competing for a service lock, the node that acquires the service lock on more than 1/2 of the arbitration servers wins and takes over the service.
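The majority rule across several arbitration servers can be sketched in Python as follows. The function name `wins_service_lock` is hypothetical, and treating an unreachable arbitration server as a refusal is an assumption not stated in the claim.

```python
from typing import Iterable


def wins_service_lock(grants: Iterable[bool], arbiter_count: int) -> bool:
    """True if the node was granted the lock by more than 1/2 of the arbitration servers.

    `grants` holds one boolean per arbitration server that answered; servers
    that did not answer are simply absent and count as refusals (assumption).
    """
    if arbiter_count < 1 or arbiter_count % 2 == 0:
        raise ValueError("an odd number of arbitration servers is expected")
    granted = sum(1 for g in grants if g)
    return 2 * granted > arbiter_count
```

With three arbitration servers, a node granted the lock by two of them wins, while a node granted it by only one does not. Because each arbitration server grants a given service lock to a single node, at most one competitor can reach a strict majority.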