CN103684941B - Method and device for preventing cluster split-brain based on an arbitration server - Google Patents

Method and device for preventing cluster split-brain based on an arbitration server

Info

Publication number
CN103684941B
Authority
CN
China
Prior art keywords
service
cluster
lock
node
arbitrating server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310615821.1A
Other languages
Chinese (zh)
Other versions
CN103684941A (en)
Inventor
蔡强
董春青
袁泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Zhongxing Newstart Technology Co Ltd
Original Assignee
Guangdong Zhongxing Newstart Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Zhongxing Newstart Technology Co Ltd filed Critical Guangdong Zhongxing Newstart Technology Co Ltd
Priority to CN201310615821.1A priority Critical patent/CN103684941B/en
Publication of CN103684941A publication Critical patent/CN103684941A/en
Application granted granted Critical
Publication of CN103684941B publication Critical patent/CN103684941B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Hardware Redundancy (AREA)

Abstract

The invention discloses a method and device for preventing split-brain in a high-availability cluster based on an arbitration server, belonging to the split-brain prevention techniques of the computer cluster field. It addresses the problem that, when the cluster heartbeat network is interrupted, a node cannot accurately determine the state of the other nodes and of the services they run, so that either a service cannot be taken over or a service runs on two nodes at the same time. The scheme provided by the embodiments of the invention includes the following: when the heartbeat network is interrupted, a cluster node that is not running the service may take over the service only after it has obtained the corresponding service lock from the arbitration server, which avoids split-brain; after the service stops, the arbitration server reclaims the service lock and allows other cluster nodes to compete for it again; when several nodes compete for the service lock at the same time, only one node succeeds and may start the service, which prevents split-brain.

Description

Method and device for preventing cluster split-brain based on an arbitration server
Technical field
The invention belongs to the field of computer cluster technology, applies to high-availability clusters (High-availability Cluster), and relates in particular to split-brain prevention for high-availability clusters.
Background art
With the rapid development of communication network technology, key sectors such as telecommunications, finance, and e-government place ever higher demands on server availability. High-availability (High Availability, HA) cluster technology can effectively reduce the service downtime that software and hardware faults cause in business systems.
Current high-availability cluster systems mainly use links such as network cables or serial lines between cluster nodes as a private heartbeat network, which exchanges and synchronizes information between nodes and monitors the running state of each node in the cluster. When the service-running node fails, the backup node receives no heartbeat message from it within a certain period, concludes that the service-running node has failed, and takes over the service. However, when all heartbeat links fail, the service-running node and the backup node may both start the service at the same time, causing cluster split-brain (Split-Brain) and data corruption.
To guarantee business continuity and data security for users, preventing cluster split-brain is essential. The usual approach at present is to fence the failed node by restarting it, or to fence the shared storage by isolating it with SCSI-3 reservation technology. The inventors found, however, that these methods have limitations. Real environments often lack the hardware needed for fencing, and the backup node often runs other important services, so customers do not allow the operating system to be restarted or the shared storage to be isolated. In addition, although disk-lock technology based on a shared disk array can partly solve cluster split-brain in LAN environments that have a shared disk array, it has many limitations of its own: the shared disk array must be repartitioned, and environments without a disk array, virtual machine environments, and geographically distributed wide-area-network clusters are not supported.
Summary of the invention
The purpose of the embodiments of the invention is to provide a method and device for preventing cluster split-brain based on an arbitration server that overcome the deficiencies of the prior art: without fencing a server node by restart and without fencing the shared storage by isolation, they still prevent split-brain and data corruption when the cluster heartbeat network is interrupted or abnormal. They also overcome the limitations of a disk-array quorum disk, which requires a shared disk array, requires the array to be repartitioned, works only within a LAN, and does not support virtual machine environments. The invention therefore suits high-availability cluster environments without a shared disk array, without repartitioning of a disk array, with virtual machine clusters, or with geographically distributed wide-area-network clusters.
The invention is realized by the following method and device:
When a node or the heartbeat network fails, a sub-cluster that is not running the service must first apply to the arbitration server for the service lock and obtain it before it can take over the service. If, for any reason, that sub-cluster cannot obtain the service lock, it must not start the service. This prevents two nodes from starting the service at the same time and so prevents cluster split-brain.
Because the original service-running node needs a time t_giveup between the interruption of its heartbeat and the stopping of its service, the sub-cluster attempting to take over the service may keep sending service lock requests during this time until it obtains the lock.
The service-running sub-cluster periodically sends service lock refresh messages to the arbitration server, which updates the timestamp of the current service lock and keeps the lock state unchanged. During this time, nodes of a non-service-running sub-cluster cannot obtain the corresponding service lock and cannot take over the service.
If, for reasons such as a network failure, the arbitration server receives no service lock refresh message within the time t_timeout, it considers that the service-running sub-cluster has crashed or has become an isolated cluster, and sets the state of the service lock to unknown. To ensure that the original service-running node has enough time to stop the service, the arbitration server then waits for the time t_giveup before setting the lock state to unlocked; at that point it regards the service as stopped and allows other nodes to compete for the service lock. This avoids a brief split-brain during takeover by the standby node caused by the service on the original node not having stopped completely.
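The timing in the preceding paragraph can be illustrated as a pure function of how long the arbitration server has gone without a refresh for a lock that was previously held. This is a minimal sketch, not the patented implementation: the function name `lock_state_after_silence` and the handling of exact boundary values are assumptions; only `t_timeout`, `t_giveup`, and the three state names come from the text.

```python
def lock_state_after_silence(silence: float, t_timeout: float, t_giveup: float) -> str:
    """Return the state of a previously held lock after `silence` seconds
    without any refresh message (hypothetical helper, boundaries assumed)."""
    if silence <= t_timeout:
        # Still within the validity period: the holder keeps the lock.
        return "locked"
    if silence <= t_timeout + t_giveup:
        # Holder is presumed isolated or crashed; give it time to stop the service.
        return "unknown"
    # The grace period has passed: other nodes may compete for the lock.
    return "unlocked"
```

With `t_timeout = 10` and `t_giveup = 30`, a lock that has been silent for 20 seconds is in the unknown state and becomes available only after 40 seconds of silence.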
At this point the service-running sub-cluster has lost its connection to the arbitration server and has become an isolated cluster. To keep the service running, two cases are handled: (1) when the number of nodes in the service-running sub-cluster is greater than 1/2 of the number of nodes in the original cluster, it continues to provide the service, so that a failure of the link to the arbitration server does not affect service availability; (2) when the number of nodes in the service-running sub-cluster is less than or equal to 1/2 of the number of nodes in the original cluster, it stops the service and releases the service lock. In this case the backup nodes of the service-running sub-cluster do not take over the service, and if the service cannot be stopped normally within the time t_giveup, the service-running node restarts its system so that other sub-clusters can take over. When the service-running sub-cluster holds more than 1/2 of the nodes, any non-service-running sub-cluster necessarily holds fewer than 1/2, so it does not attempt to apply for the service lock and take over the service, and there is no risk of split-brain.
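The two cases above amount to a strict-majority test. The sketch below is one way to express it, assuming the "greater than 1/2" wording of this paragraph; the names `IsolatedAction` and `isolated_runner_action` are hypothetical and do not appear in the patent.

```python
from enum import Enum


class IsolatedAction(Enum):
    KEEP_RUNNING = "keep_running"
    STOP_AND_RELEASE = "stop_and_release"


def isolated_runner_action(sub_cluster_nodes: int, original_nodes: int) -> IsolatedAction:
    """Decide what an isolated service-running sub-cluster should do."""
    if original_nodes <= 0 or not 0 < sub_cluster_nodes <= original_nodes:
        raise ValueError("node counts must satisfy 0 < sub_cluster_nodes <= original_nodes")
    # n > N/2 is written as 2n > N to stay in integer arithmetic.
    if 2 * sub_cluster_nodes > original_nodes:
        return IsolatedAction.KEEP_RUNNING
    return IsolatedAction.STOP_AND_RELEASE
```

Under this rule a single surviving node of a two-node cluster holds exactly 1/2 of the nodes and therefore stops the service once it loses the arbitration server.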
To improve service availability and maximize the ability of the service to keep running, the 1/2-node algorithm can also be disabled by an option. A non-service-running sub-cluster then attempts to acquire the lock and take over the service whenever a node state changes or the heartbeat fails, regardless of whether it holds more than 1/2 of the nodes. This mode improves service continuity but reduces data security and increases the risk of split-brain.
When the service fails, the cluster first stops the service and actively releases the service lock to the arbitration server. The service must finish stopping within the maximum stop time t_giveup; if it has not stopped within t_giveup, the server must be restarted immediately. This ensures that the service has completely stopped before the arbitration server sets the service lock to unlocked and a backup node takes over the service.
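A stop-with-deadline routine along these lines could look as follows. It is only a sketch under stated assumptions: `stop_service`, `is_stopped`, `release_lock`, and `reboot` are hypothetical callbacks supplied by the caller, and the polling interval is an invented parameter; the patent only specifies the `t_giveup` deadline and the restart fallback.

```python
import time
from typing import Callable


def stop_or_reboot(
    stop_service: Callable[[], None],
    is_stopped: Callable[[], bool],
    release_lock: Callable[[], None],
    reboot: Callable[[], None],
    t_giveup: float,
    poll_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Stop the service; return True if it stopped within t_giveup,
    otherwise trigger a reboot and return False."""
    deadline = clock() + t_giveup
    stop_service()
    while True:
        if is_stopped():
            # Service is down: hand the lock back to the arbitration server.
            release_lock()
            return True
        if clock() >= deadline:
            break
        sleep(poll_interval)
    # The service did not stop in time; restart the server as a last resort.
    reboot()
    return False
```

Passing `clock` and `sleep` as parameters keeps the routine testable without real waiting.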
In another aspect, the invention provides a split-brain prevention device based on an arbitration server, comprising:
A cluster-side agent module. The service-running sub-cluster elects a communication node that periodically sends refresh-service-lock messages to the arbitration server; a refresh message mainly contains the service name, the node refreshing the lock, the refresh timestamp, and so on. A non-service-running sub-cluster elects a communication node that, before attempting to take over the service, sends a service lock request message to the arbitration server; the request contains the service name, the name of the node requesting the lock, and so on.
An arbitration server module. When the service lock is in the unlocked state, the arbitration server grants it to the first node whose lock request arrives, then sets the lock to the locked state and updates the name of the lock-holding node. When the service lock is in the locked state, the arbitration server refreshes the timestamp of that lock on each refresh message and returns a lock-acquisition-failed message to service lock requests from backup nodes in non-service-running sub-clusters. The arbitration server maintains the information of each service lock, including the lock name, the lock state, the lock refresh timestamp, and the node where the service runs.
The invention implements a service lock arbitration device based on a client/server network architecture. It relies on the uniqueness of the lock-holding node: only the node that has obtained the service lock may start the service, which removes the risk of the service starting on two nodes at once and so avoids cluster split-brain. Compared with fencing by system restart or by shared-storage isolation, the service lock concept lets the active and standby servers each run different services, which improves the utilization of server resources. The invention is easy to deploy: it needs no equipment such as a shared disk array, and any machine that can run the arbitration program and that every cluster node can reach can be configured as the arbitration server. In virtualized environments and geographically distributed wide-area-network clusters, where conventional fencing and disk-array quorum disk techniques do not apply, the invention can still arbitrate effectively and provides service consistency and data security for virtual machine clusters and remote clusters. The invention applies to both two-node and multi-node high-availability clusters.
Brief description of the drawings
To explain the embodiments of the invention and the prior-art technical solutions more clearly, the drawings used in describing them are briefly introduced below.
Fig. 1 is a structure diagram of the service lock provided by the invention;
Fig. 2a shows the content of the refresh-service-lock message provided by the invention;
Fig. 2b shows the content of the service lock request message provided by the invention;
Fig. 3 is a flow chart of a non-service-running sub-cluster applying for the service lock;
Fig. 4 is a flow chart of the service-running sub-cluster refreshing the service lock;
Fig. 5 is a flow chart of the arbitration server processing a service lock request message;
Fig. 6 is a flow chart of the arbitration server processing a refresh-service-lock message;
Fig. 7 is a flow chart of the arbitration server periodically checking the service lock refresh timestamp;
Fig. 8 is a schematic diagram of the arbitration-server-based cluster split-brain prevention device provided by the invention.
Detailed description of the embodiments
The invention is described clearly and completely below with reference to the drawings and embodiments. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments that a person of ordinary skill in the art obtains from these embodiments without creative effort fall within the scope of protection of the invention.
When the cluster heartbeat network is interrupted, a node cannot reliably judge the state of the other nodes or of the services they run and is therefore prone to wrong decisions, which lead to split-brain and destroy data consistency. To solve this problem, the embodiments of the invention provide a method and device based on an arbitration server that still prevent split-brain effectively while avoiding system restarts and storage isolation as far as possible.
A two-node cluster configured with an arbitration server is a representative high-availability cluster system. The system has two server nodes, node A and node B. The business service is provided externally over the global network, while the nodes exchange node information and monitor the running state of services over a private network. To make the heartbeat robust, the heartbeat network usually consists of two or more direct network cables or serial lines. The arbitration server grants the service lock to node A. After node A has started the service successfully, the heartbeat between nodes A and B is normal and the cluster is in the normal state; nodes A and B both send cluster-state-normal messages to the arbitration server, which keeps the service lock state unchanged.
If, for a fixed time t_heartbeat, nodes A and B receive no heartbeat from each other, each concludes that the heartbeat is interrupted, and the cluster splits into two sub-clusters of one node each. The service-running node A then periodically sends refresh-service-lock messages to the arbitration server to refresh the service lock. Node B, which does not hold the service lock, periodically sends service lock request messages to the arbitration server. The arbitration server arbitrates on the messages it receives from nodes A and B. The arbitration server maintains one service lock for each service in the cluster: a node must acquire the service lock before starting the service and releases it after stopping the service.
A multi-node high-availability cluster configured with an arbitration server behaves like the two-node system. While the cluster heartbeat network is normal, every node sends refresh-service-lock messages to the arbitration server, which keeps the service lock state unchanged. When the heartbeat network becomes abnormal, the cluster may split into two or more sub-clusters. The sub-cluster containing the service is called the service-running sub-cluster; a sub-cluster that does not contain the service is called a non-service-running sub-cluster. The nodes of the service-running sub-cluster all periodically send refresh-service-lock messages to the arbitration server, the nodes of non-service-running sub-clusters periodically send service lock request messages, and the arbitration server arbitrates on the messages from each node. When the service lock is in the unlocked state, every node may compete for it, but only one node succeeds. That node may start the service; it becomes the service-running node, and its sub-cluster becomes the service-running sub-cluster.
Fig. 1 shows the structure of the service lock maintained by the arbitration server. Its content includes the service lock name, the service lock state, the service lock refresh timestamp, the node where the service runs, the member nodes of the service-running sub-cluster, and so on.
Figs. 2a and 2b show the content of the refresh-service-lock message and of the service lock request message, respectively.
The embodiments described below cover both two-node and multi-node high-availability cluster systems.
As shown in Fig. 3, the embodiment of the invention includes a method by which a non-service-running sub-cluster on the cluster side applies to the arbitration server for the service lock:
In step 301, when the heartbeat network is interrupted, the cluster-side agent module detects that the cluster node state has changed. If the node is in a non-service-running sub-cluster, then in step 302 the sub-cluster sends a service lock request message to the arbitration server and waits for the arbitration result.
In step 303, the non-service-running sub-cluster receives the arbitration result returned by the arbitration server.
If the lock is acquired, then in step 304 the node starts the service corresponding to the service lock. The node becomes the service-running node and its sub-cluster becomes the service-running sub-cluster. The cluster-side agent modules of the other member nodes of that sub-cluster also detect that they are now in the service-running sub-cluster and have become its backup nodes. From then on these backup nodes no longer send service lock request messages to the arbitration server; they send refresh-service-lock messages instead and report the time of each refresh-success message to the service-running node, which uses it as its own refresh-success time.
In step 303, after receiving a lock-acquisition-failed message from the arbitration server, the node may keep sending service lock request messages to the arbitration server during the time t_giveup. If it receives a lock-acquisition-success message within t_giveup, it starts taking over the service. If it never receives one, the service is still running in another sub-cluster or has been started by another sub-cluster, so the node does not take over the service and split-brain is avoided.
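The retry behaviour of steps 302 to 304 can be sketched as a bounded loop. The code below is an illustration only: `request_lock` is a hypothetical callable standing in for one request/response round trip with the arbitration server, and `retry_interval` is an invented parameter, since the text does not say how often the request is repeated.

```python
import time
from typing import Callable


def try_takeover(
    request_lock: Callable[[], bool],
    t_giveup: float,
    retry_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Keep requesting the service lock for at most t_giveup seconds.

    Returns True if the lock was granted (the caller may start the service),
    False if the window closed without a grant (the caller must not start it).
    """
    deadline = clock() + t_giveup
    while True:
        if request_lock():
            return True
        if clock() + retry_interval > deadline:
            # No further attempt fits into the t_giveup window.
            return False
        sleep(retry_interval)
```

A caller would start the service only when `try_takeover(...)` returns `True`, which mirrors the rule that a node without the lock never starts the service.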
In this embodiment, if the node state were instead judged by pinging, over the global network, the IP address of the node whose heartbeat was lost, the state of the node and of the service could not be judged reliably in situations such as the system disallowing ping (that is, not responding to ICMP request packets), a damaged switch port, or a network storm, which is a serious safety risk. By having the arbitration server receive refresh-service-lock messages and service lock request messages, the inventors can detect the node state more accurately and can also detect the service state accurately.
Fig. 4 is the flow chart of the service-running sub-cluster refreshing the service lock, which includes:
In step 401, the service-running node sends a refresh-service-lock message to the arbitration server. The arbitration server receives the message, updates the service lock information according to its content, and returns a refresh-success message to the node.
In step 402, the service-running node checks whether the refresh succeeded. If it succeeded, the operation ends. If not, step 403 checks whether the difference between the current time and the time of the last successful refresh exceeds t_timeout. If it does, the service-running node considers that the service-running sub-cluster has lost its connection to the arbitration server and has become an isolated cluster, which no longer has the arbitration function of the arbitration server.
After the service-running sub-cluster has become an isolated cluster, step 404 handles two cases in order to preserve the continuity and reliability of the service: (1) when the number of nodes in the service-running sub-cluster is greater than or equal to 1/2 of the number of nodes in the original cluster, the service-running node does not stop the service and keeps serving externally; (2) when the number of nodes in the service-running sub-cluster is less than 1/2 of the number of nodes in the original cluster, the service-running node stops the service, and if the service cannot be stopped normally within the time t_giveup, the service-running node restarts the server system so that the service is finally stopped.
The refresh-success time that the service-running node maintains is in fact the latest refresh-success time among all nodes of the service-running sub-cluster. Every node of the sub-cluster periodically sends refresh-service-lock messages to the arbitration server, which returns refresh-success messages. After a backup node of the service-running sub-cluster receives its own refresh-success message, it reports that time to the service-running node. The service-running node compares it with the previous refresh-success time and takes the later of the two as its refresh-success time.
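The bookkeeping on the service-running node reduces to keeping the maximum of all reported refresh-success times and comparing it against `t_timeout` (step 403). A minimal sketch follows, assuming all nodes report times on a comparable clock; the class name `RefreshTracker` and its methods are hypothetical.

```python
class RefreshTracker:
    """Tracks the newest refresh-success time seen by the service-running node."""

    def __init__(self, t_timeout: float, started_at: float) -> None:
        self.t_timeout = t_timeout
        self.last_success = started_at

    def report_success(self, timestamp: float) -> None:
        # Reports from backup nodes may arrive out of order; keep the newest.
        self.last_success = max(self.last_success, timestamp)

    def is_isolated(self, now: float) -> bool:
        # Step 403: no successful refresh for longer than t_timeout means the
        # sub-cluster has lost the arbitration server.
        return now - self.last_success > self.t_timeout
```

Because any member's successful refresh counts, the sub-cluster is treated as isolated only when none of its nodes can reach the arbitration server.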
In this embodiment, for a two-node cluster, if service A runs on node 1, node 1 is the service-running node of service A and the other node, node 2, is its backup node. If service B runs on node 2, node 2 is the service-running node of service B and node 1 is its backup node. The same holds for a multi-node cluster: a node can be the service-running node of one service and the backup node of another.
Fig. 5 shows the service lock management procedure on the arbitration server side in the embodiment of the invention, which includes:
In this embodiment, while the cluster heartbeat is normal, the cluster does not split into sub-clusters, and the whole cluster is regarded as the largest service-running sub-cluster, containing all nodes. All nodes of the cluster therefore send refresh-service-lock messages to the arbitration server, which maintains the state of the corresponding service lock and keeps updating its refresh timestamp.
In this embodiment, when the cluster heartbeat network is interrupted, it is crucial for the arbitration server to determine whether the service has stopped. In step 501, when the arbitration server receives a service lock request message, it knows that the cluster heartbeat network has been interrupted and that the cluster has split into at least two sub-clusters.
In step 502, if the service lock is in the unlocked state, the arbitration server grants the lock: it sets the lock state to locked, records the requesting node as the service-running node, and returns a lock-acquisition-success message. In step 503, if the service lock is in the locked state, the arbitration server returns a lock-acquisition-failed message to the requesting node. If the service lock is in the unknown state, then when step 504 detects that the lock has not been refreshed for longer than t_giveup, the arbitration server grants the lock: it sets the lock state to locked, records the requesting node as the service-running node, and returns a lock-acquisition-success message.
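Steps 502 to 504 can be sketched as a single request handler. This is an illustration under assumptions, not the patent's code: the unrefreshed time in step 504 is measured here from the lock's last refresh timestamp, which is one reading of the text, the node-count condition described elsewhere for the unknown state is left out, and the names `ServiceLock` and `handle_lock_request` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceLock:
    state: str = "unlocked"            # "unlocked", "locked" or "unknown"
    refreshed_at: float = 0.0
    service_node: Optional[str] = None


def handle_lock_request(lock: ServiceLock, node: str, now: float, t_giveup: float) -> bool:
    """Return True for a lock-acquisition-success reply, False for a failure reply."""
    if lock.state == "unlocked":
        grant = True                                  # step 502
    elif lock.state == "locked":
        grant = False                                 # step 503
    else:
        # step 504 (assumed reading): unknown state, grant only after the
        # lock has gone unrefreshed for longer than t_giveup.
        grant = now - lock.refreshed_at > t_giveup
    if grant:
        lock.state = "locked"
        lock.service_node = node
        lock.refreshed_at = now
    return grant
```

Handling requests one at a time in this way is what guarantees a single winner when several nodes ask for the same lock: the first grant moves the lock to locked and every later request fails.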
As shown in Fig. 6, the invention provides the flow by which the arbitration server processes a refresh-service-lock message, which includes:
The arbitration server maintains one service lock for each service, and each service lock has a validity period of t_timeout. In step 601, the arbitration server receives a refresh-service-lock message; in step 602, it updates the service lock information according to the content of the message. If the service-running node changes or the node membership changes, the arbitration server learns this from the refresh messages and keeps the service lock information up to date. The field that must be updated on every refresh message is the refresh timestamp: each time the arbitration server receives a refresh-service-lock message, it updates the refresh timestamp so that the lock always records its latest refresh time.
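A refresh handler for steps 601 and 602 might look like the sketch below. Several choices here are assumptions: the server stamps the lock with its own receive time `now` rather than the timestamp carried in the message, a refresh for a lock in the unknown state returns it to locked (as the description of reconnection states), and a refresh for an unlocked lock is simply ignored, a case the text does not address. The names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ServiceLock:
    state: str = "unlocked"            # "unlocked", "locked" or "unknown"
    refreshed_at: float = 0.0
    service_node: Optional[str] = None
    members: List[str] = field(default_factory=list)


def handle_refresh(lock: ServiceLock, service_node: str, members: List[str], now: float) -> bool:
    """Apply one refresh-service-lock message; return True if it was accepted."""
    if lock.state == "unlocked":
        # Assumed behaviour: nobody holds the lock, so there is nothing to refresh.
        return False
    lock.state = "locked"              # also covers the unknown -> locked transition
    lock.refreshed_at = now            # the field updated on every refresh
    lock.service_node = service_node   # follow changes of the service-running node
    lock.members = list(members)       # follow membership changes
    return True
```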
As shown in Fig. 7, the invention provides the flow by which the arbitration server periodically checks the service lock refresh timestamp, which includes:
The arbitration server checks the refresh timestamp of each service lock at regular intervals. In step 701, if the difference between the current time and the refresh timestamp exceeds t_timeout, the arbitration server considers that the service-running sub-cluster has become an isolated cluster. Because the arbitration server cannot determine the state of the service, in step 702 it sets the service lock state to unknown.
If the arbitration server receives neither refresh-service-lock messages nor service lock request messages from any node for longer than t_timeout, or if the arbitration server has restarted after a fault, it considers itself disconnected from all nodes of the cluster. It then cannot determine the state of the service: the cluster heartbeat may be uninterrupted, with the whole cluster intact and the service running normally; the cluster may have just split; or the service may have stopped long ago. In all these cases the arbitration server sets the service lock to the unknown state. When the arbitration server reconnects with cluster nodes and receives a refresh-service-lock message, it sets the lock state to locked and updates the service lock information according to the content of the message. When it reconnects and receives a service lock request message in which the number of member nodes is greater than 1/2 of the number of nodes in the original cluster, it sets the lock state to unlocked after the time t_giveup.
It is worth noting that in this embodiment the service lock maintained by the arbitration server cannot deadlock. Deadlock here means that the service-running node holding the service lock fails or crashes without releasing the lock, so that no other node can ever acquire it. The arbitration server of the invention gives the service lock a validity period: the service-running sub-cluster holding the lock must periodically refresh the lock timestamp on the arbitration server within the validity period (t_timeout). If the arbitration server receives no refresh-service-lock message from the service-running sub-cluster within the validity period, and the number of member nodes in a received service lock request message is greater than 1/2 of the number of nodes in the original cluster, it reclaims the service lock, sets it to unlocked, and lets nodes compete for it again. Likewise, if the service-running sub-cluster receives no refresh-success message within the validity period (t_timeout) and its number of nodes is less than 1/2 of the number of nodes in the original cluster, it must stop the service.
The embodiment of the invention provides a method for preventing split-brain in a high-availability cluster based on an arbitration server. After the heartbeat network is interrupted and the cluster splits into several sub-clusters, each node can still monitor the state of each service lock and can still take over and start the service to serve externally, which maximizes service continuity. The technical solution provided by the embodiment solves the problem of enabling the cluster, after a heartbeat network interruption, to keep providing service reliably and with as little interruption as possible while avoiding system restarts and shared-storage isolation as far as possible.
As shown in Fig. 8, the invention provides a device for preventing split-brain in a high-availability cluster based on an arbitration server, comprising:
A cluster-side agent module 801, which detects which other member nodes are in the sub-cluster of the local node and whether the service is running in that sub-cluster.
In this embodiment, the cluster-side agent module 801 is a module within the cluster that receives cluster node membership change events in real time. When the node membership changes or the heartbeat is interrupted, the agent module determines whether the service is running on some node of its sub-cluster. If it is, the module continues to refresh the service lock; if not, the module sends a service lock request message and attempts to take over the service.
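The agent module's choice between the two message types depends only on whether the service-running node is still visible in the local sub-cluster. A minimal sketch, with a hypothetical function name and string return values chosen for illustration:

```python
from typing import AbstractSet, Optional


def choose_message(sub_cluster_members: AbstractSet[str], service_node: Optional[str]) -> str:
    """Pick the message the agent sends after a membership change.

    Returns "refresh" when the service runs on a node of this sub-cluster,
    otherwise "request" (apply for the service lock and attempt takeover).
    """
    if service_node is not None and service_node in sub_cluster_members:
        return "refresh"
    return "request"
```

For a two-node cluster that splits while the service runs on node A, `choose_message({"A"}, "A")` yields a refresh on node A and `choose_message({"B"}, "A")` yields a lock request on node B.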
Arbitrating server end services lock module 802, for handling service off-duty sub-cluster querying node, seizing service lock Message, processing service operation subset group node refreshing service lock message, distribution service lock are safeguarded, more new demand servicing to winning node Lock information.
In this embodiment, the service lock maintenance module is the main module of the arbitration service program. It processes the messages received at the arbitrating server side and produces the arbitration result. When the cluster heartbeat network is interrupted, it is vital that the service lock maintenance module determine whether the service has stopped; the messages it may process are service lock refresh messages and service lock application messages.
In this embodiment, when processing a service lock application message, the service lock maintenance module checks the state of the service lock. If the service lock is in the locked state, it returns a lock-seizure failure message to the applying node. If the service lock is in the unknown state and the number of member nodes in the application message is greater than 1/2 of the number of nodes in the original cluster, the arbitrating server sets the state of the service lock to locked after the t_giveup time, updates the lock-holding node name and the lock-holding timestamp, and declares the seizure successful.
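A minimal sketch of this application handling is given below. The `ServiceLock` class, its field names, and `handle_apply` are hypothetical. Two behaviours are assumptions rather than statements from the text: an application against an unknown lock is refused while t_giveup has not yet elapsed or when the applicant lacks a majority, and an unlocked lock is granted directly to the applicant.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceLock:
    state: str = "unlocked"                 # "locked", "unlocked" or "unknown"
    holder: Optional[str] = None            # name of the lock-holding node
    locked_at: Optional[float] = None       # lock-holding timestamp
    unknown_since: Optional[float] = None   # when the state became unknown


def handle_apply(lock, node, members, cluster_size, now, t_giveup):
    """Process one service lock application; return True if the node wins."""
    if lock.state == "locked":
        return False  # lock-seizure failure
    if lock.state == "unknown":
        majority = members * 2 > cluster_size
        waited = (lock.unknown_since is not None
                  and now - lock.unknown_since >= t_giveup)
        if not (majority and waited):
            return False  # assumption: refuse until both conditions hold
    # Grant: record the holder name and the lock-holding timestamp.
    lock.state = "locked"
    lock.holder = node
    lock.locked_at = now
    lock.unknown_since = None
    return True
```

Waiting for t_giveup before granting gives the previous holder time to stop (or restart) if it has been cut off, so the new holder does not start the service while the old one is still running.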
When processing a service lock refresh message, the service lock maintenance module updates the service lock information according to the content of the message. Each time it processes a refresh message, it updates the refresh timestamp of the corresponding service lock, so that the latest refresh timestamp is always maintained. The service lock maintenance module periodically checks the refresh timestamp of each service lock. If the difference between the current time and the refresh timestamp of a service lock exceeds t_timeout, it considers that the service-running sub-cluster has become an isolated cluster; because the arbitrating server can no longer determine the state of the service, it sets the service lock state to unknown.
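The refresh handling and the periodic expiry check can be sketched as a small table of locks. The class `LockTable`, its method names, and the dictionary layout are hypothetical; rejecting a refresh from a node that is not the current holder is an assumption that the text does not state.

```python
class LockTable:
    """Hypothetical arbitrating-server-side table of service locks."""

    def __init__(self, t_timeout):
        self.t_timeout = t_timeout
        self.locks = {}  # service name -> {"state", "holder", "refreshed_at"}

    def grant(self, service, node, now):
        """Record that `node` holds the lock for `service`."""
        self.locks[service] = {"state": "locked", "holder": node, "refreshed_at": now}

    def handle_refresh(self, service, node, now):
        """Update the refresh timestamp; return True on success."""
        lock = self.locks.get(service)
        if lock is None or lock["state"] != "locked" or lock["holder"] != node:
            return False  # assumption: only the current holder may refresh
        lock["refreshed_at"] = now
        return True

    def check_expired(self, now):
        """Periodic check: mark locks not refreshed within t_timeout as unknown."""
        expired = []
        for service, lock in self.locks.items():
            if lock["state"] == "locked" and now - lock["refreshed_at"] > self.t_timeout:
                lock["state"] = "unknown"
                expired.append(service)
        return expired
```

In a running arbiter, `check_expired` would be called from a timer; here it takes the current time as an argument so that the logic can be exercised deterministically.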
The embodiment of the present invention provides a device for split-brain prevention in a high-availability cluster based on an arbitrating server. When the heartbeat network is interrupted, the arbitrating server receives the messages that the nodes send to it and, based on these messages, accurately maintains the service lock information. If the state of the service lock is unlocked, the arbitrating server allows nodes to compete for the service lock, and the winning node starts the service and provides it externally. If the state of the service lock is locked, the service has already been started on some node and is being provided externally, so other nodes do not take over the service and split-brain is avoided. The technical solution provided by the embodiment addresses the problem of how, while avoiding system reboot or shared-storage isolation as far as possible, the cluster can continue to provide services reliably and with as little interruption as possible after the cluster heartbeat network is interrupted.
The method and device for split-brain prevention in a high-availability cluster based on an arbitrating server provided by the embodiments of the present invention can be applied directly to high-availability clusters.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two.
The above describes only specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto. A person skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. Therefore, the contents of this specification should not be construed as limiting the present invention.

Claims (2)

1. A split-brain prevention method for a high-availability cluster based on an arbitrating server, characterized in that:
A cluster server node applies to the arbitrating server for the service lock before starting a service, and a cluster node that has not obtained the service lock must not start the service; when a node dies or the heartbeat fails, a sub-cluster that is not running the service decides whether to take over the service by periodically applying to the arbitrating server for the service lock; if the application succeeds it takes over the service, and if the application does not succeed it does not take over the service; this prevents the service from running in multiple sub-clusters at the same time;
Wherein: the split-brain state is one in which the cluster splits into several sub-clusters that lose contact with one another, each believing that the other nodes are dead and attempting to take over resources from the "dead nodes"; this leads to a series of serious problems such as the service running on multiple nodes at the same time and damage to shared data storage;
The service lock must be obtained before a service is started,
When a node that is not running the service begins attempting to take over the service, it periodically applies to the arbitrating server for the service lock within the t_giveup time; when the corresponding service lock on the arbitrating server is in the unlocked state, the node seizes the service lock and takes over the service;
The service-running node periodically refreshes the service lock,
The sub-cluster in which the service-running node resides selects a communication node to communicate with the arbitrating server, and this node periodically sends service lock refresh messages to the arbitrating server to refresh the service lock timestamp;
On a service fault, the service is stopped and the service lock is released,
When the service on the running node stops because of a fault, the service lock is released back to the arbitrating server; the sub-cluster selects a backup node internally to attempt to apply for the service lock and take over the service; after the backup node successfully applies to the arbitrating server for the service lock, it takes over the service and becomes the new service-running node; if the backup node fails to start the service, it stops the service and releases the service lock again;
When all backup nodes in the service-running sub-cluster have in turn applied for the service lock and failed to take over the service, the sub-cluster will not attempt again to apply for the service lock and take over the service unless a new node state change event occurs;
Handling of loss of contact between the cluster and the arbitrating server,
The service-running sub-cluster selects one node as the arbitrating server communication node; when the communication node of the service-running sub-cluster detects that the difference between the current time and the time of the last successful service lock refresh exceeds the predetermined t_timeout time, it considers the connection to the arbitrating server lost, and the service-running sub-cluster attempts to elect another node to communicate with the arbitrating server; when no node can communicate with the arbitrating server, the service-running sub-cluster becomes an isolated cluster and loses the arbitration function of the arbitrating server;
When the service-running sub-cluster becomes an isolated cluster, if the number of nodes in the sub-cluster is less than or equal to 1/2 of the number of nodes in the original cluster, the service-running node must stop the service; if the service cannot be stopped normally within the t_giveup time, the service-running node performs a system restart; the backup nodes in the service-running sub-cluster cannot take over the service, and no further service lock refresh messages are sent to the arbitrating server;
When the service-running sub-cluster becomes an isolated cluster, if the number of nodes in the service-running sub-cluster is greater than 1/2 of the number of nodes in the original cluster, the service-running node need not stop the service and continues to provide it externally;
Service lock handling at the arbitrating server side,
When the arbitrating server detects that the difference between the current time and the service lock refresh timestamp exceeds the predetermined time t_timeout, the arbitrating server considers the service-running sub-cluster disconnected and sets the state of the service lock to unknown;
After the t_giveup time has elapsed since the state of the service lock was set to unknown, the arbitrating server sets the state of the service lock to unlocked;
After the arbitrating server has set the state of the service lock to unlocked, if it receives a new service lock application, it assigns the service lock to the applying node.
2. The method according to claim 1, further comprising a high-availability redundancy mechanism for the arbitrating server, characterized in that:
To prevent the arbitrating server from becoming a single point of failure, an odd number of arbitrating servers is set up in the cluster; when competing for a service lock, the node that seizes the service lock on more than 1/2 of the arbitrating servers wins and takes over the service.
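As an illustration of the majority rule in claim 2, the winner test over several arbitrating servers can be sketched as below. The function name and the representation of results as a list of booleans are hypothetical and are not part of the claims.

```python
def wins_lock(grants):
    """Return True if the node seized the lock on more than half of the
    arbitrating servers.

    grants: one boolean per arbitrating server, True if that server
    granted the service lock to this node.
    """
    granted = sum(1 for g in grants if g)
    return granted * 2 > len(grants)
```

With an odd number of arbitrating servers, at most one node can satisfy this test for a given service lock, because two nodes cannot both hold the lock on more than half of the servers.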
CN201310615821.1A 2013-11-23 2013-11-23 Cluster based on arbitrating server splits brain preventing method and device Active CN103684941B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310615821.1A CN103684941B (en) 2013-11-23 2013-11-23 Cluster based on arbitrating server splits brain preventing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310615821.1A CN103684941B (en) 2013-11-23 2013-11-23 Cluster based on arbitrating server splits brain preventing method and device

Publications (2)

Publication Number Publication Date
CN103684941A CN103684941A (en) 2014-03-26
CN103684941B true CN103684941B (en) 2018-01-16

Family

ID=50321320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310615821.1A Active CN103684941B (en) 2013-11-23 2013-11-23 Cluster based on arbitrating server splits brain preventing method and device

Country Status (1)

Country Link
CN (1) CN103684941B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101291243A (en) * 2007-04-16 2008-10-22 广东省新支点技术服务有限公司 Split brain preventing method for highly available cluster system
CN102394914A (en) * 2011-09-22 2012-03-28 浪潮(北京)电子信息产业有限公司 Cluster brain-split processing method and device
CN102402395A (en) * 2010-09-16 2012-04-04 上海中标软件有限公司 Quorum disk-based non-interrupted operation method for high availability system
CN103209095A (en) * 2013-03-13 2013-07-17 广东新支点技术服务有限公司 Method and device for preventing split brain on basis of disk service lock

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100553920B1 (en) * 2003-02-13 2006-02-24 인터내셔널 비지네스 머신즈 코포레이션 Method for operating a computer cluster
KR101001559B1 (en) * 2008-10-09 2010-12-17 아주대학교산학협력단 Hybrid clustering based data aggregation method for multi-target tracking in the wireless sensor network
CN102799394B (en) * 2012-06-29 2015-02-25 华为技术有限公司 Method and device for realizing heartbeat services of high-availability clusters

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《The design and architecture of the Microsoft Cluster Service-a practical approach to high-availability and scalability》;W. Vogels等;《Fault-Tolerant Computing,1998. Digest of Papers. Twenty-Eighth Annual International Symposium on》;20020806;全文 *
《高可用集群系统仲裁机构设计》;张大年;《中国优秀硕士学位论文全文数据库·信息科技辑》;20111215(第S2期);全文 *

Also Published As

Publication number Publication date
CN103684941A (en) 2014-03-26

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 510663 Guangdong Province, Guangzhou Tianhe Science Park Gaotang New District high Pu Lu No. 1021 601

Applicant after: GUANGDONG ZHONGXING NEWSTART TECHNOLOGY CO., LTD.

Address before: 510663 Guangdong Province, Guangzhou Tianhe Science Park Gaotang New District high Pu Lu No. 1021 601

Applicant before: Guangdong NewStart Technology Service Ltd.

GR01 Patent grant