CN103684941B - Arbitration-server-based method and device for preventing cluster split-brain - Google Patents
- Publication number
- CN103684941B (application CN201310615821.1A)
- Authority
- CN
- China
- Prior art keywords
- service
- cluster
- lock
- node
- arbitrating server
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Hardware Redundancy (AREA)
Abstract
The invention discloses a method and device, based on an arbitration server, for preventing split-brain in a high-availability cluster; it belongs to high-availability cluster split-brain prevention within the field of computer cluster technology. It addresses the problem that, when the cluster heartbeat network is interrupted, a node cannot accurately determine the state of the other nodes and of the services they run, so that either the service cannot be taken over or the service runs on two nodes at once. The scheme provided by the embodiments includes the following. When the heartbeat network is interrupted, a cluster node that is not running the service may take over the service only after obtaining the corresponding service lock from the arbitration server, which avoids split-brain. After the service stops, the arbitration server reclaims the service lock and allows other cluster nodes to compete for it again. When several nodes compete for the service lock at the same time, only one node succeeds and may start the service, which prevents split-brain.
Description
Technical field
The invention belongs to the field of computer cluster technology. It applies to high-availability clusters and relates in particular to split-brain prevention in high-availability clusters.
Background
With the rapid development of communication network technology, key sectors such as telecommunications, finance, and e-government place ever higher demands on server availability. High-availability (HA) cluster technology can effectively reduce the service downtime that software and hardware faults cause in business systems.
Current HA cluster systems mainly use links between cluster nodes, such as network cables or serial cables, as a private heartbeat network. The heartbeat network exchanges and synchronizes information between nodes and monitors the running status of each node in the cluster. When the node running a service fails, the backup node receives no heartbeat message from it within a certain time, concludes that it has failed, and takes over the service. But when all heartbeat links fail, the service-running node and the backup node may both run the service at the same time, causing cluster split-brain and data corruption.
To ensure business continuity and data security for users, preventing cluster split-brain is essential. The common approach today is to fence the faulty node by restarting it, or to fence the shared storage using SCSI-3 reservation technology. However, the inventors found that these methods are limited: real environments often lack the hardware needed for fencing, and the backup node is often running other important services, so customers do not allow the operating system to be restarted or the shared storage to be isolated. In addition, although disk-lock technology based on a shared disk array can partly solve cluster split-brain in a LAN that has a shared disk array, it too has many limitations: it requires the shared array to be repartitioned, and it does not support environments without a disk array, virtual machine environments, or geographically distributed WAN clusters.
Summary of the invention
The purpose of the embodiments is to provide an arbitration-server-based method and device for preventing cluster split-brain that overcomes the deficiencies of the prior art. Without fencing server nodes by restart and without fencing shared storage, it still prevents split-brain and data corruption when the cluster heartbeat network is interrupted or abnormal. It also overcomes the limitations of disk-array arbitration disks, which require a shared disk array, require the array to be repartitioned, work only within a LAN, and do not support virtual machine environments. The invention suits HA cluster environments that have no shared disk array or cannot repartition one, as well as virtual machine clusters and geographically distributed WAN clusters.
The invention is realized by the following method and device:
When a node or the heartbeat network fails, a sub-cluster that is not running the service must first apply to the arbitration server for the service lock and obtain it before it may take over the service. If for any reason that sub-cluster cannot obtain the service lock, it must not start the service. This prevents two nodes from starting the service at the same time, and so prevents cluster split-brain.
Because the original service-running node needs a time t_giveup between the interruption of its heartbeat link and the stopping of its service, the sub-cluster attempting the takeover may keep sending service lock applications during this time until it obtains the service lock.
The service-running sub-cluster periodically sends service lock refresh messages to the arbitration server, which updates the timestamp of the current service lock and keeps the lock state unchanged. During this time, nodes of sub-clusters not running the service cannot obtain the corresponding service lock and cannot take over the service.
If, because of a network failure or a similar cause, the arbitration server receives no service lock refresh within the time t_timeout, it considers the service-running sub-cluster to have crashed or to have become an isolated cluster, and sets the state of the service lock to unknown. Then, to give the original service-running node enough time to stop the service, the arbitration server waits a further t_giveup before setting the lock state to unlocked. At that point the service is taken to have stopped and other nodes are allowed to compete for the service lock. This avoids the brief split-brain that would occur if the standby took over before the service on the original node had fully stopped.
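The timing described above can be sketched as a small function. This is a minimal illustration under assumptions: the function name and the string state names are hypothetical, and the source does not say whether the t_timeout and t_giveup boundaries are inclusive.

```python
def lock_state_after_silence(elapsed: float, t_timeout: float, t_giveup: float) -> str:
    """State of a held service lock after `elapsed` seconds without a refresh.

    Assumption: boundaries are treated as inclusive on the earlier state.
    """
    if elapsed <= t_timeout:
        return "locked"      # the holder is still considered alive
    if elapsed <= t_timeout + t_giveup:
        return "unknown"     # holder presumed lost; it gets t_giveup to stop the service
    return "unlocked"        # service assumed stopped; other nodes may compete
```

With t_timeout = 10 and t_giveup = 30, a lock that has been silent for 25 seconds would be reported as unknown, so an applicant is still refused at that point.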
At this point the service-running sub-cluster is disconnected from the arbitration server and has become an isolated cluster. To keep the service running, two cases are handled: (1) when the service-running sub-cluster has more than 1/2 of the original cluster's nodes, it continues to provide the service, so that a link fault to the arbitration server does not affect service availability; (2) when the service-running sub-cluster has 1/2 or fewer of the original cluster's nodes, it stops the service and releases the service lock. In case (2) the backup nodes of the service-running sub-cluster do not take over the service, and if the service cannot stop normally within t_giveup, the service-running node restarts its system so that other sub-clusters can take over. When the service-running sub-cluster has more than 1/2 of the nodes, any sub-cluster not running the service necessarily has fewer than 1/2, so it does not attempt to apply for the service lock and take over the service, and there is no split-brain risk.
To improve service availability and maximize service continuity, the 1/2-node rule can be disabled through an option. In that mode a sub-cluster not running the service, whether or not it holds more than 1/2 of the nodes, attempts to acquire the lock and take over the service whenever node state changes or the heartbeat fails. This mode improves service continuity but reduces data security and increases the risk of cluster split-brain.
When a service fails, the cluster first stops the service and actively releases the service lock to the arbitration server. The service must finish stopping within the maximum stop time t_giveup. If the service has not stopped within t_giveup, the server must be restarted immediately. This ensures that by the time the arbitration server sets the service lock to unlocked, and before a backup node takes over the service, the service has fully stopped.
In another aspect, the invention provides an arbitration-server-based split-brain prevention device, comprising:
A cluster-side agent module. The service-running sub-cluster elects a communication node that periodically sends refresh-service-lock messages to the arbitration server; a refresh message mainly contains the service name, the refreshing node, the refresh timestamp, and so on. A sub-cluster not running the service elects a communication node that, before attempting to take over the service, sends an apply-for-service-lock message to the arbitration server; an application message contains the service name, the name of the applying node, and so on.
Arbitrating server module.When service lock is in unlocked states, service lock is authorized first by arbitrating server
It is individual to enter to rob the node of lock application, service lock is then set to locked states, renewal accounts for lock nodename;When service lock is in
During locked states, special services are locked into line duration stamp and is refreshed, to the backup node from service off-duty sub-cluster
Service lock application return rob lock failed message;Arbitrating server safeguards the information of each service lock, including:Service lock title,
The state of service lock, service lock refresh time stamp, the node at service place.
The invention implements a service lock arbitration device based on a client/server network architecture. Because a service lock has exactly one holder, and only the node that holds the lock may start the service, the risk of the service starting on two nodes at once is avoided, and with it cluster split-brain. Compared with fencing by system restart or fencing of shared storage, the service-lock approach lets the active and standby servers each run different services, which improves the utilization of server resources. The invention is easy to deploy: it needs no shared disk array or similar equipment, and any machine that can run the arbitration program and can be reached by every cluster node may be configured as the arbitration server. In virtualized environments and geographically distributed WAN clusters, where conventional fencing and disk-array arbitration disks do not apply, the invention still arbitrates effectively and provides guarantees of service consistency and data safety for virtual machine clusters and remote clusters. The invention also applies to both two-node and multi-node HA clusters.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below.
Fig. 1 is a structure diagram of the service lock provided by the invention;
Fig. 2a shows the contents of the refresh-service-lock message provided by the invention;
Fig. 2b shows the contents of the apply-for-service-lock message provided by the invention;
Fig. 3 is a flow chart of a non-service-running sub-cluster applying for the service lock;
Fig. 4 is a flow chart of the service-running sub-cluster refreshing the service lock;
Fig. 5 is a flow chart of the arbitration server processing an apply-for-service-lock message;
Fig. 6 is a flow chart of the arbitration server processing a refresh-service-lock message;
Fig. 7 is a flow chart of the arbitration server periodically checking the service lock refresh timestamp;
Fig. 8 is a schematic diagram of the arbitration-server-based cluster split-brain prevention device provided by the invention.
Detailed description of the embodiments
The invention is described clearly and completely below with reference to the drawings and embodiments. The embodiments described are only some of the embodiments of the invention, not all of them. All other embodiments that a person of ordinary skill in the art obtains from these embodiments without creative effort fall within the scope of protection of the invention.
When the cluster heartbeat network is interrupted, a node cannot reliably determine the state of the other nodes and of the services they run, so it easily makes a wrong decision, which produces split-brain and destroys data consistency. To solve this problem, the embodiments of the invention provide an arbitration-server-based method and device that still prevents split-brain effectively while avoiding system restart or storage isolation as far as possible.
A two-node cluster configured with an arbitration server is a representative HA cluster system. The system has two server nodes, node A and node B. The service is offered externally over the public network, and the nodes exchange node information and monitor the running state of the services on each node over a private network. For a robust heartbeat, the heartbeat network usually consists of two or more direct network cables or serial cables. The arbitration server assigns the service lock to node A. After node A has started the service successfully and the heartbeat between nodes A and B is normal, the cluster is in its normal state: nodes A and B both send cluster-state-normal messages to the arbitration server, and the arbitration server keeps the service lock unchanged.
If nodes A and B receive no heartbeat from each other within a fixed time t_heartbeat, each concludes that the heartbeat is interrupted, and the cluster splits into two sub-clusters of one node each. The service-running node A then periodically sends refresh-service-lock messages to the arbitration server to refresh the lock. Node B, which does not hold the service lock, periodically sends apply-for-service-lock messages. The arbitration server arbitrates on the messages it receives from nodes A and B. The arbitration server maintains one service lock for each service in the cluster. A node acquires the service lock before it starts the service and releases the lock after it stops the service.
A multi-node HA cluster configured with an arbitration server behaves like the two-node system. When the cluster heartbeat network is normal, every node sends refresh-service-lock messages to the arbitration server, and the arbitration server keeps the service lock unchanged. When the heartbeat network is abnormal, the cluster may split into two or more sub-clusters. The sub-cluster in which the service runs is called the service-running sub-cluster; a sub-cluster in which the service does not run is called a non-service-running sub-cluster. All nodes of the service-running sub-cluster periodically send refresh-service-lock messages to the arbitration server, the nodes of non-service-running sub-clusters periodically send apply-for-service-lock messages, and the arbitration server arbitrates on the message from each node. When the service lock is in the unlocked state, every node may compete for it, but exactly one node succeeds. That node may start the service; it becomes the service-running node, and its sub-cluster becomes the service-running sub-cluster.
Fig. 1 shows the structure of the service lock maintained by the arbitration server. Its contents include the lock name, the lock state, the refresh timestamp, the node on which the service runs, the member nodes of the service-running sub-cluster, and so on.
Figs. 2a and 2b show the contents of the refresh-service-lock message and the apply-for-service-lock message, respectively.
The embodiments described below cover both two-node and multi-node HA cluster systems.
As shown in Fig. 3, an embodiment of the invention includes a method by which a non-service-running sub-cluster on the cluster side applies to the arbitration server for the service lock:
Step 301: when the heartbeat network is interrupted, the cluster-side agent module detects that the cluster node state has changed and that this node is in a non-service-running sub-cluster. In step 302 the sub-cluster then sends an apply-for-service-lock message to the arbitration server and waits for the arbitration result.
In step 303, the non-service-running sub-cluster receives the arbitration result returned by the arbitration server.
If the lock is acquired, step 304 starts the service that corresponds to the service lock; the node becomes the service-running node, and its sub-cluster becomes the service-running sub-cluster. At the same time, the cluster-side agent modules of the other member nodes detect that they are now in the service-running sub-cluster and have become its backup nodes. From then on these backup nodes no longer send apply-for-service-lock messages to the arbitration server. They send refresh-service-lock messages instead, and report the time of each refresh-success message to the service-running node, which uses it as its own refresh-success time.
In step 303, after a node receives a lock-acquisition-failed message from the arbitration server, it may keep sending apply-for-service-lock messages for a time t_giveup. If it receives a lock-acquisition-success message within t_giveup, it begins taking over the service. If it never receives one, this shows that the service is still running in another sub-cluster or has been started by another sub-cluster; the node does not take over the service, and split-brain is avoided.
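The retry window in step 303 might look like the loop below. It is a sketch under stated assumptions: `try_takeover`, `apply_for_lock`, and `retry_interval` are hypothetical names, the source does not specify a retry interval, and the clock and sleep functions are injectable only so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def try_takeover(apply_for_lock: Callable[[], bool],
                 t_giveup: float,
                 retry_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Keep applying for the service lock for up to t_giveup seconds.

    Returns True if the lock was acquired (the caller may start the service)
    and False if the window expired (the caller must not start the service).
    """
    deadline = clock() + t_giveup
    while True:
        if apply_for_lock():        # one apply-for-service-lock round trip
            return True
        if clock() >= deadline:     # window over: the service runs elsewhere
            return False
        sleep(retry_interval)
```

Returning False rather than raising keeps the "do not take over" outcome an ordinary result, which matches the description: failing to get the lock is the expected case whenever another sub-cluster still runs the service.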
An alternative would be to judge a node's state by PINGing, over the public network, the IP address of the node whose heartbeat was lost. But when the system disallows PING (that is, it does not respond to ICMP request packets), when a switch port is damaged, during a network storm, and in similar situations, neither the node state nor the service state can be judged reliably, which is a serious safety risk. By having the arbitration server receive refresh-service-lock and apply-for-service-lock messages, the invention detects node state more accurately, and detects service state accurately as well.
As shown in Fig. 4, the flow by which the service-running sub-cluster refreshes the service lock includes:
Step 401: the service-running node sends a refresh-service-lock message to the arbitration server. On receiving the message, the arbitration server updates the service lock information from the message contents and returns a refresh-success message to the node.
In step 402, the service-running node checks whether the refresh succeeded. If it succeeded, the operation ends. If not, step 403 checks whether the difference between the current time and the time of the last successful refresh exceeds t_timeout. If it does, the service-running node concludes that the service-running sub-cluster has lost its connection to the arbitration server and has become an isolated cluster, which no longer has the arbitration function of the arbitration server.
After the service-running sub-cluster becomes an isolated cluster, step 404 handles two cases in order to keep the service continuous and reliable: (1) when the service-running sub-cluster has at least 1/2 of the original cluster's nodes, the service-running node does not stop the service and keeps serving externally; (2) when the service-running sub-cluster has fewer than 1/2 of the original cluster's nodes, the service-running node stops the service, and if the service cannot stop normally within t_giveup, the node restarts the server so that the service is finally stopped.
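Steps 402 to 404 reduce to one decision on the service-running node. The sketch below follows the "at least 1/2" wording of this embodiment (the summary states "more than 1/2"); the function name and the returned action strings are hypothetical, and the "retry" outcome is an assumption, since the source only says what happens once t_timeout is exceeded.

```python
def on_refresh_failure(now: float, last_success: float, t_timeout: float,
                       subcluster_nodes: int, cluster_nodes: int) -> str:
    """Decide what the service-running node does after a failed refresh."""
    if now - last_success <= t_timeout:
        return "retry"               # assumption: not yet isolated, keep refreshing
    # Isolated from the arbitration server: apply the 1/2-node rule (step 404).
    if 2 * subcluster_nodes >= cluster_nodes:
        return "keep_running"
    # The caller must also reboot if the stop does not finish within t_giveup.
    return "stop_service"
```

Comparing `2 * subcluster_nodes` against `cluster_nodes` avoids fractional arithmetic, so a one-node sub-cluster of a two-node cluster keeps running while a one-node sub-cluster of a three-node cluster stops.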
The refresh-success time maintained by the service-running node is in fact the latest refresh-success time across all nodes of the service-running sub-cluster. Every node of that sub-cluster periodically sends refresh-service-lock messages to the arbitration server, and the arbitration server returns refresh-success messages. When a backup node of the service-running sub-cluster receives its own refresh-success message, it reports the time to the service-running node. The service-running node compares that time with its previously recorded refresh-success time and keeps the later of the two as its refresh time.
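Keeping the later of the two times is a running maximum. A minimal sketch, with a hypothetical class name:

```python
from typing import Optional


class RefreshTracker:
    """Held by the service-running node; records the newest refresh success."""

    def __init__(self) -> None:
        self.last_success: Optional[float] = None

    def report(self, success_time: float) -> float:
        """Record a refresh-success time reported by any sub-cluster node."""
        if self.last_success is None or success_time > self.last_success:
            self.last_success = success_time
        return self.last_success
```

Because any member's successful refresh advances the recorded time, the service-running node treats the sub-cluster as connected to the arbitration server as long as at least one member can still reach it.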
In this embodiment, for a two-node cluster in which service A runs on node 1, node 1 is the service-running node for service A and node 2 is the backup node. If service B runs on node 2, node 2 is the service-running node for service B and node 1 is its backup node. The same holds for multi-node clusters: a node can be the service-running node for one service and a backup node for another.
As shown in Fig. 5, the service lock management procedure on the arbitration server side includes:
In this embodiment, when the cluster heartbeat is normal the cluster does not split into two or more sub-clusters, and the whole cluster is regarded as the largest service-running sub-cluster, containing all nodes. All cluster nodes therefore send refresh-service-lock messages to the arbitration server, which maintains the corresponding lock state and keeps updating the lock's refresh timestamp.
In this embodiment, when the cluster heartbeat network is interrupted, it is critical that the arbitration server determine whether the service has stopped. In step 501, when the arbitration server receives an apply-for-service-lock message, it knows that the heartbeat network has been interrupted and that the cluster has split into at least two sub-clusters.
In step 502, if the service lock is in the unlocked state, the arbitration server grants it: it sets the lock state to locked, sets the service-running node to the applying node, and returns a lock-acquisition-success message. In step 503, if the service lock is in the locked state, the arbitration server returns a lock-acquisition-failed message to the applying node. If the service lock is in the unknown state, then in step 504, once the lock has gone unrefreshed for longer than t_giveup, the arbitration server grants the lock: it sets the lock state to locked, sets the service-running node to the applying node, and returns a lock-acquisition-success message.
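Steps 502 to 504 can be sketched as a single handler. This is an illustration under assumptions: the lock is a plain dict with hypothetical keys, and refusing an applicant while an unknown lock has been silent for t_giveup or less is an inference, since the source states only when the grant happens.

```python
def handle_apply(lock: dict, node: str, now: float, t_giveup: float) -> bool:
    """Process an apply-for-service-lock message; True means the lock was granted."""
    state = lock["state"]
    if state == "locked":
        return False                                   # step 503: lock is held
    if state == "unknown" and now - lock["refreshed_at"] <= t_giveup:
        return False                                   # assumption: old holder may still be stopping
    # Step 502 (unlocked) or step 504 (unknown and silent for longer than t_giveup).
    lock.update(state="locked", holder=node, refreshed_at=now)
    return True
```

Running every application through one handler on one server is what makes the grant exclusive: whichever application is processed first flips the state to locked, and every later one sees locked and is refused.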
As shown in Fig. 6, the flow by which the arbitration server processes a refresh-service-lock message includes:
The arbitration server maintains one service lock for each service, and each service lock has a validity period of t_timeout. In step 601 the arbitration server receives a refresh-service-lock message, and in step 602 it updates the service lock information from the message contents. If the service-running node or the node membership has changed, the arbitration server learns this from the refresh messages and keeps the lock information up to date. The field that must be updated on every refresh message is the lock's refresh timestamp: each time the arbitration server receives a refresh message it updates the timestamp, so that it always holds the latest refresh time of the service lock.
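A refresh handler along the lines of steps 601 and 602 could look as follows. The dict keys and the reply string are hypothetical, and stamping the lock with the server's receive time rather than the timestamp carried in the message is an assumption made to sidestep clock differences between machines.

```python
def handle_refresh(locks: dict, msg: dict, now: float) -> str:
    """Process a refresh-service-lock message and update the lock record."""
    lock = locks[msg["service"]]
    lock["refreshed_at"] = now                       # updated on every refresh message
    if "service_node" in msg:                        # service-running node changed
        lock["service_node"] = msg["service_node"]
    if "members" in msg:                             # sub-cluster membership changed
        lock["members"] = list(msg["members"])
    return "refresh_ok"
```

Only the timestamp is touched unconditionally; the other fields change only when the message carries them, so a minimal refresh leaves the recorded holder and membership intact.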
As shown in Fig. 7, the flow by which the arbitration server periodically checks the service lock refresh timestamp includes:
The arbitration server periodically checks the refresh timestamp of each service lock. In step 701, if the difference between the current time and the lock's refresh timestamp exceeds t_timeout, the arbitration server considers the service-running sub-cluster to have become an isolated cluster. Because the arbitration server cannot determine the state of the service, in step 702 it sets the lock state to unknown.
If for longer than t_timeout the arbitration server receives neither refresh-service-lock nor apply-for-service-lock messages from any node, or if the arbitration server has restarted after a fault, it considers itself disconnected from all cluster nodes. It then cannot determine the state of the service: the cluster heartbeat may be uninterrupted, with the whole cluster intact and the service running normally; the cluster may have just split; or the service may have stopped long ago. In all these cases the arbitration server sets the service lock to the unknown state. When the arbitration server reconnects to cluster nodes and receives a refresh-service-lock message, it sets the lock to the locked state and updates the lock information from the message contents. When it reconnects and receives an apply-for-service-lock message in which the number of member nodes is more than 1/2 of the original cluster's node count, it sets the lock to the unlocked state after a time t_giveup.
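The recovery path from the unknown state can be sketched with two small functions. This is a hedged illustration: the function names and dict keys are hypothetical, the delayed unlock is modelled as a deadline checked by a periodic `tick`, and letting a later refresh cancel a pending unlock is an assumption the source does not spell out.

```python
def on_reconnect_message(lock: dict, kind: str, now: float, t_giveup: float,
                         member_count: int = 0, cluster_size: int = 0) -> None:
    """Handle a message that reaches the arbiter while the lock is 'unknown'."""
    if lock["state"] != "unknown":
        return
    if kind == "refresh":
        lock["state"] = "locked"            # a holder is alive after all
        lock["refreshed_at"] = now
        lock.pop("unlock_at", None)         # assumption: cancels any pending unlock
    elif kind == "apply" and 2 * member_count > cluster_size:
        # Applicant holds more than 1/2 of the nodes: unlock after t_giveup.
        lock.setdefault("unlock_at", now + t_giveup)


def tick(lock: dict, now: float) -> None:
    """Periodic check that carries out a scheduled unlock once it is due."""
    if lock["state"] == "unknown" and now >= lock.get("unlock_at", float("inf")):
        lock["state"] = "unlocked"
        del lock["unlock_at"]
```

An application from a sub-cluster with 1/2 or fewer of the nodes leaves the lock unknown, which is the conservative outcome when the arbiter cannot tell whether a larger sub-cluster is still running the service.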
Note that in this embodiment the service lock maintained by the arbitration server cannot deadlock. A deadlock here would mean that the service-running node holding the service lock has failed or died and cannot release it, so that no other node can ever acquire the lock. To prevent this, the arbitration server gives the service lock a validity period: the service-running sub-cluster that holds the lock must refresh the lock's timestamp at the arbitration server periodically, within the validity period (t_timeout). If the arbitration server receives no refresh-service-lock message from the service-running sub-cluster within the validity period, and the member node count in a received apply-for-service-lock message is more than 1/2 of the original cluster's node count, the server reclaims the lock, sets it to the unlocked state, and lets nodes compete for it again. Likewise, if the service-running sub-cluster receives no refresh-success message within the validity period (t_timeout) and its node count is less than 1/2 of the original cluster's node count, it must stop the service.
The embodiments of the invention provide an arbitration-server-based method for preventing split-brain in a high-availability cluster. After the heartbeat network is interrupted and the cluster splits into several sub-clusters, each node can still monitor the state of each service lock and can still take over and start services to serve externally, which maximizes service continuity. With the technical solution of the embodiments, the cluster can keep providing service reliably and with as little interruption as possible after a heartbeat network interruption, while avoiding system restart or shared-storage isolation as far as possible.
As shown in Fig. 8, the invention provides an arbitration-server-based device for preventing split-brain in a high-availability cluster, including:
A cluster-side agent module 801, for detecting which other member nodes are in the sub-cluster that contains this node, and for detecting whether the service runs in that sub-cluster.
In this embodiment, the cluster-side agent module 801 is a module within the cluster. It receives cluster node membership change events in real time. When the membership changes or the heartbeat is interrupted, the agent module determines whether the service is running on some node of its sub-cluster. If it is, the module continues to refresh the service lock. If it is not, the module must send an apply-for-service-lock message and attempt to take over the service.
An arbitration-server-side service lock module 802, for handling query messages and apply-for-service-lock messages from nodes of non-service-running sub-clusters, handling refresh-service-lock messages from nodes of the service-running sub-cluster, granting the service lock to the winning node, and maintaining and updating service lock information.
In this embodiment, the service lock maintenance module is the main module of the arbitration service program. It processes the messages that the arbitration server receives and produces the arbitration results. When the cluster heartbeat network is interrupted, it is critical that the module determine whether the service has stopped; it may process either refresh-service-lock messages or apply-for-service-lock messages.
In this embodiment, when processing a service lock application message, the service lock maintenance module checks
the state of the service lock. If the service lock is in the locked state, the module returns a lock-seizure failure
message to the applying node. If the service lock is in the unknown state and the number of member nodes reported in
the application message is more than 1/2 of the number of nodes in the original cluster, the arbitration server sets
the state of the service lock to locked after the t_giveup time, updates the lock holder's node name and the lock
acquisition timestamp, and declares the seizure successful.
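The application handling described above can be sketched as follows. This is a minimal sketch under stated assumptions: the `ServiceLock` record and `handle_apply` function are hypothetical names, "after the t_giveup time" is read here as "t_giveup has elapsed since the lock entered the unknown state", and the immediate grant in the unlocked state is taken from the summary of the device rather than from this paragraph.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceLock:
    state: str = "unlocked"                 # "locked", "unknown" or "unlocked"
    holder: Optional[str] = None            # name of the node holding the lock
    acquired_at: Optional[float] = None     # lock acquisition timestamp
    unknown_since: Optional[float] = None   # when the state became "unknown"


def handle_apply(lock, node, member_count, cluster_size, now, t_giveup):
    """Return True if `node` wins the service lock, False otherwise."""
    if lock.state == "locked":
        return False                        # lock-seizure failure
    if lock.state == "unknown":
        # The applicant's sub-cluster must hold a strict majority of the
        # original cluster, and t_giveup must have elapsed (assumed reading).
        has_majority = member_count * 2 > cluster_size
        waited = (lock.unknown_since is not None
                  and now - lock.unknown_since >= t_giveup)
        if not (has_majority and waited):
            return False
    lock.state = "locked"
    lock.holder = node
    lock.acquired_at = now
    lock.unknown_since = None
    return True
```

Comparing `member_count * 2 > cluster_size` keeps the majority test in integer arithmetic, so a sub-cluster holding exactly half of the nodes does not qualify.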
When processing a service lock refresh message, the service lock maintenance module updates the service lock
information according to the content of the message. Each time it processes a refresh message, it updates the refresh
timestamp of the corresponding service lock, so that the latest refresh timestamp is always maintained. The module
periodically checks the refresh timestamp of the service lock. If the difference between the current time and the
refresh timestamp exceeds t_timeout, it considers that the sub-cluster running the service has become an isolated
cluster; the arbitration server then cannot determine the state of the service, and it sets the service lock state
to unknown.
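The refresh handling and the periodic timeout check can be sketched as follows. The class `LockRecord` and the functions `handle_refresh` and `check_timeout` are hypothetical names; rejecting refreshes from a node that is not the recorded holder is an assumption added for the illustration and is not stated in the text.

```python
class LockRecord:
    """Arbitration-server-side record of one service lock (illustrative)."""

    def __init__(self, holder, now):
        self.state = "locked"
        self.holder = holder
        self.refreshed_at = now   # latest refresh timestamp


def handle_refresh(record, node, now):
    """Update the refresh timestamp; return True if the refresh was accepted."""
    # Assumption: only the current holder of a locked lock may refresh it.
    if record.state == "locked" and record.holder == node:
        record.refreshed_at = now
        return True
    return False


def check_timeout(record, now, t_timeout):
    """Periodic check: mark the lock unknown once refreshes stop for t_timeout."""
    if record.state == "locked" and now - record.refreshed_at > t_timeout:
        record.state = "unknown"
    return record.state
```

In a running server `check_timeout` would be called from a timer; here the current time is passed in explicitly so the logic can be exercised without waiting.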
An embodiment of the present invention provides a device for preventing split brain in a high-availability cluster
based on an arbitration server. When the heartbeat network is interrupted, the arbitration server receives the
messages that each node sends to it and, based on these messages, accurately maintains the service lock information.
If the state of the service lock is unlocked, the arbitration server allows nodes to compete for the service lock,
and the winning node starts the service and provides it externally. If the state of the service lock is locked, the
service is already started on some node and is being provided externally, so other nodes do not take over the
service and split brain is avoided. The technical solution provided by the embodiment of the present invention allows
the cluster, after a cluster heartbeat network interruption, to continue providing service reliably and with as
little interruption as possible, while avoiding as far as possible the need for a system reboot or shared-storage
isolation.
The method and device for preventing split brain in a high-availability cluster based on an arbitration server,
provided by the embodiments of the present invention, can be applied directly to high-availability clusters.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented
directly in hardware, in a software module executed by a processor, or in a combination of the two.
The above is only a specific embodiment of the present invention, and the protection scope of the present invention
is not limited thereto. A person skilled in the art may, following the idea of the present invention, make changes
to the specific embodiments and the scope of application. Therefore, the content of this specification should not be
construed as limiting the present invention.
Claims (2)
1. A method for preventing split brain in a high-availability cluster based on an arbitration server, characterized in that:
A cluster server node applies to the arbitration server for a service lock before starting a service, and a cluster
node that has not obtained the service lock must not start the service; when a node dies or the heartbeat fails, a
sub-cluster not running the service decides whether to take over the service by periodically applying to the
arbitration server for the service lock; if the service lock is obtained, the service is taken over, and if it is not
obtained, the service is not taken over; the service is thereby prevented from running in multiple sub-clusters at
the same time;
Wherein: the split-brain state is one in which the cluster splits into several sub-clusters that lose contact with
one another, each assuming that the other nodes are dead and attempting to take over resources from the "dead
nodes", which leads to a series of serious problems such as the service running on multiple nodes simultaneously
and corruption of shared data storage;
A service lock must be obtained before the service is started:
When a node not running the service attempts to take over the service, it periodically applies to the arbitration
server for the service lock within the t_giveup time; when the corresponding service lock on the arbitration server
is in the unlocked state, the node not running the service seizes the service lock and performs the service
takeover;
The node running the service periodically refreshes the service lock:
The sub-cluster in which the node running the service resides selects a communication node to communicate with the
arbitration server, and this node periodically sends service lock refresh messages to the arbitration server to
refresh the service lock timestamp;
A service failure causes the service to be stopped and the service lock to be released:
When the service stops on the running node because of a failure, the service lock is released back to the
arbitration server; the sub-cluster selects a backup node to apply for the service lock and take over the service;
after the backup node successfully obtains the service lock from the arbitration server, it performs the service
takeover and becomes the new service-running node; if the backup node fails to start the service, it stops the
service and releases the service lock again;
When all backup nodes in the service-running sub-cluster have successively applied for the service lock and failed to
take over the service, the sub-cluster makes no further attempt to apply for the service lock and take over the
service unless a new node state change event occurs;
Handling of loss of contact between the cluster and the arbitration server:
The service-running sub-cluster selects one node as the arbitration server communication node; when the
communication node detects that the difference between the current time and the time of the last successful service
lock refresh exceeds the predetermined t_timeout time, it considers the connection to the arbitration server lost,
and the service-running sub-cluster attempts to elect another node to communicate with the arbitration server; when
no node can communicate with the arbitration server, the service-running sub-cluster becomes an isolated cluster and
loses the arbitration function of the arbitration server;
When the service-running sub-cluster becomes an isolated cluster, if the number of nodes in the sub-cluster is less
than or equal to 1/2 of the number of nodes in the original cluster, the service-running node must stop the service;
if the service cannot be stopped normally within the t_giveup time, the service-running node performs a system
restart; the backup nodes in the service-running sub-cluster may not take over the service, and no further service
lock refresh messages are sent to the arbitration server;
When the service-running sub-cluster becomes an isolated cluster, if the number of nodes in the sub-cluster is more
than 1/2 of the number of nodes in the original cluster, the service-running node need not stop the service and
continues to provide it externally;
Service lock handling on the arbitration server side:
When the arbitration server detects that the difference between the current time and the service lock refresh
timestamp exceeds the predetermined time t_timeout, the arbitration server considers that the service-running
sub-cluster has disconnected and sets the state of the service lock to unknown;
After the t_giveup time has elapsed since the state of the service lock was set to unknown, the arbitration server
sets the state of the service lock to unlocked;
After the arbitration server has set the state of the service lock to unlocked, if it receives a new service lock
application, it allocates the service lock to the applying node.
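The timing relationship in claim 1 between the isolated sub-cluster and the arbitration server can be illustrated with a short sketch. The function names `isolated_action` and `lock_state` are hypothetical, and the sketch idealizes the server as detecting the timeout exactly when t_timeout is exceeded, whereas a real server checks periodically.

```python
def isolated_action(sub_cluster_size, cluster_size, stopped_within_giveup=True):
    """What the service-running node does once its sub-cluster is isolated."""
    if sub_cluster_size * 2 > cluster_size:
        return "keep_running"            # majority side keeps serving
    # Minority (or exactly half): stop the service, or restart the system
    # if the service could not be stopped within t_giveup.
    return "stop_service" if stopped_within_giveup else "reboot"


def lock_state(last_refresh, now, t_timeout, t_giveup):
    """Service lock state on the arbitration server as refreshes stay absent."""
    silent = now - last_refresh
    if silent <= t_timeout:
        return "locked"
    if silent <= t_timeout + t_giveup:
        return "unknown"                 # old holder may still be stopping
    return "unlocked"                    # safe to grant to a new applicant
```

Read together, the two functions show the purpose of t_giveup: the lock stays in the unknown state for the same interval that an isolated minority node is given to stop the service or restart, so the lock is not handed to a new node while the old one may still be running.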
2. The method according to claim 1, characterized by further comprising a high-availability redundancy mechanism for
the arbitration server:
To prevent the arbitration server from becoming a single point of failure, an odd number of arbitration servers is
deployed in the cluster; when the service lock is contended, the node that seizes the service lock on more than 1/2
of the arbitration servers wins and performs the service takeover.
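The majority rule of claim 2 can be sketched in a few lines. The function name `wins_lock` and the representation of the per-server results as a list of booleans are assumptions made for the illustration.

```python
def wins_lock(grants):
    """Return True if a node seized the lock on a strict majority of servers.

    grants: one boolean per arbitration server, True where this node's
        service lock application succeeded (hypothetical representation).
    """
    return sum(grants) * 2 > len(grants)
```

With an odd number of arbitration servers, two competing nodes cannot both hold the lock on more than half of them, so at most one node passes this test and starts the service.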
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310615821.1A CN103684941B (en) | 2013-11-23 | 2013-11-23 | Cluster based on arbitrating server splits brain preventing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310615821.1A CN103684941B (en) | 2013-11-23 | 2013-11-23 | Cluster based on arbitrating server splits brain preventing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103684941A CN103684941A (en) | 2014-03-26 |
CN103684941B true CN103684941B (en) | 2018-01-16 |
Family
ID=50321320
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310615821.1A Active CN103684941B (en) | 2013-11-23 | 2013-11-23 | Cluster based on arbitrating server splits brain preventing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103684941B (en) |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105450717A (en) * | 2014-09-29 | 2016-03-30 | 中兴通讯股份有限公司 | Method and device for processing brain split in cluster |
CN104469699B (en) * | 2014-11-27 | 2018-09-21 | 华为技术有限公司 | Cluster quorum method and more cluster coupled systems |
WO2016106682A1 (en) * | 2014-12-31 | 2016-07-07 | 华为技术有限公司 | Post-cluster brain split quorum processing method and quorum storage device and system |
CN105426275B (en) | 2015-10-30 | 2019-04-19 | 成都华为技术有限公司 | The method and device of disaster tolerance in dual-active group system |
US10275468B2 (en) | 2016-02-11 | 2019-04-30 | Red Hat, Inc. | Replication of data in a distributed file system using an arbiter |
CN106027634B (en) | 2016-05-16 | 2019-06-04 | 白杨 | Message port Exchange Service system |
CN106301900B (en) * | 2016-08-08 | 2019-08-23 | 华为技术有限公司 | The method and apparatus of equipment arbitration |
CN108063782A (en) * | 2016-11-08 | 2018-05-22 | 北京国双科技有限公司 | Node is delayed machine adapting method and device, node group system |
CN107528724B (en) * | 2017-07-20 | 2020-09-29 | 奇安信科技集团股份有限公司 | Optimization processing method and device for node cluster |
CN107688547B (en) * | 2017-08-23 | 2020-06-16 | 苏州浪潮智能科技有限公司 | Method and system for switching between main controller and standby controller |
CN107786374B (en) * | 2017-10-19 | 2021-02-05 | 苏州浪潮智能科技有限公司 | Oracle cluster file system and method for realizing ince thereof |
CN107918570B (en) * | 2017-10-20 | 2021-07-23 | 杭州沃趣科技股份有限公司 | Method for sharing arbitration logic disk by double-active system |
CN108134712B (en) * | 2017-12-19 | 2020-12-18 | 海能达通信股份有限公司 | Distributed cluster split brain processing method, device and equipment |
CN108600284B (en) * | 2017-12-28 | 2021-05-14 | 武汉噢易云计算股份有限公司 | Ceph-based virtual machine high-availability implementation method and system |
CN109614201B (en) * | 2018-12-04 | 2021-02-09 | 武汉烽火信息集成技术有限公司 | OpenStack virtual machine high-availability system for preventing brain cracking |
CN109684032B (en) * | 2018-12-04 | 2021-04-27 | 武汉烽火信息集成技术有限公司 | OpenStack virtual machine high-availability computing node device for preventing brain cracking and management method |
CN109634716B (en) * | 2018-12-04 | 2021-02-09 | 武汉烽火信息集成技术有限公司 | OpenStack virtual machine high-availability management end device for preventing brain cracking and management method |
CN112003916B (en) | 2020-08-14 | 2022-05-13 | 苏州浪潮智能科技有限公司 | Cluster arbitration method, system, equipment and medium based on heterogeneous storage |
CN112202601B (en) * | 2020-09-23 | 2023-03-24 | 湖南麒麟信安科技股份有限公司 | Application method of two physical node mongo clusters operated in duplicate set mode |
CN112181305B (en) * | 2020-09-30 | 2024-06-07 | 北京人大金仓信息技术股份有限公司 | Database cluster network partition selection method and device |
CN113608836A (en) * | 2021-08-06 | 2021-11-05 | 上海英方软件股份有限公司 | Cluster-based virtual machine high availability method and system |
CN114727140A (en) * | 2022-03-18 | 2022-07-08 | 广州方硅信息技术有限公司 | Live broadcast intermodal data synchronization method, server cluster and storage medium |
CN115190046B (en) * | 2022-04-13 | 2024-01-23 | 统信软件技术有限公司 | Detection method, detection device and computing equipment of server cluster |
CN115811461B (en) * | 2023-02-08 | 2023-04-28 | 湖南国科亿存信息科技有限公司 | SAN shared storage cluster brain crack prevention processing method and device and electronic equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101291243A (en) * | 2007-04-16 | 2008-10-22 | 广东省新支点技术服务有限公司 | Split brain preventing method for highly available cluster system |
CN102394914A (en) * | 2011-09-22 | 2012-03-28 | 浪潮(北京)电子信息产业有限公司 | Cluster brain-split processing method and device |
CN102402395A (en) * | 2010-09-16 | 2012-04-04 | 上海中标软件有限公司 | Quorum disk-based non-interrupted operation method for high availability system |
CN103209095A (en) * | 2013-03-13 | 2013-07-17 | 广东新支点技术服务有限公司 | Method and device for preventing split brain on basis of disk service lock |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100553920B1 (en) * | 2003-02-13 | 2006-02-24 | 인터내셔널 비지네스 머신즈 코포레이션 | Method for operating a computer cluster |
KR101001559B1 (en) * | 2008-10-09 | 2010-12-17 | 아주대학교산학협력단 | Hybrid clustering based data aggregation method for multi-target tracking in the wireless sensor network |
CN102799394B (en) * | 2012-06-29 | 2015-02-25 | 华为技术有限公司 | Method and device for realizing heartbeat services of high-availability clusters |
-
2013
- 2013-11-23 CN CN201310615821.1A patent/CN103684941B/en active Active
Non-Patent Citations (2)
Title |
---|
《The design and architecture of the Microsoft Cluster Service-a practical approach to high-availability and scalability》;W. Vogels等;《Fault-Tolerant Computing,1998. Digest of Papers. Twenty-Eighth Annual International Symposium on》;20020806;全文 * |
《高可用集群系统仲裁机构设计》;张大年;《中国优秀硕士学位论文全文数据库·信息科技辑》;20111215(第S2期);全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN103684941A (en) | 2014-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103684941B (en) | Cluster based on arbitrating server splits brain preventing method and device | |
CN103209095B (en) | Method and device for preventing split brain on basis of disk service lock | |
CN103744809B (en) | Vehicle information management system double hot standby method based on VRRP | |
CN100387017C (en) | High usable self-healing Logic box fault detecting and tolerating method for constituting multi-machine system | |
CN105471995B (en) | Extensive Web service group of planes high availability implementation method based on SOA | |
Aublin et al. | Rbft: Redundant byzantine fault tolerance | |
US6928589B1 (en) | Node management in high-availability cluster | |
US9594818B2 (en) | System and method for supporting dry-run mode in a network environment | |
US7278055B2 (en) | System and method for virtual router failover in a network routing system | |
CN103530200B (en) | A kind of server hot backup system and method | |
WO2016106682A1 (en) | Post-cluster brain split quorum processing method and quorum storage device and system | |
CN112181660A (en) | High-availability method based on server cluster | |
CN102916825A (en) | Management equipment of dual-computer hot standby system, management method and dual-computer hot standby system | |
US10728099B2 (en) | Method for processing virtual machine cluster and computer system | |
CN103647668A (en) | Host group decision system in high availability cluster and switching method for host group decision system | |
US20140095925A1 (en) | Client for controlling automatic failover from a primary to a standby server | |
CN106850255A (en) | A kind of implementation method of multi-computer back-up | |
CN111385107B (en) | Main/standby switching processing method and device for server | |
CN104980693A (en) | Media service backup method and system | |
CN105933379B (en) | A kind of method for processing business, equipment and system | |
CN108469996A (en) | A kind of system high availability method based on auto snapshot | |
CN107276839A (en) | A kind of cloud platform from monitoring method and system | |
CN106681858A (en) | Virtual machine data disaster tolerance method and management device | |
CN106844083A (en) | A kind of fault-tolerance approach and system perceived towards stream calculation system exception | |
CN113794765A (en) | Gate load balancing method and device based on file transmission |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
CB02 | Change of applicant information | ||
Address after: 510663 Guangdong Province, Guangzhou Tianhe Science Park Gaotang New District high Pu Lu No. 1021 601 Applicant after: GUANGDONG ZHONGXING NEWSTART TECHNOLOGY CO., LTD. Address before: 510663 Guangdong Province, Guangzhou Tianhe Science Park Gaotang New District high Pu Lu No. 1021 601 Applicant before: Guangdong NewStart Technology Service Ltd. |
|
GR01 | Patent grant | ||