WO2016082443A1 - Cluster arbitration method and multi-cluster coordination system - Google Patents

Cluster arbitration method and multi-cluster coordination system

Info

Publication number
WO2016082443A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
group
clusters
representative
arbitration
Prior art date
Application number
PCT/CN2015/077092
Other languages
English (en)
French (fr)
Inventor
陈晓丽
曾敬勇
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP15863647.2A (EP3214865B1)
Priority to EP18183960.6A (EP3461065B1)
Publication of WO2016082443A1
Priority to US15/606,214 (US20170270015A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023Failover techniques
    • G06F11/203Failover techniques using migration
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/06Selective distribution of broadcast services, e.g. multimedia broadcast multicast service [MBMS]; Services to user groups; One-way selective calling services
    • H04W4/08User group management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy
    • G06F11/0754Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023Failover techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2035Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant without idle spare hardware
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2048Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant where the redundant components share neither address space nor persistent storage
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0631Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/065Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/04Arrangements for maintaining operational condition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/805Real-time

Definitions

  • the present invention relates to the field of mobile communications, and in particular, to a cluster arbitration method and a multi-cluster coordination system.
  • a dual-active data center means that both data centers are in operation and can undertake services simultaneously, which improves the overall service capacity and system resource utilization of the data centers.
  • the two data centers back each other up. When one of the data centers fails, no data is lost, and services can automatically switch to the other data center.
  • a dual-active data center usually consists of a storage layer, a network layer, and an application layer. Several clusters are deployed in the dual-active data center. One part of each cluster is located in one data center, and the other part of each cluster is located in the other data center. The sub-clusters in each data center work together.
  • each cluster uses its own arbitration mechanism to arbitrate, so the arbitration results of the clusters are not necessarily consistent. That is, for some clusters the sub-cluster located in one data center may survive, while for other clusters the sub-cluster located in the other data center survives, so that service access as a whole may be interrupted with some probability.
  • the embodiment of the invention provides a cluster arbitration method, which can reduce the probability of occurrence of service interruption.
  • a first aspect of the embodiments of the present invention provides a cluster arbitration method, including:
  • the first group of clusters includes a portion of the first cluster and a portion of the second cluster
  • the second group of clusters includes another portion of the first cluster and another portion of the second cluster, the first cluster and the second cluster cooperating with each other;
  • the first group of clusters and the second group of clusters respectively determine respective preemption representatives, and the preemption representative of the first group of clusters and the preemption representative of the second group of clusters respectively perform the following steps:
  • the arbitration device is preempted, and the group cluster whose preemption representative successfully preempts the arbitration device according to the preset arbitration mechanism survives.
  • the preemption representative of the first group of clusters and the preemption representative of the second group of clusters respectively perform the following steps:
  • detecting whether the peer group cluster preempts the arbitration device within the preset time; if not, the first cluster uses the first preset mechanism to preempt the arbitration device, and the second cluster uses the second preset mechanism to preempt the arbitration device.
  • the method further includes:
  • when the preemption representative of the first group of clusters and the preemption representative of the second group of clusters both preempt the arbitration device, the preemption representative of the second group of clusters is preset to back off.
  • the preset arbitration mechanism is that the preemption representative that preempts the arbitration device first succeeds in preempting it;
  • presetting the preemption representative of the second group of clusters to back off specifically includes:
  • the first group of clusters and the second group of clusters are located in a dual-active data center, where the first group of clusters is located in one of the data centers, and the second group of clusters is located in the other data center.
  • a second aspect of the embodiments of the present invention provides a multi-cluster coordination system, including:
  • a preset arbitration mechanism is provided in the arbitration equipment;
  • the first group of clusters and the second group of clusters are respectively configured to determine respective preemption representatives when a fault is detected in the first group of clusters or the second group of clusters;
  • the preemption representative of the first group of clusters and the preemption representative of the second group of clusters are respectively used to determine whether a fault occurs in their own group cluster and, if not, to preempt the arbitration device, wherein the group cluster whose preemption representative successfully preempts the arbitration device according to the preset arbitration mechanism survives.
  • the first preset mechanism and the second preset mechanism are further disposed in the arbitration device.
  • the preemption representative of the first group of clusters and the preemption representative of the second group of clusters are respectively used to detect, when it is determined that a fault occurs in their own group cluster, whether the peer group cluster preempts the arbitration device within a preset time. If not, the first cluster uses the first preset mechanism to preempt the arbitration device, and the second cluster uses the second preset mechanism to preempt the arbitration device.
  • the preemption representative of the second group of clusters is further used to:
  • back off when the preemption representatives of the first group of clusters and the second group of clusters each determine that no fault occurs in their own group cluster and both preempt the arbitration device.
  • the preset arbitration mechanism is that the preemption representative that preempts the arbitration device first succeeds in preempting it;
  • the preemption representative of the second group of clusters is specifically configured to preempt the arbitration device after a preset time interval once it determines that no fault occurs in its group cluster.
  • the multi-cluster coordination system is a dual-active data center, where the first group of clusters is located in one of the data centers and the second group of clusters is located in the other data center.
  • the first group of clusters and the second group of clusters respectively determine their own preemption representatives to preempt the arbitration device, and all the sub-clusters in the group cluster whose representative preempts successfully survive, thus ensuring that the arbitration results of different clusters are consistent in the event of a failure, so that the surviving group cluster can continue to provide services.
  • FIG. 1 is a flow chart of an embodiment of a cluster arbitration method of the present invention
  • FIG. 2 is a schematic structural view of an embodiment of a multi-cluster coordination system according to the present invention.
  • the embodiment of the invention provides a cluster arbitration method and a multi-cluster coordination system, which are used to reduce the probability of occurrence of service interruption.
  • a cluster arbitration method in an embodiment of the present invention includes:
  • some nodes in the first cluster are set in the first group cluster, and the other nodes are set in the second group cluster, and the two sets of nodes respectively form two sub-clusters of the first cluster.
  • some nodes in the second cluster are located in the first group cluster, and the other nodes are located in the second group cluster, and the two sets of nodes respectively form two sub-clusters of the second cluster.
  • the first cluster and the second cluster work together, and the first cluster and the second cluster simultaneously undertake services and backup each other.
  • the first group of clusters and the second group of clusters are dual-active data centers.
  • the storage layers of the two data centers are each deployed with a VIS6600T.
  • the two VIS6600Ts form a VIS cluster, which provides read and write services simultaneously for the host services of the two data centers.
  • the application layer of the two data centers is deployed with an Oracle RAC cluster, where some nodes of the Oracle RAC cluster are set in one data center and the other nodes are set in the other data center.
  • the clusters in the first group cluster and the second group cluster are not limited to the first cluster and the second cluster, and may also include other clusters.
  • the first group of clusters and the second group of clusters further include a third cluster, wherein part of the nodes of the third cluster are located in the first group of clusters, and another part of the nodes are located in the second group of clusters.
  • the first clusters in the first group of clusters and the second clusters in the first group of clusters communicate with each other.
  • the first cluster in the second group of clusters and the second cluster in the second group of clusters communicate with each other.
  • the sub-cluster of the first cluster in the first group of clusters and its sub-cluster in the second group of clusters obtain each other's operating status through the cluster IP heartbeat link
  • the sub-cluster of the second cluster in the first group of clusters and its sub-cluster in the second group of clusters obtain each other's operating status through the cluster IP heartbeat link.
  • each cluster in the group cluster can determine that a fault has occurred in the group cluster.
  • when the cluster that communicates with the faulty cluster in the other group cluster cannot obtain the operating status of the faulty cluster, it can determine that the faulty cluster is faulty and send a cluster fault message to the other clusters in its group cluster.
  • when the sub-cluster of the first cluster in the first group of clusters and its sub-cluster in the second group of clusters cannot obtain each other's operating status, or the sub-cluster of the second cluster in the first group of clusters and its sub-cluster in the second group of clusters cannot obtain each other's operating status, it is also determined that a fault occurs in the first group of clusters and the second group of clusters.
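  • the heartbeat-based fault detection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `HeartbeatMonitor`, the `timeout` parameter, and the use of plain floating-point timestamps are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Hypothetical tracker of heartbeats received from peer sub-clusters."""

    timeout: float  # assumed: seconds of silence before a peer counts as unreachable
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, cluster: str, now: float) -> None:
        """Store the time at which the peer sub-cluster of `cluster` last reported."""
        self.last_seen[cluster] = now

    def unreachable(self, now: float) -> List[str]:
        """Return the clusters whose peer sub-cluster has been silent too long."""
        return sorted(c for c, t in self.last_seen.items() if now - t > self.timeout)

    def fault_detected(self, now: float) -> bool:
        """A fault is reported as soon as any peer sub-cluster is unreachable."""
        return bool(self.unreachable(now))


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("first cluster", 0.0)
monitor.record("second cluster", 0.0)
monitor.record("first cluster", 4.0)  # only the first cluster keeps reporting
print(monitor.unreachable(5.0))
```

  • in this sketch a single silent peer sub-cluster is enough to report a fault, which mirrors the "or" condition in the text; how the fault message is then propagated within a group cluster is not modeled.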
  • the first group of clusters and the second group of clusters respectively determine respective preemption representatives, and the preemption representative of the first group of clusters and the preemption representative of the second group of clusters respectively perform step 103.
  • the first group of clusters and the second group of clusters determine their respective preemption representatives according to a preset mechanism.
  • the preemption representative represents its group cluster in preempting the arbitration device. All clusters in the group cluster whose preemption representative successfully preempts the arbitration device survive and continue to provide services, and each sub-cluster in the other group cluster stops providing services.
  • there are many mechanisms for determining the preemption representative. For example, the node with the smallest node number may be selected in advance as the preemption representative, or the node with the latest startup time may be used as the preemption representative, and so on; this is not limited herein.
  • the preemption representative need not be a single node in the group cluster; it may also be a plurality of nodes or a sub-cluster, and so on, which is not limited herein.
  • the mechanism for determining the preemptive representative of the first group of clusters and the second group of clusters may be the same or different, and is not limited herein. After the first group of clusters and the second group of clusters respectively determine their respective preemptive representatives, the two preemptive representatives respectively perform step 103.
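  • the two example selection mechanisms named above (smallest node number, latest startup time) can be sketched as interchangeable policies. The `Node` type, its fields, and the function names are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Node:
    """Hypothetical description of a node in a group cluster."""

    node_id: int
    started_at: float  # assumed: startup timestamp in seconds
    cluster: str


def smallest_node_number(nodes: Sequence[Node]) -> Node:
    """Policy 1: the node with the smallest node number becomes the representative."""
    return min(nodes, key=lambda n: n.node_id)


def latest_startup(nodes: Sequence[Node]) -> Node:
    """Policy 2: the node that started most recently becomes the representative."""
    return max(nodes, key=lambda n: n.started_at)


def choose_representative(
    nodes: Sequence[Node],
    policy: Callable[[Sequence[Node]], Node] = smallest_node_number,
) -> Node:
    """Pick the preemption representative of one group cluster."""
    if not nodes:
        raise ValueError("a group cluster needs at least one node")
    return policy(nodes)


group1 = [
    Node(3, 90.0, "VIS"),
    Node(1, 50.0, "Oracle RAC"),
    Node(2, 30.0, "virtual machine"),
]
print(choose_representative(group1).node_id)
print(choose_representative(group1, latest_startup).node_id)
```

  • passing the policy as a parameter reflects the statement that the two group clusters may use the same or different mechanisms.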
  • the preemption representative determines whether there is a fault in its group cluster.
  • if not, the preemption representative preempts the arbitration device according to the preset arbitration mechanism.
  • there are a variety of preset arbitration mechanisms, which are prior art and will not be described here.
  • the group cluster whose preemption representative successfully preempts the arbitration device continues to survive, while the other group cluster "suicides" and stops providing business services.
  • if the preemption representative finds that there is a fault in its group cluster, it abandons the preemption.
  • the first group of clusters and the second group of clusters respectively determine their preemption representatives to preempt the arbitration device, and all the sub-clusters in the group cluster whose representative preempts successfully survive, thereby ensuring that the arbitration results of different clusters are consistent when a failure occurs, so that the surviving group cluster can continue to provide services.
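  • step 103 can be sketched as follows, assuming a first-come-first-served arbitration device (one of the mechanisms the text mentions later). The class `ArbitrationDevice`, the function `arbitrate`, and the returned strings are hypothetical names chosen for this illustration.

```python
import threading
from typing import Optional


class ArbitrationDevice:
    """Hypothetical arbitration device: the first group to preempt it wins."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.owner: Optional[str] = None

    def preempt(self, group: str) -> bool:
        """Try to claim the device; return True only for the winning group."""
        with self._lock:
            if self.owner is None:
                self.owner = group
            return self.owner == group


def arbitrate(group: str, has_internal_fault: bool, device: ArbitrationDevice) -> str:
    """Run step 103 for the preemption representative of one group cluster."""
    if has_internal_fault:
        # A representative that finds a fault in its own group abandons preemption.
        return "withdraw"
    # Every sub-cluster in the group follows this single, shared outcome.
    return "survive" if device.preempt(group) else "stop"


device = ArbitrationDevice()
print(arbitrate("Group1", False, device))
print(arbitrate("Group2", False, device))
```

  • because one representative decides for the whole group cluster, all sub-clusters in a group share the same result, which is the consistency property the text emphasizes.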
  • step 102 of the cluster arbitration method of the present invention the preemption representative of the first group of clusters and the preemptive representatives of the second group of clusters respectively perform step 104.
  • it is detected whether the peer group cluster succeeds in preempting the arbitration device within the preset time. If not, the first cluster uses the first preset mechanism to preempt the arbitration device, and the second cluster uses the second preset mechanism to preempt the arbitration device.
  • after determining that there is a fault in its own group cluster, each preemption representative detects the preemption behavior of the peer group cluster, that is, whether the preemption representative of the peer group cluster successfully preempts the arbitration device. If not, it means that there are faults in both group clusters. Therefore, the first cluster in the first group of clusters and the first cluster in the second group of clusters preempt the arbitration device by using the first preset mechanism, and the second cluster in the first group of clusters and the second cluster in the second group of clusters preempt the arbitration device by using the second preset mechanism.
  • the first preset mechanism and the second preset mechanism are the original arbitration mechanisms of the first cluster and the second cluster, respectively, and the first preset mechanism and the second preset mechanism may be the same or different.
  • the clusters can still do their best to ensure business continuity.
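  • the fallback of step 104 can be sketched as a small decision function. The function name `step_104`, its parameters, and the representation of each cluster's original arbitration mechanism as a callable are assumptions made for this example.

```python
from typing import Callable, Dict


def step_104(
    own_group_faulty: bool,
    peer_preempted_in_time: bool,
    original_mechanisms: Dict[str, Callable[[], str]],
) -> Dict[str, str]:
    """Return the per-cluster fallback results, or {} when no fallback is needed."""
    if not own_group_faulty:
        return {}  # this representative takes part in group-level preemption instead
    if peer_preempted_in_time:
        return {}  # the peer group cluster won, so the group-level result stands
    # Both group clusters are faulty: each cluster uses its own original mechanism.
    return {cluster: run() for cluster, run in original_mechanisms.items()}


mechanisms = {
    "first cluster": lambda: "first preset mechanism",
    "second cluster": lambda: "second preset mechanism",
}
print(step_104(True, False, mechanisms))
```

  • the empty result in the first two branches means that the group-level arbitration outcome applies and no cluster runs its own mechanism.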
  • which of the first group of clusters and the second group of clusters continues to provide services, and which "suicides" and stops serving, is determined by which group cluster's preemption representative successfully preempts the arbitration device.
  • when the preemption representative of the first group of clusters and the preemption representative of the second group of clusters both preempt the arbitration device, the preemption representative of the second group of clusters is preset to back off.
  • if the preset arbitration mechanism is that the preemption representative that preempts the arbitration device first succeeds, then for the case where the two preemption representatives both preempt the arbitration device, the preemption representative of the second group of clusters is preset to wait for a preset time after determining that there is no fault in its group cluster, and only then preempt the arbitration device. In this way, it can be ensured that the preemption representative of the first group of clusters preempts the arbitration device first.
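  • the effect of the preset back-off can be illustrated with logical timestamps instead of real waiting. The function `first_to_preempt` and its two dictionaries are hypothetical; the sketch assumes a first-come-first-served arbitration device and a back-off longer than the difference in readiness times.

```python
from typing import Dict


def first_to_preempt(ready_at: Dict[str, float], backoff: Dict[str, float]) -> str:
    """Return the group whose preemption attempt reaches the device first.

    ready_at: time at which each representative confirmed its group is fault-free.
    backoff:  preset wait applied before preempting (0 for groups not listed).
    """
    attempts = {group: t + backoff.get(group, 0.0) for group, t in ready_at.items()}
    return min(attempts, key=attempts.get)


ready = {"Group1": 0.2, "Group2": 0.0}  # Group2 happens to finish its check earlier
print(first_to_preempt(ready, {}))
print(first_to_preempt(ready, {"Group2": 1.0}))
```

  • without the back-off the winner depends on timing; with the second group waiting a preset time, the first group wins whenever both are fault-free, provided the wait exceeds the timing difference.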
  • data center 1 and data center 2 respectively deploy a VIS6600T, and the two VIS6600Ts form a VIS cluster.
  • an Oracle RAC cluster is set up in the application layer of the dual-active data center, where some nodes of the Oracle RAC cluster are located in data center 1 and the other nodes are located in data center 2.
  • the virtual machine servers of the two data centers also form a virtual machine cluster, and the core switches of the two data centers form a core switch cluster.
  • an arbitration device is also provided in the dual-active data center.
  • a cluster IP heartbeat link and an FC data transmission network are used between the two data centers of the dual-active data center to transmit control information and configuration information and to synchronize data.
  • the dual-active data center presets the VIS cluster, Oracle RAC cluster, virtual machine cluster, and core switch in data center 1 to belong to Group1.
  • the VIS cluster, Oracle RAC cluster, virtual machine cluster, and core switch in data center 2 belong to Group2.
  • data centers 1 and 2 respectively select the node with the smallest node number among them as the preemption representative.
  • the preemption representatives of data centers 1 and 2 respectively determine whether there is a fault in the clusters in the data center where they are located. If there is a fault in the clusters in one data center and no fault in the clusters in the other data center, the preemption representative of the fault-free data center preempts the arbitration device and the preemption succeeds.
  • the preemption of the quorum device in the preemption of the two data centers is successful.
  • each cluster in the data center of the successful preemption representative survives, so that this data center continues to provide business services, while the clusters in the other data center "suicide" and all stop providing business services.
  • each preemption representative also checks whether the preemption representative of the peer group cluster successfully preempts the arbitration device within the preset time. When it determines that the peer has not preempted successfully, the VIS cluster, the Oracle RAC cluster, the virtual machine cluster, and the core switch cluster in the two data centers each use their own original arbitration mechanism to preempt the arbitration device.
  • the multi-cluster cooperation system 200 in the embodiment of the present invention includes:
  • the first group of clusters 201 includes a portion 211 of the first cluster and a portion 221 of the second cluster
  • the second group of clusters 202 includes another portion 212 of the first cluster and another portion 222 of the second cluster
  • the first cluster and the second cluster cooperate with each other
  • the arbitration device 203 is provided with a preset arbitration mechanism.
  • the first group of clusters 201 and the second group of clusters 202 are respectively configured to determine respective preemption representatives when a fault is detected in the first group of clusters 201 or the second group of clusters 202;
  • the preemption representative of the first group of clusters 201 and the preemption representative of the second group of clusters 202 are respectively used to determine whether a fault occurs in their own group cluster and, if not, to preempt the arbitration device 203, wherein the group cluster whose preemption representative successfully preempts the arbitration device according to the preset arbitration mechanism survives.
  • the first group of clusters and the second group of clusters respectively determine their preemption representatives to preempt the arbitration device, and all the sub-clusters in the group cluster whose representative preempts successfully survive, so that the arbitration results of different clusters are consistent in the event of a failure and the surviving group cluster can continue to provide services.
  • the arbitration device 203 further includes a first preset mechanism and a second preset mechanism;
  • the preemption representative of the first group of clusters 201 and the preemption representative of the second group of clusters 202 are respectively used to detect, when it is determined that a fault occurs in their own group cluster, whether the peer group cluster preempts the arbitration device within a preset time; if not, the first cluster uses the first preset mechanism to preempt the arbitration device, and the second cluster 22 uses the second preset mechanism to preempt the arbitration device.
  • the preemption representative of the second group of clusters 202 is further configured to back off when the preemption representative of the first group of clusters 201 and the preemption representative of the second group of clusters 202 both preempt the arbitration device.
  • the preset arbitration mechanism is that the preemption representative that preempts the arbitration device first succeeds in preempting it; the preemption representative of the second group of clusters 202 is used to preempt the arbitration device after a preset time interval once it determines that there is no fault in its group cluster.
  • the multi-cluster coordination system is a dual-active data center, wherein the first group of clusters is located in one of the data centers, and the second group of clusters is located in the other data center.
  • the multi-cluster coordination system is a dual-active data center.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division.
  • there may be another division manner; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
  • the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
  • a number of instructions are included to cause a computer device (which may be a personal computer, server, or network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present invention.
  • the foregoing storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Hardware Redundancy (AREA)

Abstract

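Embodiments of the present invention disclose a cluster arbitration method and a multi-cluster cooperation system. The method in the embodiments of the present invention includes: detecting whether a fault occurs in a first cluster group or a second cluster group, where the first cluster group includes one part of a first cluster and one part of a second cluster, the second cluster group includes the other part of the first cluster and the other part of the second cluster, and the first cluster and the second cluster cooperate with each other; and when a fault is detected, determining, by the first cluster group and the second cluster group, their respective preemption representatives, where the preemption representative of the first cluster group and the preemption representative of the second cluster group each perform the following steps: determining whether a fault occurs in the cluster group in which the representative is located; and if not, preempting an arbitration device, where the cluster group of the preemption representative that successfully preempts the arbitration device according to a preset arbitration mechanism survives. The present invention can reduce the probability of service access interruption.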

Description

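Cluster Arbitration Method and Multi-Cluster Cooperation System
Technical Field
The present invention relates to the field of mobile communications, and in particular to a cluster arbitration method and a multi-cluster cooperation system.
Background
An active-active data center means that two data centers are both in the running state and can carry services at the same time, which improves the overall service capability and system resource utilization of the data centers. The two data centers back each other up: when one of them fails, no data is lost and services are automatically switched to the other data center.
An active-active data center usually consists of a storage layer, a network layer, and an application layer. Several clusters are deployed in it; one part of each cluster is located on the side of one data center and the other part on the side of the other data center, and the sub-clusters within each data center work in cooperation with one another.
However, the clusters in an active-active data center use different arbitration mechanisms. When a fault occurs, each cluster arbitrates with its own mechanism, so the arbitration results of the clusters are not necessarily consistent: for some clusters the sub-cluster located in one data center may survive, while for other clusters the sub-cluster located in the other data center survives. As a result, access to the entire service may, with some probability, be interrupted.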
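Summary
Embodiments of the present invention provide a cluster arbitration method that can reduce the probability of service access interruption.
A first aspect of the embodiments of the present invention provides a cluster arbitration method, including:
detecting whether a fault occurs in a first cluster group or a second cluster group, where the first cluster group includes one part of a first cluster and one part of a second cluster, the second cluster group includes the other part of the first cluster and the other part of the second cluster, and the first cluster and the second cluster cooperate with each other;
when a fault is detected, determining, by the first cluster group and the second cluster group, their respective preemption representatives, where the preemption representative of the first cluster group and the preemption representative of the second cluster group each perform the following steps:
determining whether a fault occurs in the cluster group in which the representative is located; and
if not, preempting an arbitration device, where the cluster group of the preemption representative that successfully preempts the arbitration device according to a preset arbitration mechanism survives.
With reference to the first aspect of the embodiments of the present invention, in a first implementation of the first aspect, the preemption representative of the first cluster group and the preemption representative of the second cluster group each further perform the following step:
if it is determined that a fault occurs in the cluster group in which the representative is located, detecting whether the peer cluster group preempts the arbitration device within a preset time; and if not, the first cluster preempts the arbitration device by using a first preset mechanism, and the second cluster preempts the arbitration device by using a second preset mechanism.
With reference to the first aspect of the embodiments of the present invention, in a second implementation of the first aspect, after the determining whether a fault occurs in the cluster group in which the representative is located, the method further includes:
when the preemption representative of the first cluster group and the preemption representative of the second cluster group each determine that no fault occurs in their own cluster group, both preemption representatives preempt the arbitration device, and the preemption representative of the second cluster group is preset to yield.
With reference to the second implementation of the first aspect of the embodiments of the present invention, in a third implementation of the first aspect, the preset arbitration mechanism is that the preemption representative that seizes the arbitration device first preempts it successfully;
presetting the preemption representative of the second cluster group to yield specifically includes:
presetting the preemption representative of the second cluster group to wait for a preset interval after determining that no fault occurs in its own cluster group, and only then preempt the arbitration device.
With reference to the first aspect of the embodiments of the present invention, in a fourth implementation of the first aspect, the first cluster group and the second cluster group are located in an active-active data center, where the first cluster group is located in one data center and the second cluster group is located in the other data center.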
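A second aspect of the embodiments of the present invention provides a multi-cluster cooperation system, including:
a first cluster group, a second cluster group, and an arbitration device, where the first cluster group includes one part of a first cluster and one part of a second cluster, the second cluster group includes the other part of the first cluster and the other part of the second cluster, the first cluster and the second cluster cooperate with each other, and a preset arbitration mechanism is provided in the arbitration device;
the first cluster group and the second cluster group are each configured to determine their respective preemption representatives when a fault is detected in the first cluster group and the second cluster group;
the preemption representative of the first cluster group and the preemption representative of the second cluster group are each configured to determine whether a fault occurs in the cluster group in which the representative is located and, if not, to preempt the arbitration device, where the cluster group of the preemption representative that successfully preempts the arbitration device according to the preset arbitration mechanism survives.
With reference to the second aspect of the embodiments of the present invention, in a first implementation of the second aspect, a first preset mechanism and a second preset mechanism are further provided in the arbitration device;
the preemption representative of the first cluster group and the preemption representative of the second cluster group are each further configured to detect, when it is determined that a fault occurs in the cluster group in which the representative is located, whether the peer cluster group preempts the arbitration device within a preset time; if not, the first cluster preempts the arbitration device by using the first preset mechanism, and the second cluster preempts the arbitration device by using the second preset mechanism.
With reference to the second aspect of the embodiments of the present invention, in a second implementation of the second aspect, the preemption representative of the second cluster group is further configured to yield when the preemption representative of the first cluster group and the preemption representative of the second cluster group each determine that no fault occurs in their own cluster group and both preemption representatives preempt the arbitration device.
With reference to the second implementation of the second aspect of the embodiments of the present invention, in a third implementation of the second aspect, the preset arbitration mechanism is that the preemption representative that seizes the arbitration device first preempts it successfully;
the preemption representative of the second cluster group is specifically configured to wait for a preset interval after determining that no fault occurs in its own cluster group, and only then preempt the arbitration device.
With reference to the second aspect of the embodiments of the present invention, in a fourth implementation of the second aspect, the multi-cluster cooperation system is an active-active data center, where the first cluster group is located in one data center and the second cluster group is located in the other data center.
It can be seen from the foregoing technical solutions that the embodiments of the present invention have the following advantages:
In the embodiments of the present invention, when a fault occurs, the first cluster group and the second cluster group each determine a preemption representative to preempt the arbitration device, and all sub-clusters in the cluster group that preempts successfully survive. This ensures that the arbitration results of the different clusters are consistent when a fault occurs, so that the surviving cluster group can continue to provide services.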
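Brief Description of the Drawings
FIG. 1 is a flowchart of an embodiment of the cluster arbitration method of the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of the multi-cluster cooperation system of the present invention.
Detailed Description
Embodiments of the present invention provide a cluster arbitration method and a multi-cluster cooperation system, which are used to reduce the probability of service access interruption.
To help a person skilled in the art better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments. Evidently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The terms "include" and "have" and any variants thereof in the specification, claims, and accompanying drawings of the present invention are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, system, product, or device.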
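Referring to FIG. 1, a cluster arbitration method in an embodiment of the present invention includes:
101. Detect whether a fault occurs in the first cluster group and the second cluster group, where the first cluster group includes one part of a first cluster and one part of a second cluster, the second cluster group includes the other part of the first cluster and the other part of the second cluster, and the first cluster and the second cluster cooperate with each other.
In this embodiment, some nodes of the first cluster are placed in the first cluster group and the other nodes in the second cluster group; these two sets of nodes form the two sub-clusters of the first cluster. Likewise, some nodes of the second cluster are placed in the first cluster group and the others in the second cluster group, forming the two sub-clusters of the second cluster. The first cluster and the second cluster work in cooperation, and the first cluster group and the second cluster group carry services at the same time and back each other up.
As a specific example, the first cluster group and the second cluster group are the two data centers of an active-active data center. One VIS6600T is deployed at the storage layer of each of the two data centers, and the two VIS6600Ts form a VIS cluster that provides read and write services to the host services of both data centers at the same time. An Oracle RAC cluster is deployed at the application layer of the two data centers, with some of its nodes located in one data center and the others in the other data center.
Note that the clusters in the first cluster group and the second cluster group are not limited to the first cluster and the second cluster and may include other clusters. For example, the two cluster groups may further include a third cluster, some nodes of which are located in the first cluster group and the others in the second cluster group.
The sub-cluster of the first cluster in the first cluster group and the sub-cluster of the second cluster in the first cluster group communicate with each other. Similarly, the sub-cluster of the first cluster in the second cluster group and the sub-cluster of the second cluster in the second cluster group communicate with each other. In addition, the two sub-clusters of the first cluster, one in each cluster group, periodically obtain each other's operating status over the cluster IP heartbeat link, as do the two sub-clusters of the second cluster.
When a cluster in one cluster group fails, the other clusters in that cluster group cannot communicate with it, so the clusters in that cluster group can determine that a fault has occurred within their own cluster group. In the other cluster group, the cluster that communicates with the failed cluster can no longer obtain its operating status; it can then determine that the failed cluster has failed and send a message about the failure to the other clusters in its own cluster group.
Alternatively, when the cluster IP heartbeat link fails, so that the two sub-clusters of the first cluster cannot obtain each other's operating status, or the two sub-clusters of the second cluster cannot obtain each other's operating status, it can also be determined that a fault has occurred in the first cluster group and the second cluster group.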
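The heartbeat-based detection in step 101 can be illustrated with a minimal sketch. The class name `HeartbeatMonitor`, the timeout value, and the peer labels are assumptions made for illustration; the patent does not specify how the heartbeat exchange is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks when each remote sub-cluster last reported its operating status."""

    timeout_s: float                                # hypothetical heartbeat timeout
    last_seen: dict = field(default_factory=dict)   # peer name -> time of last heartbeat

    def record(self, peer: str, now: float) -> None:
        """Note that a heartbeat from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def silent_peers(self, now: float) -> list:
        """Peers whose last heartbeat is older than the timeout."""
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout_s)

    def fault_detected(self, now: float) -> bool:
        """A silent peer means either the peer or the heartbeat link has failed."""
        return bool(self.silent_peers(now))


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.record("VIS@DC2", now=0.0)   # hypothetical peer labels
monitor.record("RAC@DC2", now=2.0)
print(monitor.silent_peers(now=4.0))
print(monitor.fault_detected(now=4.0))
```

A missed heartbeat alone cannot tell a failed peer from a failed link, which is consistent with the text treating both cases as "a fault has occurred" and leaving the decision to the arbitration steps that follow.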
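102. When a fault is detected, the first cluster group and the second cluster group determine their respective preemption representatives, and the preemption representative of the first cluster group and the preemption representative of the second cluster group each perform step 103.
When it is determined that a fault has occurred, the first cluster group and the second cluster group determine their respective preemption representatives according to a preset mechanism. The preemption representative preempts the arbitration device on behalf of its cluster group. All clusters in the cluster group of the representative that seizes the arbitration device survive and continue to provide services, while all sub-clusters in the other cluster group stop providing services.
There may be multiple mechanisms for determining the preemption representative. For example, it may be preset that the node with the smallest node number is selected as the preemption representative, or that the node with the latest startup time is selected, and so on; this is not limited here. Alternatively, the preemption representative need not be a single node in the cluster group and may instead be multiple nodes, a sub-cluster, or the like; this is not limited here.
The first cluster group and the second cluster group may use the same mechanism or different mechanisms to determine the preemption representative; this is not limited here. After the two cluster groups have determined their respective preemption representatives, the two representatives each perform step 103.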
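The two example selection rules named above (smallest node number, latest startup time) can be sketched as follows. The `Node` type and its fields are hypothetical; the patent only names the rules and leaves the choice open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int        # node number within the cluster group
    start_time: float   # startup time, in arbitrary units


def by_smallest_node_id(nodes):
    """Select the node with the smallest node number as the representative."""
    return min(nodes, key=lambda n: n.node_id)


def by_latest_start(nodes):
    """Select the node that started most recently as the representative."""
    return max(nodes, key=lambda n: n.start_time)


group1 = [Node(3, 90.0), Node(1, 50.0), Node(2, 5.0)]
print(by_smallest_node_id(group1).node_id)
print(by_latest_start(group1).node_id)
```

Either rule is deterministic, so every node in a cluster group that applies it to the same membership list arrives at the same representative without further coordination.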
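103. Determine whether a fault occurs in the cluster group in which the representative is located; if not, preempt the arbitration device, where the cluster group of the preemption representative that successfully preempts the arbitration device according to the preset arbitration mechanism survives.
All sub-clusters in the cluster group of the successful preemption representative survive and continue to provide services, and the sub-clusters within a cluster group work in cooperation with one another. If a fault within that cluster group prevented some of its sub-clusters from providing services, the service would be interrupted anyway. Therefore, before preempting the arbitration device, each preemption representative first determines whether a fault has occurred in its own cluster group.
After confirming that there is no fault in its own cluster group, the preemption representative preempts the arbitration device according to the preset arbitration mechanism. Various preset arbitration mechanisms exist; they are prior art and are not described here. The cluster group of the representative that successfully preempts the arbitration device continues to survive, while the other cluster group "commits suicide" and stops providing services.
If a preemption representative finds that a fault has occurred in its own cluster group, it withdraws from preemption.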
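Step 103 can be sketched as below, assuming a first-come-first-served arbitration mechanism, which is only one of the possible preset mechanisms. `ArbitrationDevice`, `try_preempt`, and `run_step_103` are hypothetical names for illustration and do not correspond to any real product API.

```python
import threading


class ArbitrationDevice:
    """First-come-first-served arbiter: the first group to ask becomes the owner."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def try_preempt(self, group: str) -> bool:
        """Return True if `group` owns the device after this call."""
        with self._lock:
            if self._owner is None:
                self._owner = group
            return self._owner == group

    @property
    def owner(self):
        return self._owner


def run_step_103(group: str, local_fault: bool, device: ArbitrationDevice) -> str:
    """Outcome for one preemption representative."""
    if local_fault:
        return "withdrawn"      # a faulty group does not take part in preemption
    return "survive" if device.try_preempt(group) else "stop"


device = ArbitrationDevice()
print(run_step_103("group1", local_fault=False, device=device))
print(run_step_103("group2", local_fault=False, device=device))
```

Because every cluster in a group follows the outcome of its single representative, the clusters of one group all survive or all stop together, which is the consistency property the method aims for.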
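In this embodiment of the present invention, when a fault occurs, the first cluster group and the second cluster group each determine a preemption representative to preempt the arbitration device, and all sub-clusters in the cluster group that preempts successfully survive. This ensures that the arbitration results of the different clusters are consistent when a fault occurs, so that the surviving cluster group can continue to provide services.
However, although the probability is small, both preemption representatives may find a fault in their own cluster groups, so that neither takes part in preemption. Therefore, preferably, in step 102 of the cluster arbitration method of the present invention, the preemption representative of the first cluster group and the preemption representative of the second cluster group each further perform step 104.
104. If it is determined that a fault occurs in the cluster group in which the representative is located, detect whether the peer cluster group successfully preempts the arbitration device within a preset time; if not, the first cluster preempts the arbitration device by using a first preset mechanism, and the second cluster preempts the arbitration device by using a second preset mechanism.
When the preemption representative of a cluster group determines that a fault has occurred in its own cluster group, it withdraws from preemption and also checks whether the preemption representative of the peer cluster group successfully preempts the arbitration device within the preset time. If not, faults have occurred in both cluster groups. In that case, the part of the first cluster in the first cluster group and the part of the first cluster in the second cluster group preempt the arbitration device by using the first preset mechanism, and the part of the second cluster in the first cluster group and the part of the second cluster in the second cluster group preempt the arbitration device by using the second preset mechanism. The first preset mechanism and the second preset mechanism are the original arbitration mechanisms of the first cluster and the second cluster respectively, and they may be the same or different.
In this way, even when neither the first cluster group nor the second cluster group can survive as a whole, each cluster can still make its best effort to keep services running.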
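The fallback in step 104 can be sketched as a small decision function. The function name `step_104`, the action strings, and the dictionary of per-cluster mechanisms are assumptions for illustration; how the representative observes the peer's success within the preset time is left open by the patent and is passed in here as a boolean.

```python
def step_104(local_fault: bool, peer_succeeded_in_time: bool, native_mechanisms: dict) -> dict:
    """Map each cluster in this group to the action it takes under step 104.

    native_mechanisms maps a cluster name to the name of its original
    arbitration mechanism (the "first/second preset mechanism" in the text).
    """
    if not local_fault:
        return {}   # step 104 does not apply; the representative follows step 103
    if peer_succeeded_in_time:
        # The peer group won the arbitration device, so this group stops serving.
        return {cluster: "stop" for cluster in native_mechanisms}
    # Neither representative succeeded: fall back to per-cluster arbitration.
    return {cluster: "preempt via " + mech for cluster, mech in native_mechanisms.items()}


mechanisms = {
    "first cluster": "first preset mechanism",
    "second cluster": "second preset mechanism",
}
print(step_104(True, False, mechanisms))
```

In the fallback branch the consistency guarantee is given up on purpose: each cluster arbitrates on its own so that at least part of the service has a chance to continue.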
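In this embodiment, when a link fault or another fault occurs and the sub-clusters in the first cluster group and the sub-clusters in the second cluster group are each still able to survive, which cluster group continues to provide services and which one "commits suicide" and stops serving depends on which cluster group's preemption representative successfully preempts the arbitration device.
In practice, it may also be preset which cluster group survives preferentially in this situation. For example, the first cluster group may be preset to survive preferentially; then, when the preemption representatives of both cluster groups preempt the arbitration device, the preemption representative of the second cluster group yields, to ensure that the preemption representative of the first cluster group can successfully preempt the arbitration device.
As a specific example, suppose the preset arbitration mechanism is that whichever representative seizes the arbitration device first in time wins it. Then, for the case where both preemption representatives preempt the arbitration device, the preemption representative of the second cluster group is preset to wait for a preset time after determining that no fault has occurred in its own cluster group before it preempts the arbitration device. This ensures that the preemption representative of the first cluster group seizes the arbitration device first.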
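The yield-by-delay rule can be illustrated with logical timestamps instead of real sleeping. The function names and the numeric values are hypothetical; the patent only states that the second cluster group waits for a preset time before preempting.

```python
def arrival_time(check_done_at: float, yield_delay: float = 0.0) -> float:
    """Time at which a representative reaches the arbitration device."""
    return check_done_at + yield_delay


def winner(arrivals: dict) -> str:
    """First-come-first-served: the earliest arrival seizes the device."""
    return min(arrivals, key=arrivals.get)


# group2 finishes its own fault check earlier, but yields for 0.5 time units.
arrivals = {
    "group1": arrival_time(1.2),
    "group2": arrival_time(1.0, yield_delay=0.5),
}
print(winner(arrivals))
```

Under this reading, the preset wait only guarantees the outcome if it is longer than the largest gap by which the second group could finish its fault check ahead of the first; that condition is an inference from the example, not something the patent states.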
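For ease of understanding, the cluster arbitration method of the embodiments of the present invention is described below using a practical application scenario.
At the storage layer of the active-active data center, one VIS6600T is deployed in each of data center 1 and data center 2, and the two VIS6600Ts form a VIS cluster. An Oracle RAC cluster is deployed at the application layer of the active-active data center, with some of its nodes located at data center 1 and the others at data center 2. The virtual machine servers of the two data centers also form a virtual machine cluster, and the core switches of the two data centers form a core switch cluster. An arbitration device is also provided in the active-active data center.
The two data centers of the active-active data center use the cluster IP heartbeat link and an FC data transmission network to carry control information, configuration information, and data synchronization.
The active-active data center is preconfigured so that the VIS cluster, Oracle RAC cluster, virtual machine cluster, and core switches in data center 1 belong to Group1, and the VIS cluster, Oracle RAC cluster, virtual machine cluster, and core switches in data center 2 belong to Group2.
When the cluster IP heartbeat link fails, data centers 1 and 2 each select the node with the smallest node number among their own nodes as the preemption representative. The preemption representatives of data centers 1 and 2 each determine whether a fault has occurred in any of the clusters in their own data center. If a fault has occurred among the clusters in one data center and none has occurred among the clusters in the other, the preemption representative of the fault-free data center preempts the arbitration device, and the preemption succeeds.
If no fault has occurred in the clusters of either data center, whichever of the two preemption representatives seizes the arbitration device first preempts it successfully. The clusters in the data center of the successful representative continue to survive, so that this data center continues to provide services, while the clusters in the other data center "commit suicide" and all stop providing services.
If the preemption representatives of the two data centers each detect a fault among the clusters in their own data center, each representative also checks whether the preemption representative of the peer cluster group successfully preempts the arbitration device within the preset time. On determining that the peer has not succeeded, the VIS cluster, Oracle RAC cluster, virtual machine cluster, and core switch cluster in the two data centers each preempt the arbitration device by using their own original arbitration mechanisms.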
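The three cases of this scenario can be collected into one decision function. This is a sketch under the assumption of a first-come-first-served arbitration device; the function name `arbitrate`, the parameter `first_to_arrive`, and the returned strings are hypothetical.

```python
def arbitrate(dc1_fault: bool, dc2_fault: bool, first_to_arrive: str = "DC1") -> str:
    """Outcome of the scenario after the cluster IP heartbeat link has failed.

    dc1_fault / dc2_fault: whether the representative found a fault among the
    clusters in its own data center. first_to_arrive: which representative
    reaches the arbitration device first when both take part.
    """
    if not dc1_fault and not dc2_fault:
        return first_to_arrive + " survives"     # both preempt; the earliest wins
    if dc1_fault and dc2_fault:
        return "per-cluster arbitration"         # each cluster uses its own mechanism
    return "DC2 survives" if dc1_fault else "DC1 survives"


for case in [(False, False), (True, False), (False, True), (True, True)]:
    print(case, "->", arbitrate(*case))
```

Only the last case leaves room for inconsistent results between clusters, and according to the description it is expected to be rare.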
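The cluster arbitration method in the embodiments of the present invention is described above; the multi-cluster cooperation system in the embodiments of the present invention is described below. Referring to FIG. 2, a multi-cluster cooperation system 200 in an embodiment of the present invention includes:
a first cluster group 201, a second cluster group 202, and an arbitration device 203, where the first cluster group 201 includes one part 211 of a first cluster and one part 221 of a second cluster, the second cluster group 202 includes the other part 212 of the first cluster and the other part 222 of the second cluster, the first cluster and the second cluster cooperate with each other, and a preset arbitration mechanism is provided in the arbitration device 203.
The first cluster group 201 and the second cluster group 202 are each configured to determine their respective preemption representatives when a fault is detected in the first cluster group 201 and the second cluster group 202;
the preemption representative of the first cluster group 201 and the preemption representative of the second cluster group 202 are each configured to determine whether a fault occurs in the cluster group in which the representative is located and, if not, to preempt the arbitration device 203, where the cluster group of the preemption representative that successfully preempts the arbitration device according to the preset arbitration mechanism survives.
In this embodiment of the present invention, when a fault occurs, the first cluster group and the second cluster group each determine a preemption representative to preempt the arbitration device, and all sub-clusters in the cluster group that preempts successfully survive. This ensures that the arbitration results of the different clusters are consistent when a fault occurs, so that the surviving cluster group can continue to provide services.
Preferably, a first preset mechanism and a second preset mechanism are further provided in the arbitration device 203;
the preemption representative of the first cluster group 201 and the preemption representative of the second cluster group 202 are each further configured to detect, when it is determined that a fault occurs in the cluster group in which the representative is located, whether the peer cluster group preempts the arbitration device within a preset time; if not, the first cluster preempts the arbitration device by using the first preset mechanism, and the second cluster 22 preempts the arbitration device by using the second preset mechanism.
Preferably, the preemption representative of the second cluster group 202 is further configured to yield when the preemption representative of the first cluster group 201 and the preemption representative of the second cluster group 202 both preempt the arbitration device.
Preferably, the preset arbitration mechanism is that the preemption representative that seizes the arbitration device first preempts it successfully; the preemption representative of the second cluster group 202 is configured to wait for a preset interval after determining that no fault occurs in its own cluster group, and only then preempt the arbitration device.
Preferably, the multi-cluster cooperation system is an active-active data center, where the first cluster group is located in one data center and the second cluster group is located in the other data center.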
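For ease of understanding, the multi-cluster cooperation system of the embodiments of the present invention is described below using a practical application scenario.
In this embodiment, the multi-cluster cooperation system is an active-active data center. At the storage layer of the active-active data center, one VIS6600T is deployed in each of data center 1 and data center 2, and the two VIS6600Ts form a VIS cluster. An Oracle RAC cluster is deployed at the application layer of the active-active data center, with some of its nodes located at data center 1 and the others at data center 2. The virtual machine servers of the two data centers also form a virtual machine cluster, and the core switches of the two data centers form a core switch cluster. An arbitration device is also provided in the active-active data center.
The two data centers of the active-active data center use the cluster IP heartbeat link and an FC data transmission network to carry control information, configuration information, and data synchronization.
The active-active data center is preconfigured so that the VIS cluster, Oracle RAC cluster, virtual machine cluster, and core switches in data center 1 belong to Group1, and the VIS cluster, Oracle RAC cluster, virtual machine cluster, and core switches in data center 2 belong to Group2.
When the cluster IP heartbeat link fails, data centers 1 and 2 each select the node with the smallest node number among their own nodes as the preemption representative. The preemption representatives of data centers 1 and 2 each determine whether a fault has occurred in any of the clusters in their own data center. If a fault has occurred among the clusters in one data center and none has occurred among the clusters in the other, the preemption representative of the fault-free data center preempts the arbitration device, and the preemption succeeds.
If no fault has occurred in the clusters of either data center, whichever of the two preemption representatives seizes the arbitration device first preempts it successfully. The clusters in the data center of the successful representative continue to survive, so that this data center continues to provide services, while the clusters in the other data center "commit suicide" and all stop providing services.
If the preemption representatives of the two data centers each detect a fault among the clusters in their own data center, each representative also checks whether the preemption representative of the peer cluster group successfully preempts the arbitration device within the preset time. On determining that the peer has not succeeded, the VIS cluster, Oracle RAC cluster, virtual machine cluster, and core switch cluster in the two data centers each preempt the arbitration device by using their own original arbitration mechanisms.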
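A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing embodiments are merely intended to describe the technical solutions of the present invention, not to limit them. Although the present invention is described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.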

Claims (10)

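  1. A cluster arbitration method, comprising:
    detecting whether a fault occurs in a first cluster group or a second cluster group, wherein the first cluster group comprises one part of a first cluster and one part of a second cluster, the second cluster group comprises the other part of the first cluster and the other part of the second cluster, and the first cluster and the second cluster cooperate with each other;
    when a fault is detected, determining, by the first cluster group and the second cluster group, their respective preemption representatives, wherein the preemption representative of the first cluster group and the preemption representative of the second cluster group each perform the following steps:
    determining whether a fault occurs in the cluster group in which the representative is located; and
    if not, preempting an arbitration device, wherein the cluster group of the preemption representative that successfully preempts the arbitration device according to a preset arbitration mechanism survives.
  2. The cluster arbitration method according to claim 1, wherein the preemption representative of the first cluster group and the preemption representative of the second cluster group each further perform the following step:
    if it is determined that a fault occurs in the cluster group in which the representative is located, detecting whether the peer cluster group preempts the arbitration device within a preset time; and if not, the first cluster preempts the arbitration device by using a first preset mechanism, and the second cluster preempts the arbitration device by using a second preset mechanism.
  3. The cluster arbitration method according to claim 1, wherein after the determining whether a fault occurs in the cluster group in which the representative is located, the method further comprises:
    when the preemption representative of the first cluster group and the preemption representative of the second cluster group each determine that no fault occurs in their own cluster group, both preemption representatives preempt the arbitration device, and the preemption representative of the second cluster group is preset to yield.
  4. The cluster arbitration method according to claim 3, wherein the preset arbitration mechanism is that the preemption representative that seizes the arbitration device first preempts it successfully;
    the presetting of the preemption representative of the second cluster group to yield specifically comprises:
    presetting the preemption representative of the second cluster group to wait for a preset interval after determining that no fault occurs in its own cluster group, and only then preempt the arbitration device.
  5. The cluster arbitration method according to claim 1, wherein the first cluster group and the second cluster group are located in an active-active data center, the first cluster group being located in one data center and the second cluster group being located in the other data center.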
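  6. A multi-cluster cooperation system, comprising:
    a first cluster group, a second cluster group, and an arbitration device, wherein the first cluster group comprises one part of a first cluster and one part of a second cluster, the second cluster group comprises the other part of the first cluster and the other part of the second cluster, the first cluster and the second cluster cooperate with each other, and a preset arbitration mechanism is provided in the arbitration device;
    the first cluster group and the second cluster group are each configured to determine their respective preemption representatives when a fault is detected in the first cluster group and the second cluster group;
    the preemption representative of the first cluster group and the preemption representative of the second cluster group are each configured to determine whether a fault occurs in the cluster group in which the representative is located and, if not, to preempt the arbitration device, wherein the cluster group of the preemption representative that successfully preempts the arbitration device according to the preset arbitration mechanism survives.
  7. The multi-cluster cooperation system according to claim 6, wherein
    a first preset mechanism and a second preset mechanism are further provided in the arbitration device;
    the preemption representative of the first cluster group and the preemption representative of the second cluster group are each further configured to detect, when it is determined that a fault occurs in the cluster group in which the representative is located, whether the peer cluster group preempts the arbitration device within a preset time; if not, the first cluster preempts the arbitration device by using the first preset mechanism, and the second cluster preempts the arbitration device by using the second preset mechanism.
  8. The multi-cluster cooperation system according to claim 6, wherein
    the preemption representative of the second cluster group is further configured to yield when the preemption representative of the first cluster group and the preemption representative of the second cluster group each determine that no fault occurs in their own cluster group and both preemption representatives preempt the arbitration device.
  9. The multi-cluster cooperation system according to claim 8, wherein the preset arbitration mechanism is that the preemption representative that seizes the arbitration device first preempts it successfully;
    the preemption representative of the second cluster group is specifically configured to wait for a preset interval after determining that no fault occurs in its own cluster group, and only then preempt the arbitration device.
  10. The multi-cluster cooperation system according to claim 6, wherein
    the multi-cluster cooperation system is an active-active data center, the first cluster group being located in one data center and the second cluster group being located in the other data center.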
PCT/CN2015/077092 2014-11-27 2015-04-21 Cluster arbitration method and multi-cluster cooperation system WO2016082443A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP15863647.2A EP3214865B1 (en) 2014-11-27 2015-04-21 Cluster arbitration method and multi-cluster coordination system
EP18183960.6A EP3461065B1 (en) 2014-11-27 2015-04-21 Cluster arbitration method and multi-cluster cooperation system
US15/606,214 US20170270015A1 (en) 2014-11-27 2017-05-26 Cluster Arbitration Method and Multi-Cluster Cooperation System

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410705888.9A CN104469699B (zh) 2014-11-27 2014-11-27 Cluster arbitration method and multi-cluster cooperation system
CN201410705888.9 2014-11-27

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/606,214 Continuation US20170270015A1 (en) 2014-11-27 2017-05-26 Cluster Arbitration Method and Multi-Cluster Cooperation System

Publications (1)

Publication Number Publication Date
WO2016082443A1 true WO2016082443A1 (zh) 2016-06-02

Family

ID=52914920

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/077092 WO2016082443A1 (zh) 2014-11-27 2015-04-21 Cluster arbitration method and multi-cluster cooperation system

Country Status (4)

Country Link
US (1) US20170270015A1 (zh)
EP (2) EP3214865B1 (zh)
CN (1) CN104469699B (zh)
WO (1) WO2016082443A1 (zh)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
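CN104469699B (zh) * 2014-11-27 2018-09-21 华为技术有限公司 Cluster arbitration method and multi-cluster cooperation system
WO2017015961A1 (zh) 2015-07-30 2017-02-02 华为技术有限公司 Arbitration method, apparatus and system for an active-active data center
CN105426275B (zh) 2015-10-30 2019-04-19 成都华为技术有限公司 Method and apparatus for disaster recovery in an active-active cluster system
CN107147511A (zh) * 2016-03-01 2017-09-08 深圳市深信服电子科技有限公司 Data center control method and apparatus
CN108063787A (zh) * 2017-06-26 2018-05-22 杭州沃趣科技股份有限公司 Method for implementing an active-active architecture based on a distributed consensus state machine
CN109947591B (zh) * 2017-12-20 2023-03-24 腾讯科技(深圳)有限公司 Remote database disaster recovery system, and deployment method and deployment apparatus therefor
CN110535714B (zh) 2018-05-25 2023-04-18 华为技术有限公司 Arbitration method and related apparatus
CN110830324B (zh) * 2019-10-28 2021-09-03 烽火通信科技股份有限公司 Method and apparatus for detecting data center network connectivity, and electronic device
CN112463669B (zh) * 2020-11-23 2022-12-09 苏州浪潮智能科技有限公司 Storage arbitration management method, system, terminal and storage medium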

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8108715B1 (en) * 2010-07-02 2012-01-31 Symantec Corporation Systems and methods for resolving split-brain scenarios in computer clusters
US20120197822A1 (en) * 2011-01-28 2012-08-02 Oracle International Corporation System and method for using cluster level quorum to prevent split brain scenario in a data grid cluster
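CN103684941A (zh) * 2013-11-23 2014-03-26 广东新支点技术服务有限公司 Arbitration-server-based cluster split-brain prevention method and apparatus
CN104158707A (zh) * 2014-08-29 2014-11-19 杭州华三通信技术有限公司 Method and apparatus for detecting and handling cluster split-brain
CN104469699A (zh) * 2014-11-27 2015-03-25 华为技术有限公司 Cluster arbitration method and multi-cluster cooperation system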

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6279032B1 (en) * 1997-11-03 2001-08-21 Microsoft Corporation Method and system for quorum resource arbitration in a server cluster
US6393485B1 (en) * 1998-10-27 2002-05-21 International Business Machines Corporation Method and apparatus for managing clustered computer systems
US7139925B2 (en) * 2002-04-29 2006-11-21 Sun Microsystems, Inc. System and method for dynamic cluster adjustment to node failures in a distributed data system
US8145938B2 (en) * 2009-06-01 2012-03-27 Novell, Inc. Fencing management in clusters
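CN101702721B (zh) * 2009-10-26 2011-08-31 北京航空航天大学 Reconfigurable method for a multi-cluster system
CN102394807B (zh) * 2011-08-23 2015-03-04 京北方信息技术股份有限公司 Process engine load-balancing cluster system and method with decentralized, autonomous scheduling
US8650281B1 (en) * 2012-02-01 2014-02-11 Symantec Corporation Intelligent arbitration servers for network partition arbitration
CN103813369A (zh) * 2012-11-13 2014-05-21 北京信威通信技术股份有限公司 Distributed backup method for telecommunication switching equipment
CN103209095B (zh) * 2013-03-13 2017-05-17 广东中兴新支点技术有限公司 Method and apparatus for split-brain prevention based on a disk service lock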

Also Published As

Publication number Publication date
US20170270015A1 (en) 2017-09-21
EP3214865B1 (en) 2018-11-14
EP3461065B1 (en) 2020-07-29
CN104469699A (zh) 2015-03-25
EP3461065A1 (en) 2019-03-27
EP3214865A4 (en) 2017-10-18
EP3214865A1 (en) 2017-09-06
CN104469699B (zh) 2018-09-21

Similar Documents

Publication Publication Date Title
WO2016082443A1 (zh) Cluster arbitration method and multi-cluster cooperation system
US11818212B2 (en) Storage area network attached clustered storage system
US11163653B2 (en) Storage cluster failure detection
US10020980B2 (en) Arbitration processing method after cluster brain split, quorum storage apparatus, and system
EP3627767B1 (en) Fault processing method and device for nodes in cluster
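JP6382454B2 (ja) Distributed storage and replication system and method
CN106330475B (zh) Method and apparatus for managing active and standby nodes in a communication system, and high-availability cluster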
US20140173330A1 (en) Split Brain Detection and Recovery System
WO2017067484A1 (zh) Virtualized data center scheduling system and method
US20150339200A1 (en) Intelligent disaster recovery
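CN105704187A (zh) Method and apparatus for handling cluster split-brain
CN105490847A (zh) Method for real-time detection and handling of node faults in a private cloud storage system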
US11544162B2 (en) Computer cluster using expiring recovery rules
US10645163B2 (en) Site-aware cluster management
CN114301763A (zh) Distributed cluster fault handling method and system, electronic device, and storage medium
CN110266795A (zh) Control method based on the OpenStack platform
US11947431B1 (en) Replication data facility failure detection and failover automation
CN117560268A (zh) Cluster management method and related apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15863647

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

REEP Request for entry into the european phase

Ref document number: 2015863647

Country of ref document: EP