KR20040075174A

KR20040075174A - Method for preventing data damage in high availability system

Info

Publication number: KR20040075174A
Application number: KR1020030010640A
Authority: KR
Inventors: 김상헌
Original assignee: 엘지엔시스(주)
Priority date: 2003-02-20
Filing date: 2003-02-20
Publication date: 2004-08-27

Abstract

PURPOSE: A method for preventing data loss in a high availability system is provided. It prevents the data loss that occurs when a network partition causes several nodes to run the same service at once, and it secures maximum availability on the partitioned network by diversifying the heartbeat channels and by using an algorithm that manages node membership and selects a subgroup. CONSTITUTION: The failure state of the system is checked through the heartbeats. The failure state, together with information on the subgroups into which the system has been split according to the heartbeat states, is stored on a shared disk using a designated heartbeat. The optimal heartbeat other than the designated heartbeat is selected, and the stored information and the service state information are transmitted over it. Based on the transmitted information, the subgroup and heartbeat that guarantee maximum availability are selected, and the requested service is performed.

Description

Method for preventing data damage in high availability system

The present invention prevents data corruption caused by network partitioning in a high availability system composed of multiple nodes.

FIG. 1 shows an example configuration of a general high availability (HA) system 100 made up of multiple server systems (nodes). The HA system 100 of FIG. 1 runs two services, Task1 and Task2, on Node A and Node C respectively, and is configured so that Node B takes over the service (failover) when Node A or Node C fails and can no longer perform it.

FIG. 2 shows how the HA system 100 recovers a service when a failure such as a system crash or a kernel panic occurs. When Node A fails, Node B detects the failure through the heartbeat and runs the service that Node A had been performing.

In this way, when a failure such as a system crash or a kernel panic stops a system completely so that it can no longer perform the requested service, the HA system 100 recovers and processes the service on another system, so that service handling resumes quickly.

The HA system 100 exchanges state information over heartbeat channels, which are the key communication channels for detecting the failure of each node. The heartbeat types mainly used in the HA system 100 are two or more private network heartbeats connected through Ethernet cards, the public network heartbeat used to provide services, a serial heartbeat connected to a serial port, and a disk heartbeat over the link to the shared disk. If the HA system 100 uses only one of these four heartbeat types, an inaccurate judgment can make the system behave abnormally.

FIG. 3 shows a case in which an HA system 100 that uses only the Ethernet heartbeat, out of the four types, behaves abnormally because that heartbeat is cut off. When the heartbeat that monitors Node A is broken and communication becomes impossible, Node B concludes that Node A is down and starts the service Node A was running. Task1 then runs on both nodes at the same time, which can corrupt the data on the disk used by the application. This happens when a failure in the heartbeat used by the HA system 100 partitions the network; the single HA system 100 group then splits into multiple subgroups.

FIG. 4 shows a case in which an HA system 100 that uses both the Ethernet heartbeat and the disk heartbeat relies on the disk heartbeat to keep a service from running on two nodes at once. When the Ethernet heartbeat fails, the HA daemon of Node A records Node A's state information in a specific area of the shared disk through the disk heartbeat. Node B reads Node A's state information from the shared disk and, if Node A is still alive, does not start Task1, which Node A is running.
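The check Node B makes in FIG. 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `may_take_over`, the map from node to last-write time, and the staleness `timeout` are hypothetical, since the text only says that Node B does not take over while Node A is still alive.

```python
def may_take_over(owner, disk_status, now, timeout):
    """Decide whether a standby node may start the service owned by `owner`.

    Hypothetical sketch: called after the Ethernet heartbeat to `owner` is lost.
    `disk_status` maps each node to the time of its last write to the
    shared-disk status area. A recent write means the owner is still alive
    (only the network link failed), so taking over would run the service twice.
    """
    # Assumption: a node that has never written a status is treated as down.
    last_write = disk_status.get(owner, float("-inf"))
    return now - last_write > timeout


# Node A last wrote at t=100 s; the Ethernet heartbeat is lost at t=102 s.
print(may_take_over("A", {"A": 100.0}, now=102.0, timeout=5.0))  # False: A is alive
print(may_take_over("A", {"A": 100.0}, now=110.0, timeout=5.0))  # True: A is silent
```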

However, the prior art described above prevents HA daemon malfunctions only by monitoring between the node that runs a service and a standby node, so it has limited ability to solve the problems that arise when the network is partitioned across many nodes.

In addition, the prior art mainly addresses simple network partitions involving few nodes. When a network partition splits an HA system with many nodes into multiple subgroups, it cannot prevent the corruption of node and service state information or of data.

Accordingly, the present invention was devised to solve these problems. Its object is to provide a method for preventing data corruption in a high availability system that prevents the data corruption caused when a network partition leads multiple nodes to run the same service, and that secures maximum availability on the partitioned network through diversified heartbeat channels and an algorithm for managing node membership and selecting a subgroup.

FIG. 1 shows an example configuration of a general high availability (HA) system 100 made up of multiple server systems (nodes);

FIG. 2 shows the service recovery process when a failure occurs in the HA system 100;

FIG. 3 shows a case in which an HA system 100 using only the Ethernet heartbeat behaves abnormally because the heartbeat is cut off;

FIG. 4 shows a case in which an HA system 100 using the Ethernet heartbeat and the disk heartbeat uses the disk heartbeat to keep a service from running on two nodes at once;

FIG. 5 shows an example configuration of an HA system 100 implementing the method for preventing data corruption in a high availability system according to the present invention;

FIG. 6 shows an example illustrating the algorithm that prevents data corruption from network partitioning by managing each node's heartbeats and membership in the HA system 100 of FIG. 5;

FIGS. 7 and 8 show an example illustrating the algorithm that manages membership information when a network partition occurs; and

FIG. 9 shows the HA system 100 reconfigured after being split into two subgroups.

※ Explanation of reference numerals for the main parts of the drawings

100: high availability system (HA system)

To achieve the above object, the method for preventing data corruption according to the present invention applies to a high availability system that consists of multiple server systems (nodes) and performs requested services using multiple heartbeats. The method comprises: a first step of checking the failure state of the system through the heartbeats; a second step of storing on a shared disk, using a designated heartbeat, the failure state found by the check together with information on each subgroup into which the system has been split according to the heartbeat states; a third step of selecting the optimal heartbeat other than the designated heartbeat and transmitting the stored information and service state information over it; and a fourth step of selecting, based on the transmitted information, the subgroup and heartbeat that guarantee maximum availability and performing the requested service.

An embodiment of the method for preventing data corruption in a high availability system according to the present invention is described in detail below with reference to the accompanying drawings.

FIG. 5 shows an example configuration of a high availability system (HA system) 100 implementing the method according to the present invention. Because the invention targets HA systems 100 with many nodes, the HA system 100 of FIG. 5 uses three heartbeat types: the private network heartbeat, the public network heartbeat, and the disk heartbeat.

The information carried over these three heartbeat types is as follows.

A heartbeat is basically a channel for checking each node of the HA system 100 for system failures (health check) and for transmitting the state information of the HA system 100. The disk heartbeat, however, has different properties from the network heartbeats and is used only as a channel for checking nodes for system failures.

FIG. 6 is an example illustrating the algorithm that prevents data corruption from network partitioning by managing each node's heartbeats and membership in the HA system 100. The HA daemon of each node checks the failure state of the systems by using broadcast and ping on the network heartbeats HB0, HB1, and HB2. The HA daemon also uses HB3 (the disk heartbeat) to record its own system failure state in a specific disk area and to determine the failure state of the other systems.
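The disk heartbeat (HB3) amounts to each node writing its own record into a fixed area of the shared disk and reading the records of the other nodes. The sketch below assumes a hypothetical layout of one 16-byte slot per node (a timestamp plus an integer state code); the patent does not specify the on-disk format, and `io.BytesIO` merely stands in for the shared disk region.

```python
import io
import struct
import time

# Hypothetical slot layout: little-endian float64 timestamp + int64 state code.
SLOT_FMT = "<dq"
SLOT_SIZE = struct.calcsize(SLOT_FMT)  # 16 bytes


def write_status(area, slot, state, now=None):
    """Write this node's (timestamp, state) into its own slot of the status area."""
    area.seek(slot * SLOT_SIZE)
    area.write(struct.pack(SLOT_FMT, time.time() if now is None else now, state))
    # On a real shared disk the OS caches would also have to be flushed or
    # bypassed (e.g. os.fsync on a real file); that is omitted in this sketch.
    area.flush()


def read_status(area, slot):
    """Read another node's (timestamp, state) from its slot."""
    area.seek(slot * SLOT_SIZE)
    return struct.unpack(SLOT_FMT, area.read(SLOT_SIZE))


area = io.BytesIO(bytes(SLOT_SIZE * 6))         # six nodes, all slots zeroed
write_status(area, slot=0, state=1, now=100.0)  # node A reports state 1
print(read_status(area, slot=0))  # (100.0, 1)
print(read_status(area, slot=3))  # (0.0, 0): node D has not written yet
```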

The HA daemon of each node records, in memory and in a specific disk area in the form shown in FIG. 6, the information on the groups into which the system has been split, based on the system failure states and heartbeat states it observes through the heartbeats. It then selects the first heartbeat in the 'GOOD' state other than the disk heartbeat (HB3), here HB0, and transmits node and service state information over it.
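The choice of the primary heartbeat described above (also stated in claim 6) can be sketched as a scan over the heartbeats in a fixed order. The state names follow the text ('GOOD', 'Invalid'); the function name and the use of `None` to signal that no network heartbeat is left are assumptions for illustration.

```python
from enum import Enum
from typing import Optional


class HBState(Enum):
    GOOD = "GOOD"
    INVALID = "Invalid"


def select_primary_heartbeat(states, order, disk_heartbeat="HB3") -> Optional[str]:
    """Return the first GOOD heartbeat in `order`, never the disk heartbeat.

    The disk heartbeat only performs health checks, so it is skipped when
    choosing the channel for node and service state information.
    """
    for hb in order:
        if hb != disk_heartbeat and states.get(hb) is HBState.GOOD:
            return hb
    return None  # no usable network heartbeat is left


order = ["HB0", "HB1", "HB2", "HB3"]
states = {hb: HBState.GOOD for hb in order}
print(select_primary_heartbeat(states, order))  # HB0
states["HB0"] = HBState.INVALID                 # HB0 fails
print(select_primary_heartbeat(states, order))  # HB1
```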

FIGS. 7 and 8 are examples illustrating the algorithm that manages membership information when a network partition occurs. In the six-node HA system 100 of FIG. 7, the heartbeat networks HB0, HB1, and HB2 fail in that order, so that some nodes can no longer communicate with each other. FIG. 8 shows how the membership information changes as these heartbeat network failures occur.

When HB0 fails, the HA daemon records the state of HB0 as 'Invalid' and selects the next working heartbeat (HB1) so that the HA system 100 keeps operating normally. It also identifies the subgroups 'ABC' and 'DEF' produced by the HB0 failure and records them in memory.
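The text does not say how a node works out the subgroups on a failed heartbeat network. One plausible approach, assumed here, treats each heartbeat network as a graph of node pairs that still answer broadcast/ping and takes its connected components as the subgroups. The function name and the node pairs below are hypothetical, chosen to reproduce the 'ABC'/'DEF' split caused by the HB0 failure.

```python
from collections import defaultdict


def subgroups_on_heartbeat(registration_order, links):
    """Split nodes into the subgroups that can still reach each other on one heartbeat.

    `links` lists node pairs whose heartbeats are still answered on this network;
    each connected component becomes one subgroup, written in registration order.
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for start in registration_order:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first search over reachable nodes
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        groups.append("".join(n for n in registration_order if n in component))
    return groups


hb0_links = [("A", "B"), ("B", "C"), ("D", "E"), ("E", "F")]  # C-D link lost
print(subgroups_on_heartbeat("ABCDEF", hb0_links))  # ['ABC', 'DEF']
```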

The HA daemon repeats the same process when HB1 and HB2 fail. When all network heartbeats have failed and the node and service state information of the HA system 100 can no longer be exchanged, the representative nodes of the subgroups (nodes A, C, D, and F, as in (4) of FIG. 8) record the subgroup information of all network heartbeats in a specific area of the shared disk through HB3. Identifying and recording the subgroup information must satisfy the following conditions:

1. The representative node of each subgroup is the member of that subgroup that was registered first in the HA system 100.

2. Within the subgroups identified for each heartbeat, every node is a member of exactly one subgroup.
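The two conditions can be checked, and the representatives chosen, as sketched below. The membership table is an assumed reconstruction of FIG. 8: HB0 splits the nodes into 'ABC' and 'DEF' as stated in the text, and HB1 and HB2 are taken to split them into 'ABCDE'/'F' and 'AB'/'CDEF', which is consistent with the final 'ABCDE'/'F' result and with the representatives A, C, D, and F named in the text. The function name is hypothetical.

```python
def build_disk_record(registration_order, membership):
    """Check the subgroup conditions and pick a representative per subgroup.

    `membership` maps each heartbeat to its subgroups (strings of node names).
    Condition 2: on every heartbeat, each node is in exactly one subgroup.
    Condition 1: the representative is the member registered first.
    """
    rank = {node: i for i, node in enumerate(registration_order)}
    record = {}
    for hb, groups in membership.items():
        members = [node for group in groups for node in group]
        if sorted(members) != sorted(registration_order):
            raise ValueError(f"{hb}: each node must be in exactly one subgroup")
        record[hb] = {group: min(group, key=rank.__getitem__) for group in groups}
    return record


membership = {  # assumed reconstruction of FIG. 8
    "HB0": ["ABC", "DEF"],
    "HB1": ["ABCDE", "F"],
    "HB2": ["AB", "CDEF"],
}
record = build_disk_record("ABCDEF", membership)
print(sorted({rep for reps in record.values() for rep in reps.values()}))
# ['A', 'C', 'D', 'F']
```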

다음으로, 상기 HA 데몬은 공유디스크에 기록된 맴버쉽 정보를 바탕으로 분할된 상기 HA 시스템(100)을 어떻게 재구성할 것인지를 결정한다.Next, the HA daemon determines how to reconfigure the partitioned HA system 100 based on the membership information recorded on the shared disk.

In the example of FIG. 8, the HA system 100 has been split into the subgroups 'AB', 'ABC', 'ABCDE', 'DEF', 'CDEF', and 'F', and the HA daemon applies the following criteria to decide which subgroup continues and which heartbeat is used to transmit state information:

1. Select the subgroup with the largest number of member nodes, together with its heartbeat.

2. If the subgroups have the same number of member nodes, select the subgroup, and its heartbeat, whose heartbeat splits the system into fewer subgroups.

3. If the above criteria are equal, select the subgroup, and its heartbeat, that contains the node running the service recorded as having priority when the HA system was configured.

4. If the above criteria are still equal, decide by the order of the heartbeats associated with the subgroups; in the example of the present invention, priority is given in the order HB0, HB1, HB2.
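The four criteria form a lexicographic comparison, which the sketch below expresses as a sort key for Python's `min`. It uses an assumed reconstruction of the FIG. 8 membership (HB0: 'ABC'/'DEF', HB1: 'ABCDE'/'F', HB2: 'AB'/'CDEF'); the function name and the `priority_nodes` parameter, standing for the nodes that run priority services, are hypothetical. If all four criteria tie, this sketch simply keeps the first candidate, a case the text does not cover.

```python
def choose_subgroup(membership, priority_nodes=(), hb_order=("HB0", "HB1", "HB2")):
    """Return (subgroup, heartbeat) chosen by the four criteria, in order:

    1. more member nodes;
    2. fewer subgroups produced by that subgroup's heartbeat;
    3. contains a node running a priority service;
    4. earlier heartbeat in hb_order.
    """
    candidates = [(group, hb) for hb in hb_order for group in membership.get(hb, [])]

    def rank(candidate):
        group, hb = candidate
        has_priority = any(node in group for node in priority_nodes)
        # Smaller tuples win: False sorts before True, so priority groups come first.
        return (-len(group), len(membership[hb]), not has_priority, hb_order.index(hb))

    return min(candidates, key=rank)


membership = {"HB0": ["ABC", "DEF"], "HB1": ["ABCDE", "F"], "HB2": ["AB", "CDEF"]}
print(choose_subgroup(membership))  # ('ABCDE', 'HB1')
```

With this membership the largest subgroup, 'ABCDE' on HB1, wins on the first criterion alone, which agrees with the result the text gives for FIGS. 7 and 8.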

Applying these criteria to the example of FIGS. 7 and 8, the system is split into the subgroups 'ABCDE' and 'F', and HB1 is used as the primary heartbeat for transmitting state information.

FIG. 9 shows the HA system 100 reconfigured after being split into two subgroups. Because each subgroup operates as a separate HA system, any service configured to fail over between the two subgroups has its failover restricted when the HA system 100 is reconfigured, as with Task (Oracle) in FIG. 9.
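One way to picture this failover restriction is to drop every configured failover target that now lies in a different subgroup from the node running the service. The sketch assumes that reading; the function name and the service table (a hypothetical 'Oracle' service on node E whose standby is node F) are invented for illustration and are not taken from FIG. 9.

```python
def restrict_failover(services, subgroup_of):
    """Keep only failover targets that share the running node's subgroup.

    services: service name -> (running node, configured failover nodes)
    subgroup_of: node -> subgroup it belongs to after reconfiguration
    Each subgroup now runs as a separate HA system, so a target across the
    partition must not take the service over.
    """
    return {
        name: [t for t in targets if subgroup_of[t] == subgroup_of[node]]
        for name, (node, targets) in services.items()
    }


subgroup_of = {n: ("F" if n == "F" else "ABCDE") for n in "ABCDEF"}
services = {"Oracle": ("E", ["F"]), "Task1": ("A", ["B", "F"])}
print(restrict_failover(services, subgroup_of))
# {'Oracle': [], 'Task1': ['B']}
```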

The preferred embodiments of the present invention described above are disclosed for purposes of illustration, and those skilled in the art may improve, modify, replace, or add various other embodiments within the technical spirit and scope of the present invention as set out in the appended claims.

The method for preventing data corruption in a high availability system according to the present invention, as described above, can prevent data corruption caused by network partitioning, and it secures the stability of the high availability system by using redundant heartbeat channels simultaneously.

In addition, by managing node membership for each heartbeat, the method determines the subgroup that guarantees maximum availability even when a failure partitions the network, so that the high availability system can continue to operate. Node membership management also prevents inconsistency and corruption of node and service state information in a partitioned network, and the method works effectively even when a high availability system with many nodes is split into multiple subgroups by a network partition, making it a useful and efficient invention.

Claims

A method for preventing data corruption in a high availability system that consists of multiple server systems (nodes) and performs requested services using multiple heartbeats, the method comprising: a first step of checking the failure state of the system through the heartbeats;

a second step of storing on a shared disk, using a designated heartbeat, the failure state of the system found by the check and information on each subgroup into which the system has been split according to the state of the heartbeats;

a third step of selecting the optimal heartbeat other than the designated heartbeat and transmitting the stored information and service state information; and

a fourth step of selecting, based on the transmitted information, the subgroup and heartbeat that guarantee maximum availability and performing the requested service.

The method of claim 1,

wherein the multiple heartbeats are a private network heartbeat, a public network heartbeat, and a disk heartbeat.

The method of claim 2,

wherein the designated heartbeat is the disk heartbeat.

The method of claim 1,

wherein the first to fourth steps are performed by the high availability system daemon of each node.

The method of claim 4, wherein

the high availability system daemon of each node checks the failure state of the system by using broadcast and ping.

The method of claim 1,

wherein, in the third step, the first heartbeat in the GOOD state among the heartbeats other than the designated heartbeat is selected as the optimal heartbeat.

The method of claim 1,

wherein, in the fourth step, the subgroup with the largest number of member nodes and its associated heartbeat are selected.

The method of claim 1,

wherein, in the fourth step, if the subgroups have the same number of member nodes, the subgroup whose associated heartbeat splits the system into the smallest number of subgroups is selected together with that heartbeat.