KR20040051042A - Balk process method for high availability system

Info

Publication number: KR20040051042A
Application number: KR1020020078867A
Authority: KR
Inventors: 이지은
Original assignee: 엘지엔시스(주)
Priority date: 2002-12-11
Filing date: 2002-12-11
Publication date: 2004-06-18

Abstract

PURPOSE: A failure handling method for a high availability system is provided that precisely detects and handles a node failure: when a failure occurs, all nodes re-examine the state of the failed node and share the results. CONSTITUTION: When a failure of a specific node is detected, every node re-examines the failure state on its own. Every node then transmits its state information to all other nodes over a private network. After collecting the state information of the other nodes, each node combines it to confirm the failure state of the specific node.

Description

Fault handling in high availability systems {BALK PROCESS METHOD FOR HIGH AVAILABILITY SYSTEM}

The present invention relates to a failure handling method for a high availability system, and more particularly to a method in which, when a failure occurs in a specific node, all nodes re-examine the state of the failed node and share the results, so that the failure of the specific node is precisely detected and handled.

In a high availability system, when a node included in the high availability configuration fails during operation, the status information of the failed node is determined and the remaining nodes in the configuration take over the work that the failed node was running, so that the work continues and service downtime is minimized.

In a conventional high availability system, when a specific node fails, every node in the high availability configuration re-examines the failure state on its own, updates its status information for that node, and responds to the failure.

A node in the high availability configuration can re-examine the failure of a specific node in three ways: through the private network, through the public network, or through the shared file system channel.

Whether the shared file system channel can be used depends on the takeover configuration of the services that the node runs.

The result of the re-examination is classified as one of the following: a failure of the node's high availability daemon, a failure of the private network, a failure of both the private network and the public network, or a node down failure.

FIG. 1 is a configuration diagram of a high availability system consisting of four nodes and two services. Service 1 initially runs on node 1 and its takeover order is node 1 → node 2 → node 4; service 2 initially runs on node 3 and its takeover order is node 3 → node 4.
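The FIG. 1 configuration can be written down as a small data structure. The following is a minimal sketch, not the patented implementation: the names `TAKEOVER_ORDER` and `current_owner` are hypothetical, and the rule that a service runs on the earliest healthy node in its takeover order is an assumption drawn from the takeover orders given above.

```python
# Takeover orders taken from the FIG. 1 example; all names here are illustrative.
TAKEOVER_ORDER = {
    "service1": ["node1", "node2", "node4"],
    "service2": ["node3", "node4"],
}


def current_owner(service, failed_nodes, takeover_order=TAKEOVER_ORDER):
    """Return the first node in the service's takeover order that has not failed.

    Assumption: a service always runs on the earliest healthy node in its order.
    Returns None when every node in the order has failed.
    """
    for node in takeover_order[service]:
        if node not in failed_nodes:
            return node
    return None


print(current_owner("service1", set()))               # node1
print(current_owner("service1", {"node1"}))           # node2
print(current_owner("service1", {"node1", "node2"}))  # node4
print(current_owner("service2", {"node3"}))           # node4
```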

When the high availability daemons (daemon 1 to daemon 4) are started on all nodes to run the high availability system, the initial execution node of each service starts that service. Once all services are running, every node checks the state of the other nodes and of the services running on them.

Because node 1 is the initial execution node of service 1, node 1 runs service 1; because node 3 is the initial execution node of service 2, node 3 runs service 2. In this state, the high availability daemons of all nodes exchange status information over the private network, and each checks the state of the other nodes.

If a specific node fails, the remaining nodes no longer receive its status information. When a node stops receiving the status information of a specific node, it detects that node's failure and re-examines it.

The re-examination proceeds in the order: check through the private network → check through the public network → check through the shared file system channel. The failure of the node is determined as follows.

First, the node is checked through the private network. If the node responds, the failure is determined to be in the node's high availability daemon; if there is no response, '2' of FIG. 2 is performed.

Next, the node is checked through the public network. If the node responds, the failure is determined to be in the node's private network; if there is no response, '3' of FIG. 3 is performed.

A node that, according to the service takeover order, can check through the shared file system channel then checks through that channel. If the target node responds, the failure is determined to be in both its private network and its public network; if there is no response, a node down failure is determined.

A node that, according to the service takeover order, cannot check through the shared file system channel determines that a node down failure has occurred, as shown in FIG. 4.
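The re-examination sequence described above amounts to a short decision cascade. The sketch below is illustrative only: the enum `Failure`, the function `classify`, and the convention of passing `None` when the observing node cannot use the shared file system channel are all assumptions made for this example, and the three probe results are supplied as plain booleans rather than obtained from real network checks.

```python
from enum import Enum
from typing import Optional


class Failure(Enum):
    DAEMON = "high availability daemon failure"
    PRIVATE_NETWORK = "private network failure"
    PRIVATE_AND_PUBLIC_NETWORK = "private and public network failure"
    NODE_DOWN = "node down failure"


def classify(private_ok: bool, public_ok: bool, shared_fs_ok: Optional[bool]) -> Failure:
    """Classify a suspected node from the results of the three checks.

    shared_fs_ok is None when the observing node cannot use the shared
    file system channel at all (hypothetical convention for this sketch).
    """
    if private_ok:
        # The node answers on the private network, so only its daemon has failed.
        return Failure.DAEMON
    if public_ok:
        # No answer on the private network, but the public network still works.
        return Failure.PRIVATE_NETWORK
    if shared_fs_ok:
        # Both networks are silent, but the node answers through the shared channel.
        return Failure.PRIVATE_AND_PUBLIC_NETWORK
    # No answer on any usable channel, or the shared channel is unavailable.
    return Failure.NODE_DOWN


print(classify(False, False, True).value)  # private and public network failure
print(classify(False, False, None).value)  # node down failure
```

Note that the last two calls describe the same underlying fault seen by two different observers, which is the inconsistency the description goes on to discuss.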

For example, suppose the private network and the public network of node 1 both fail. Node 2 and node 4 are among the nodes that take over service 1, so they can check through the shared file system channel; they do so and determine that the private network and the public network of node 1 have failed.

Node 3, however, is not among the nodes that take over service 1 and cannot check through the shared file system channel, so it determines that a node down failure has occurred.
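Which observers can use the shared file system channel in this example follows from the takeover orders. A minimal sketch, assuming that two nodes can reach each other through the channel exactly when both appear in the takeover order of the same service; the function name `can_use_shared_channel` and the dictionary layout are hypothetical.

```python
# Takeover orders from the FIG. 1 example; names are illustrative.
TAKEOVER_ORDER = {
    "service1": ["node1", "node2", "node4"],
    "service2": ["node3", "node4"],
}


def can_use_shared_channel(observer, target, takeover_order=TAKEOVER_ORDER):
    """Return True if observer and target share the takeover order of some service.

    Assumption: membership in the same takeover order is what makes the
    shared file system channel available between two nodes.
    """
    return any(
        observer in order and target in order
        for order in takeover_order.values()
    )


for observer in ("node2", "node3", "node4"):
    print(observer, can_use_shared_channel(observer, "node1"))
# node2 True
# node3 False
# node4 True
```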

However, because every node in the high availability configuration described above re-examines the failure state on its own, updates its status information for the node, and responds to the failure, different nodes may judge the same failure differently. In that case the node status information of the high availability system becomes inconsistent, and the administrator may misjudge the state of the system.

The present invention was devised to solve this problem. Its object is to provide a failure handling method for a high availability system in which, when a specific node fails, all nodes re-examine the state of the failed node and share the results, so that the failure of the specific node is precisely detected and handled.

FIG. 1 is a configuration diagram of a high availability system consisting of four nodes and two services.

FIG. 2 shows, for the system of FIG. 1, the check through the private network when the daemon of node 1 fails.

FIG. 3 shows, for the system of FIG. 1, the check through the public network when the private network of node 1 fails.

FIG. 4 shows, for the system of FIG. 1, the check through the shared file system channel when the private network and the public network of node 1 fail.

FIG. 5 is an operation flowchart of the failure handling method for a high availability system according to the present invention.

FIG. 6 is a schematic diagram showing, for the method of FIG. 5, the individual check performed by each node when the private network and the public network of node 1 fail.

FIG. 7 shows, for the method of FIG. 5, the operation in which each node transmits its re-examination result.

To achieve the above object, the present invention comprises: a first step in which, when a failure of a specific node is detected, every node re-examines the failure state; a second step in which every node transmits its state information to all other nodes over the private network; and a third step in which each node collects the state information of the other nodes and combines it to confirm the failure state of the specific node.

Hereinafter, the operation and effects of the failure handling method for a high availability system according to the present invention are described in detail with reference to the accompanying drawings.

FIG. 5 is a flowchart showing the operation of the failure handling method for a high availability system according to the present invention. As shown, the method consists of: a first step of determining whether a failure has occurred in a specific node; a second step in which, if a failure has occurred, every node re-examines the failure state; a third step in which every node transmits its state information to all other nodes over the private network; and a fourth step of collecting the state information of the other nodes and combining it to confirm the failure state of the specific node. The operation of the invention is described below.

The operation of the method is described using the example given above for the conventional system, in which the private network and the public network of node 1 fail.

When a specific node fails, for example when the private network and the public network of node 1 fail after the high availability system has started, every node determines the state of node 1 on its own, as shown in FIG. 6.

Here, in FIG. 5, node 2 and node 4 check through the private network and get no response from node 1 (1), so they check through the public network; they again get no response from node 1 (2), so they check through the shared file system channel; this time node 1 responds (3), so they determine that the private network and the public network of node 1 have failed.

Node 3 checks through the private network and gets no response from node 1 (4), so it checks through the public network; it again gets no response from node 1 and cannot check through the shared file system channel (5), so it determines that node 1 is down.

Next, as shown in FIG. 7, every node transmits its judgment of node 1's failure state to the other nodes: node 2 and node 4 each send the result that the private network and the public network have failed, and node 3 sends the result that the node is down.

Each node then collects the results sent by the other nodes and combines this information, so that all nodes reach the same conclusion: the private network and the public network of node 1 have failed.

The criterion for combining the information is that a private network and public network failure is given higher priority than a node down failure when confirming the failure state of the specific node.
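The exchange and consolidation steps can be illustrated for the node 1 example. This is a minimal sketch under stated assumptions: only the two verdicts that the description ranks against each other are modelled, the numeric `PRIORITY` values and the names `Verdict` and `consolidate` are hypothetical, and message passing over the private network is replaced by an in-memory dictionary.

```python
from enum import Enum


class Verdict(Enum):
    NETWORK_FAILURE = "private and public network failure"
    NODE_DOWN = "node down"


# Only this ordering is stated in the description; the numbers are illustrative.
PRIORITY = {Verdict.NETWORK_FAILURE: 2, Verdict.NODE_DOWN: 1}


def consolidate(own_verdict, received):
    """Combine a node's own verdict with those received from its peers.

    The highest-priority verdict wins, so a single observer that reached the
    target through the shared file system channel overrides "node down".
    """
    reports = [own_verdict, *received.values()]
    return max(reports, key=lambda verdict: PRIORITY[verdict])


# Individual verdicts about node 1, as in the FIG. 6 example.
local = {
    "node2": Verdict.NETWORK_FAILURE,
    "node3": Verdict.NODE_DOWN,
    "node4": Verdict.NETWORK_FAILURE,
}

# Each node receives every other node's verdict and consolidates independently.
final = {}
for node, own in local.items():
    received = {peer: v for peer, v in local.items() if peer != node}
    final[node] = consolidate(own, received)

for node, verdict in final.items():
    print(node, verdict.value)
```

Because every node applies the same rule to the same set of reports, node 3 ends up with the same verdict as node 2 and node 4 even though its own check concluded that node 1 was down.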

The specific embodiments and examples given in the detailed description are intended only to clarify the technical content of the invention. The invention should not be interpreted narrowly as limited to these examples; various modifications are possible within the spirit of the invention and the scope of the claims set out below.

As described in detail above, when a specific node fails during operation of a high availability system, all nodes re-examine the state of the failed node and share the results, which keeps the high availability status information consistent.

Claims

A failure handling method for a high availability system, comprising: a first step in which, when a failure of a specific node is detected, every node re-examines the failure state;

a second step in which every node transmits its state information to all other nodes through a private network; and

a third step in which each node collects the state information of the other nodes and combines it to confirm the failure state of the specific node.

The method of claim 1, wherein the third step comprises:

confirming the failure state of the specific node by giving a private network and public network failure higher priority than a node down failure.