KR100693663B1

KR100693663B1 - System and Method for detecting obstacle of node

Info

Publication number: KR100693663B1
Application number: KR1020030099585A
Authority: KR
Inventors: 윤형수
Original assignee: 엘지엔시스(주)
Priority date: 2003-12-30
Filing date: 2003-12-30
Publication date: 2007-03-14
Also published as: KR20050068326A

Abstract

The present invention relates to a system configured with a heartbeat channel in which, when resource operation information for a node is received, the information is analyzed to determine whether any resource has stopped operating. If a stopped resource exists, the node is determined to have failed, failure occurrence information for the failed node is transmitted to another node, and the service that the failed node was performing is performed by that other node. Because the failure is detected at the system level, the invention can also be used in high-availability or clustering systems to detect and recover from node failures quickly and so provide continuous service.

Node failure, heartbeat

Description

System and Method for detecting obstacle of node

FIG. 1 is a diagram showing the structure of a conventional information communication system.

FIG. 2 is a flowchart illustrating a conventional node failure determination method.

FIG. 3 is a diagram schematically showing the configuration of an information communication system for monitoring node failures according to a preferred embodiment of the present invention.

FIG. 4 is a flowchart illustrating a method of changing the environment setting information for node resources according to a preferred embodiment of the present invention.

FIG. 5 is a flowchart illustrating a node failure determination method according to a preferred embodiment of the present invention.

<Explanation of symbols for the main parts of the drawings>

300: node
310: management module

320: shared disk

The present invention relates to a node failure detection method and system that, in a system environment configured with a heartbeat channel, uses the control board of a node to check whether a counterpart node has failed, so that failures among nodes can be detected and recovered from quickly.

In general, an information communication system consists of two or more multi-process systems, a shared disk, a heartbeat network, and information communication software. The information communication software is a key component: it analyzes the state of the system, delivers that information to the other nodes, and, when a node fails, hands over the service that node was performing to another node.

When a hardware or software failure in a system providing a service prevents that service from being provided, the information communication system recognizes the failure and takes over the service of the failed system, thereby guaranteeing continuity of service.

Recently, as information technology has advanced, hardware performance and stability have improved rapidly, but software complexity has increased in relative terms, so software has become the cause of many failures.

Accordingly, the ability to monitor and recover from software failures in information communication systems is becoming increasingly important.

FIG. 1 is a diagram showing the structure of a conventional information communication system.

Referring to FIG. 1, the information communication system includes a plurality of nodes (100a, 100b, ..., 100n, hereinafter referred to as 100), a heartbeat network through which the nodes 100 monitor one another and exchange information, and a shared disk 120 for sharing data.

A method of detecting and handling a failure in the information communication system configured as described above will now be described.

The information communication system recognizes failures by having the nodes 100 exchange information with one another over the heartbeat channel that connects them.

A failure of a node 100 can be recognized through the heartbeat channel in the following ways.

The first is a method that uses the network.

In this method, packets arriving from a counterpart node over the network connecting the nodes 100 are used to check whether the counterpart node is alive and whether its service is running.

When a first node transmits the information it must send to the counterpart node in a format agreed upon between the nodes, the other node receives the packet and checks whether the service running on the sending node is operating normally. If the check shows that the service is not operating normally, the receiving node recognizes that a failure has occurred and performs that service itself.
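The network-based check described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `HeartbeatMonitor`, the `timeout_s` threshold, and the `service_ok` flag carried in each packet are all assumptions introduced for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class HeartbeatMonitor:
    """Hypothetical receiver side of the network heartbeat check."""
    timeout_s: float  # assumed threshold; the source does not specify one
    last_seen: Dict[str, Tuple[float, bool]] = field(default_factory=dict)

    def record_packet(self, node_id: str, service_ok: bool,
                      now: Optional[float] = None) -> None:
        # Store when the peer's packet arrived and the service state it reported.
        now = time.monotonic() if now is None else now
        self.last_seen[node_id] = (now, service_ok)

    def is_failed(self, node_id: str, now: Optional[float] = None) -> bool:
        # A peer counts as failed if it never reported, went silent for longer
        # than the timeout, or reported that its service is not running.
        now = time.monotonic() if now is None else now
        entry = self.last_seen.get(node_id)
        if entry is None:
            return True
        seen_at, service_ok = entry
        return (now - seen_at) > self.timeout_s or not service_ok
```

A node that finds `is_failed(...)` true for a peer would then start the peer's service locally, as the paragraph above describes.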

The second is a method that uses the shared disk 120.

In this method, a device that the nodes 100 access and use in common is used to check for a failure of the counterpart node. The shared disk 120 is used to recognize a failure of the counterpart node when communication over the network is not possible.

When information about the counterpart node cannot be collected over the network, the counterpart node is judged to be normal if it keeps accessing the shared disk 120, and is recognized as failed if it has not accessed the disk after a certain period of time has elapsed. Examples of devices used as the shared disk 120 include storage devices such as disk arrays and tape libraries.
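The shared-disk fallback can be illustrated with a small decision function. This is only a sketch under stated assumptions: the function name `judge_peer`, the `grace_period_s` parameter, and the string verdicts are hypothetical, and how the last disk access time is obtained is left out.

```python
from typing import Optional


def judge_peer(network_info_available: bool,
               last_disk_access: Optional[float],
               now: float,
               grace_period_s: float) -> str:
    """Hypothetical verdict for a peer node, using the shared disk as fallback."""
    if network_info_available:
        # The shared disk is consulted only when the network cannot be used.
        return "use-network-check"
    if last_disk_access is not None and (now - last_disk_access) <= grace_period_s:
        # The peer is still touching the shared disk, so it is judged normal.
        return "normal"
    # No disk access within the grace period: recognized as failed.
    return "failed"
```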

The node failure determination method carried out by the methods above is summarized below with reference to FIG. 2.

FIG. 2 is a flowchart illustrating a conventional node failure determination method.

Referring to FIG. 2, when a first node fails by entering a hang state (S200), a normal node no longer receives packets from the first node (S202). The normal node then recognizes that the failed node is not using the shared disk.

After step S202, the normal node transmits a node resource operation information request command to the first node (S204). Here, the node resources may be a CPU, memory, a disk, and the like.

The normal node then determines whether operation response information corresponding to the node resource operation information request command is received from the first node (S206).

If the determination in step S206 is that operation response information has been received from the first node, the normal node judges the first node to be normal and does not take over the service that the first node was performing (S208). As a result, the service that the first node was performing is not provided to the client.

If the determination in step S206 is that no operation response information has been received from the first node, the normal node determines that the first node has failed and has another normal node perform the service that the first node was performing (S210).
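The conventional decision in steps S202 to S210 reduces to the small function below. It is a sketch for illustration only; the function name and the string results are assumptions, and the request and response exchange of S204 and S206 is abstracted into a boolean.

```python
def conventional_failure_check(packet_received: bool,
                               response_received: bool) -> str:
    """Hypothetical mirror of steps S202-S210 of the conventional flow."""
    if packet_received:
        # Heartbeat packets are still arriving, so no further check is made.
        return "normal"
    # S204/S206: the silent node was asked for its resource operation
    # information; response_received records whether it answered.
    if response_received:
        # S208: any response makes the node count as normal, so no takeover.
        return "normal"
    # S210: no response, so another normal node takes over the service.
    return "take-over"
```

Note that `conventional_failure_check(False, True)` returns `"normal"`: a node that has stopped sending heartbeat packets but still answers the request is never taken over.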

However, the conventional approach described above has the problem that a node whose operating system is in a hang state is recognized by other nodes as a normal node.

In addition, because the ping command uses only two of the OSI layers, it is not affected by a hang of the operating system. Other nodes therefore recognize the hung node as operating and do not take over the failed node's service.

Accordingly, an object of the present invention is to provide a node failure detection method and system that, in a system environment configured with a heartbeat channel, uses the control board of a node to check whether a counterpart node has failed, so that failures among nodes can be detected and recovered from quickly.

To achieve the above object, according to one aspect of the present invention, there is provided a method for detecting a node failure in a system configured with a heartbeat channel, wherein: when resource operation information for a node is received by a management module that manages the resource operation information of each node in the system, the management module analyzes the resource operation information to determine whether any resource has stopped operating; if a stopped resource exists, the management module determines that the corresponding node has failed; and the management module transmits failure occurrence information for the failed node to another node, and that other node performs the service that the failed node was performing.

The failure occurrence information includes the unique number of the failed node and the type of service performed by the failed node.

The service execution command includes a corresponding service registration command; another node that receives the service execution command registers the service according to the service registration command contained in it and then executes the service.
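The two messages described above, and the register-then-execute behaviour of the receiving node, might be modelled as follows. All class and field names here (`FailureInfo`, `ServiceExecutionCommand`, `StandbyNode`, `registration`) are hypothetical; the source specifies only what information the messages carry, not their format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FailureInfo:
    node_id: str                    # unique number of the failed node
    service_types: Tuple[str, ...]  # services the failed node was performing


@dataclass(frozen=True)
class ServiceExecutionCommand:
    service: str
    registration: dict  # embedded service registration command (assumed shape)


@dataclass
class StandbyNode:
    """Hypothetical node that receives a service execution command."""
    registered: Dict[str, dict] = field(default_factory=dict)
    running: List[str] = field(default_factory=list)

    def handle(self, cmd: ServiceExecutionCommand) -> None:
        # Register the service first, as the embedded registration command asks...
        self.registered[cmd.service] = cmd.registration
        # ...and only then start executing it.
        self.running.append(cmd.service)
```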

According to another aspect of the present invention, there is provided a node failure detection system for a system configured with a heartbeat channel, comprising: a node with a built-in control board; and a management module connected to each node in the system so as to monitor the state of the node through information received from the control board and to carry out failure recovery commands.

The control board consists of an information module that collects the resource operation information of the node and a communication module that transmits the collected resource operation information to the management module.

The management module is installed in a node or in a user terminal, and analyzes the resource operation information transmitted from the control board to determine whether the corresponding node has failed.

When a failed node is detected, the management module transmits failure occurrence information to another node so that the other node performs the service that the failed node was performing.

Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

FIG. 3 is a diagram schematically showing the configuration of an information communication system for monitoring node failures according to a preferred embodiment of the present invention.

Referring to FIG. 3, the information communication system includes a plurality of nodes (300a, 300b, ..., 300n, hereinafter referred to as 300), a heartbeat network through which the nodes 300 monitor one another and exchange information, a shared disk 320 for sharing data, and a management module 310 that monitors the state of the nodes 300 and performs failure recovery.

The nodes 300 are connected to one another through the heartbeat channel, and each node 300 has a built-in control board.

The control board consists of an information module that collects operation information for the node's resources (CPU, memory, and hard disk) and a communication module that communicates with the management module 310. The information module collects the operation information of the resources that correspond to the environment information preset by the user in the management module 310, and transmits it to the management module 310 through the communication module.

The management module 310 analyzes the resource operation information transmitted from the control board to determine whether the corresponding node has failed. When a node is determined to have failed, the management module notifies another node so that the other node performs the service of the failed node.

The management module 310 is installed in a node or in a user terminal and is connected to the control board.

The management module 310 also contains a management program that consists of a resource selection menu for selecting the resource nodes to be monitored and a service registration menu for having another node perform the service that was running on a failed node.

The user can therefore select the desired resource nodes in the management module 310 and change the environment setting information for the node resources.
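The split of the control board into an information module and a communication module can be sketched as below. This is an illustrative model only: the class names, the `probes` mapping of resource name to a boolean "is running" check, and the message dictionary are assumptions, since the source describes the board's hardware roles rather than a software interface.

```python
from typing import Callable, Dict, Iterable


class InfoModule:
    """Hypothetical collector of resource operation information."""

    def __init__(self, probes: Dict[str, Callable[[], bool]]):
        self.probes = probes  # resource name -> check returning True if running

    def collect(self, selected: Iterable[str]) -> Dict[str, bool]:
        # Collect only the resources named in the preset environment information.
        return {name: self.probes[name]() for name in selected if name in self.probes}


class CommunicationModule:
    """Hypothetical link from the control board to the management module."""

    def __init__(self, send: Callable[[dict], None]):
        self.send = send

    def transmit(self, node_id: str, info: Dict[str, bool]) -> None:
        self.send({"node": node_id, "resources": info})


class ControlBoard:
    def __init__(self, node_id: str, info: InfoModule, comm: CommunicationModule):
        self.node_id, self.info, self.comm = node_id, info, comm

    def report(self, selected: Iterable[str]) -> None:
        # Information module collects; communication module forwards the result.
        self.comm.transmit(self.node_id, self.info.collect(selected))
```

For example, a board built with probes for `cpu`, `memory`, and `disk` and asked to `report(["cpu", "disk"])` forwards a single message containing only those two entries.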

FIG. 4 is a flowchart illustrating a method of changing the environment setting information for node resources according to a preferred embodiment of the present invention.

Referring to FIG. 4, the management module runs the management program and connects to the control board (S400).

The management module then authenticates the user (S402) and transmits a resource operation information request command for each node to the control board (S404). User authentication is performed by receiving a user identification number, a password, and the like from the user and determining whether the received information has been registered in advance.

The control board then transmits resource operation information in response to the request command, and the management module receives the resource operation information transmitted from the control board (S406). The resource operation information includes operation information for all resources contained in the node.

The management module then displays the received node resource operation information (S408) and asks the user whether the environment setting information for the node resources should be changed (S410). In other words, the management module asks whether the user wants operation information for all resources or only for certain selected resources.

If, in response to the query in step S410, the user selects and deselects resources in order to change the environment setting information for the node resources, the management module changes the environment setting information according to the user's selections and deselections (S412).

The control board then transmits to the management module only the resource operation information that corresponds to the changed environment setting information.
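The effect of step S412 and of the filtered reporting that follows can be shown with two small helpers. They are a sketch under assumptions: the function names are hypothetical, and the environment setting information is modelled simply as a set of monitored resource names.

```python
from typing import Dict, Iterable, Set


def change_monitored_resources(current: Iterable[str],
                               select: Iterable[str] = (),
                               deselect: Iterable[str] = ()) -> Set[str]:
    """Apply the user's selections and deselections (S412)."""
    return (set(current) | set(select)) - set(deselect)


def filter_report(report: Dict[str, bool], monitored: Set[str]) -> Dict[str, bool]:
    """Keep only the operation information for the monitored resources."""
    return {name: state for name, state in report.items() if name in monitored}


# Usage: the user stops monitoring memory, so later reports omit it.
monitored = change_monitored_resources({"cpu", "memory", "disk"}, deselect=["memory"])
report = filter_report({"cpu": True, "memory": True, "disk": False}, monitored)
```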

FIG. 5 is a flowchart illustrating a node failure determination method according to a preferred embodiment of the present invention.

Referring to FIG. 5, when the management program is run and connects to the control board (S500), the management module receives from the control board the node resource operation information that corresponds to the preset environment setting information (S502). That is, after connecting to the control board, the management module transmits a node resource operation information request command to it. The control board then collects the resource operation information corresponding to the request command and transmits it to the management module.

After step S502, the management module analyzes the received node resource operation information to determine whether any resource has stopped operating (S504).

If the determination in step S504 is that a stopped resource exists, the management module determines that the corresponding node has failed and transmits failure occurrence information to another node (S506). Here, the failure occurrence information includes the failed node, the type of service the failed node was performing, and the like.

After step S506, the management module transmits to another node a command to perform the service that the failed node was performing (S508). At this point, the management module registers the service so that another node can perform the service that the failed node was performing. That is, the service execution command may include a corresponding service registration command.

The node that receives the service execution command registers the service and performs it.

If the determination in step S504 is that no stopped resource exists, the management module judges the node to be normal (S510).
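Steps S504 to S510 can be condensed into one function on the management module side. This is a minimal sketch, not the patented implementation: the function name `judge_node`, the message dictionaries, the `send` callback, and the choice of the first peer as the takeover target are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def judge_node(node_id: str,
               resource_info: Dict[str, bool],
               services: List[str],
               peers: List[str],
               send: Callable[[str, dict], None]) -> str:
    """Hypothetical management-module decision for one node's report."""
    # S504: look for any resource whose operation has stopped.
    stopped = [name for name, running in resource_info.items() if not running]
    if not stopped:
        return "normal"  # S510
    # S506: tell the other nodes which node failed and what it was serving.
    for peer in peers:
        send(peer, {"type": "failure_info", "node": node_id, "services": list(services)})
    # S508: ask one peer (assumed: the first) to register and run each service.
    if peers:
        for service in services:
            send(peers[0], {"type": "execute_service", "service": service, "register": True})
    return "failed"
```

Because the verdict depends on the resource states reported by the control board rather than on whether the node answers a request, a stopped resource leads to takeover even if the node itself still responds.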

The present invention is not limited to the embodiments described above, and many modifications can of course be made by those of ordinary skill in the art within the spirit of the invention.

As described above, because the present invention detects failures at the system level, it can also be used in high-availability or clustering systems, providing a node failure monitoring method and system that detects and recovers from node failures quickly so that service can be provided continuously.

In addition, the present invention can provide a node failure monitoring method and system that raises resource availability by adding resources even when system resources are insufficient, and that is inexpensive to implement because resources are checked by hardware signals.

In addition, according to the present invention, the occurrence of a failure can be detected even when the operating system is normal but a particular service is operating abnormally, so prompt action can be taken to provide continuous service. The invention can thus provide a node failure method and system that improves the efficiency of the user's computing work.

Claims

A method for detecting a node failure in a system configured with a heartbeat channel, the method comprising:

when resource operation information for a node is received by a management module that manages the resource operation information of each node in the system, analyzing, by the management module, the resource operation information to determine whether there is a resource whose operation has stopped;

determining that a failure has occurred in the corresponding node when the determination shows that there is a resource whose operation has stopped; and

transmitting, by the management module, failure occurrence information for the failed node to another node, the other node performing the service that the failed node was performing.

The method of claim 1,

The failure occurrence information includes a node identification number of a failed node and a type of service performed by the failed node.

The method of claim 1,

The service execution command includes a corresponding service registration command.

The method according to claim 1 or 3,

wherein another node that receives the service execution command registers the corresponding service in accordance with the service registration command in the service execution command and then executes the service.

A node failure detection system in a system configured with a heartbeat channel, comprising:

a node with a built-in control board; and

a management module connected to each node in the system to monitor the state of the node through information received from the control board and to carry out a failure recovery command.

The system of claim 5,

wherein the control board comprises an information module for collecting resource operation information of a node and a communication module for transmitting the collected resource operation information to the management module.

The system of claim 5,

wherein the management module is installed in a node or in a user terminal.

The system according to claim 5 or 7,

wherein the management module determines whether a node has failed by analyzing the resource operation information transmitted from the control board.

The system of claim 8,

wherein the management module, upon detecting a failed node, transmits failure occurrence information to another node so that the other node performs the service that the failed node was performing.