KR19990050461A - Error Handling in High Availability Systems

Info

Publication number: KR19990050461A
Application number: KR1019970069587A
Authority: KR
Inventors: 김재민
Original assignee: 구자홍; 엘지전자 주식회사
Priority date: 1997-12-17
Filing date: 1997-12-17
Publication date: 1999-07-05

Abstract

The present invention relates to an error handling method for a high availability system. When a fault occurs in an input/output board or another board other than the system control board while the high availability system is operating, added logic detects errors in both the system control board and the input/output board. This prevents in advance the service interruption caused by malfunctions that arise while the status information of the counterpart system is checked over the public network and the serial line, and thereby minimizes system downtime.

This error handling method is achieved by including: a step in which, while two systems of identical configuration operate and share services with each other through their respective input/output processes, an error that the error detection process finds in either system is reported to the high availability management process; and a step in which, once the error information has been received, if the error occurred in the input/output process of either system, an operation to bring down the affected system is performed by means of an input/output control command, and, if no error occurred, error detection continues.

Description

Error Handling in High Availability Systems

The present invention relates to a high availability system, and more particularly to an error handling method for a high availability system that minimizes system downtime by resetting the system control board when an error occurs in a board other than the system control board during operation.

In general, a computer system is designed so that it does not halt because of software or hardware faults. It runs programs that check, for example, whether the internal operation of the processor has become unstable, so that data integrity is maintained. When a fault does occur, the faulty part is found early and the fault state is determined automatically. These measures improve the reliability, availability, and serviceability of the system.

Improving system performance in this way has required a close technical link between hardware and software. In a redundant system, the state of the counterpart system has typically been monitored over a network channel and over the SCSI (Small Computer System Interface) bus, a bus for controlling system peripherals.

FIG. 1 is a schematic block diagram of a high availability system according to the prior art. As shown, it comprises a first system 20, a second system 30, and a cluster management system 10, each connected through an interface to a public LAN; a serial line that interconnects the first system 20 and the second system 30; and a shared disk 40 that the two systems share.

Preferably, the first system 20 and the second system 30 each include server and client processes 24, 26, 34, 36, an error detection process 28, 38 that detects errors, and a cluster management daemon process (CMSD, hereinafter "daemon process") 22, 32.

The error handling procedure of the high availability system configured in this way is described in detail below with reference to FIG. 2.

First, when the first and second systems 20, 30 start up, each runs a high availability setup process (HASETUP process) to synchronize with the counterpart (remote) system (ST10, ST11).

The high availability setup process runs on each of the first and second systems 20, 30. To monitor the state of the counterpart system (heartbeat), it configures the serial line that connects the two systems to use SLIP (Serial Line Internet Protocol) (ST12, ST13).

Next, the high availability setup process uses the state monitoring function of the serial line to check whether the counterpart system is ready and, if it is not, polls the counterpart system until it is. For example, when the second system 30 is being checked, the first system 20 periodically checks the state of the second system 30 over the serial line until that system becomes active (ST14, ST15).
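
The polling in steps ST14 and ST15 can be sketched as a simple loop. This is a minimal illustration, not the patented implementation: the name `wait_until_ready`, the `probe` callback (standing in for the serial-line state check), the polling interval, and the optional `max_attempts` limit are all assumptions introduced here. The patent itself describes polling until the counterpart becomes active, with no attempt limit.

```python
import time
from typing import Callable, Optional


def wait_until_ready(
    probe: Callable[[], bool],
    interval_s: float = 1.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the counterpart system until probe() reports that it is ready.

    probe        -- hypothetical stand-in for the serial-line state check
    interval_s   -- assumed delay between two checks
    max_attempts -- assumed safety limit; None polls indefinitely
    sleep        -- injectable so the loop can be exercised without waiting
    """
    attempts = 0
    while True:
        if probe():
            return True  # counterpart is active
        attempts += 1
        if max_attempts is not None and attempts >= max_attempts:
            return False  # gave up before the counterpart became active
        sleep(interval_s)
```

Passing `sleep` as a parameter is only a convenience for testing; a real setup process would simply wait between checks.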

Once the counterpart system is active, the setup process checks whether it became active in the high availability state (ST16, ST17). If the counterpart system is active in the high availability environment, the local system releases the resources of the other system, waits for the command indicating that the release is complete, and then starts monitoring the state of the counterpart system over the serial line (ST18, ST19, ST23).

If the counterpart system is not active in the high availability state, both the first and second systems 20, 30 are in their initial state. In that case the setup process sets up the public LAN and starts the processes needed to form the high availability system: the HAM (High Availability Manager) process, the error detection (fault detection) process, and the daemon process (ST20, ST22).

The high availability process started by the high availability setup process periodically polls the status information of the counterpart system, using RPC (Remote Procedure Call) to call the counterpart system. The status information is obtained over the two paths described above: the public network and the serial line.

The status of the counterpart system is monitored as follows. If the first method, obtaining the status information over the public network, fails (ST24), the second method is used and the status of the counterpart system is checked over the serial line (ST26). If the check over the serial line also fails, the counterpart system is considered to have failed (ST28). The cluster management system 10 is then checked, and work is carried out so that the services the counterpart system had been providing are provided by the other system (ST30).
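
The two-path check described above (public network first, serial line as fallback, takeover only when both fail) can be sketched as follows. This is an illustrative sketch under stated assumptions: `PeerStatus`, `check_peer`, `monitor_once`, and the three callbacks are hypothetical names, and the check of the cluster management system 10 that precedes the takeover is left out for brevity.

```python
from enum import Enum
from typing import Callable


class PeerStatus(Enum):
    ALIVE_VIA_NETWORK = "alive (public network)"
    ALIVE_VIA_SERIAL = "alive (serial line)"
    FAILED = "failed"


def check_peer(
    probe_network: Callable[[], bool],
    probe_serial: Callable[[], bool],
) -> PeerStatus:
    """Check the counterpart over the public network, then the serial line."""
    if probe_network():
        return PeerStatus.ALIVE_VIA_NETWORK
    # The network check failed (ST24), so fall back to the serial line (ST26).
    if probe_serial():
        return PeerStatus.ALIVE_VIA_SERIAL
    # Both paths failed: the counterpart is treated as failed (ST28).
    return PeerStatus.FAILED


def monitor_once(
    probe_network: Callable[[], bool],
    probe_serial: Callable[[], bool],
    take_over: Callable[[], None],
) -> PeerStatus:
    """Run one monitoring cycle and start the takeover if the peer has failed."""
    status = check_peer(probe_network, probe_serial)
    if status is PeerStatus.FAILED:
        take_over()  # hypothetical hook for the service takeover (ST30)
    return status
```

Note that the serial-line probe runs only when the network probe fails, so a healthy counterpart costs one check per cycle.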

Services are taken over as follows. The client processes 26, 36 of the first and second systems 20, 30 must be able to keep working regardless of a failure in the server processes 24, 34. That is, the client processes 26, 36 that request services need to operate continuously, so the service is provided over the Internet Protocol. So that the client processes 26, 36 can still reach the service by accessing the Internet Protocol address of the same server process 24, 34, the service is set up on that same Internet Protocol address.

The services that the failed system had been providing are then handed over to the counterpart system, which keeps the system operating. When service provision is complete, the takeover described above is reverted in response to a restore command and the system returns to the normal state (ST32 to ST38).

However, in the prior-art error handling method described above, both paths used to obtain the counterpart system's information periodically and to detect a failure, namely monitoring over the public network and monitoring over the serial line, run through the system control board. Consequently, if a board other than the system control board fails while the system control board keeps operating normally, the system is in fact faulty but is treated as operating normally.

In this case the client processes connected over the network have lost service. The failed system should hand its services over to the counterpart system, but because the system control board appears to be operating normally, no handover of services between the systems takes place.

Accordingly, an object of the present invention is to provide an error handling method for a high availability system that makes the system more stable. When a fault occurs in an input/output board or another board other than the system control board during operation, added logic detects errors in both the system control board and the input/output board, which minimizes the service downtime caused by malfunction of the public network and serial line checks.

FIG. 1 is a schematic block diagram of a high availability system according to the prior art;

FIG. 2 is a flowchart of the error handling procedure of the high availability system of FIG. 1; and

FIG. 3 is a flowchart of the error handling procedure of a high availability system according to the present invention.

<Explanation of symbols for main parts of the drawings>

10: cluster management system
20, 30: first and second systems
22, 32: daemon processes
24, 34: server processes
26, 36: client processes
28, 38: error detection processes
40: shared disk

To achieve the above object, the error handling method for a high availability system according to the present invention comprises: a first step in which, while two systems of identical configuration operate and share services with each other through their respective input/output processes, an error that the error detection process finds in either system is reported to the high availability management process; and a second step in which, once the error information has been received, if the error occurred in the input/output process of either system, an operation to bring down the affected system is performed by means of an input/output control command, and, if no error occurred, error detection continues.

Preferably, the system-down operation of the second step includes: passing the error information to a kernel function by means of an input/output control function; and resetting the system control board by the kernel function that received the error information.

A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

The high availability system according to the prior art, described at the beginning of this specification, also applies to the present invention. In describing the configuration of the embodiment, its description is therefore omitted where possible to avoid repetition, and the same reference numerals are used.

First, to determine the status of the first and second systems 20, 30, the high availability system uses both the method based on the public network and the method based on the serial line. When either system starts up, it runs the initial setup process to synchronize with the counterpart system.

That is, in the running high availability system, monitoring of the counterpart system is carried out continuously, and the information is obtained over the public network and the serial line. If an error is detected while this information is being obtained, the services of the failed system are handed over to the counterpart system so that service continues.

In this sequence, if status monitoring of the counterpart system over the public network fails, the second method, monitoring over the serial line, is carried out. If that monitoring also fails, the counterpart system is regarded as having failed, and the services it had been providing are provided by the other system.

If an error occurs in a board other than the system control board, however, it cannot be detected in this way, and the system is treated as operating normally. To prevent this, an error detection process (Fault Detection Process) is run as shown in FIG. 3. It determines the state of the input/output process and of the system control board so that service continues to be provided.

When the error detection process runs (ST100), it first determines whether an error has occurred in the input/output process or the system control board. If an error has occurred, it passes information about the error to the high availability management process (High Availability Manager Process) (ST102, ST104, ST106).

If the system control board is operating normally and only the input/output process has an error (ST108), the system is operating abnormally but the error goes unrecognized, so the services cannot be handed over to the counterpart system. The system control board is therefore reset, so that the status checks made over the public network and the serial line are seen to fail.
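
One way to read the decision made in steps ST102 to ST110 is sketched below. The names `FaultReport`, `Action`, `fault_detection_step`, `notify_ham`, and `request_reset` are hypothetical and introduced only for illustration. The sketch also assumes that a fault limited to the system control board needs no explicit reset, because in that case the counterpart's status checks already fail; the patent does not spell this branch out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Action(Enum):
    KEEP_MONITORING = "keep monitoring"
    RESET_CONTROL_BOARD = "reset control board"


@dataclass(frozen=True)
class FaultReport:
    io_fault: bool             # error seen in the input/output process or board
    control_board_fault: bool  # error seen in the system control board


def fault_detection_step(
    report: FaultReport,
    notify_ham: Callable[[FaultReport], None],
    request_reset: Callable[[], None],
) -> Action:
    """Handle one pass of the error detection process (ST102 to ST110)."""
    if report.io_fault or report.control_board_fault:
        # Any detected error is forwarded to the high availability
        # management process.
        notify_ham(report)
    if report.io_fault:
        # ST108: the control board alone would still look healthy to the
        # counterpart, so a reset is requested to make the failure visible.
        request_reset()
        return Action.RESET_CONTROL_BOARD
    # ST110: no input/output error, so detection simply continues.
    return Action.KEEP_MONITORING
```

Keeping the notification and the reset as injected callbacks separates the decision from the mechanism, which in the patent is an input/output control call into the kernel.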

When error information cannot be obtained through the input/output board, the system cannot be brought down using commands that reside on disk. The system must therefore be brought down using a process that is already in memory and a kernel function.

The system-down operation is handled by the high availability management process, which is resident in memory and receives the error information for the input/output board from the error detection process. The high availability management process then uses the input/output control system call (IOCTL system call) to pass a reset command for the system control board to a kernel function (ST112).

The kernel function that receives the command resets the system control board by using a specific region of the system control board (ST114).
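
The user-space side of step ST112 might look like the following sketch, which uses Python's standard `fcntl.ioctl` on a UNIX-like system. The request code `HA_IOC_RESET_BOARD` and the idea that the high availability management process already holds an open descriptor for a driver are assumptions; the patent names neither a device node nor a request code, and the kernel-side reset (ST114) is not shown.

```python
import fcntl
from typing import Callable

# Hypothetical request code; the patent does not specify one.
HA_IOC_RESET_BOARD = 0x4801


def request_board_reset(
    fd: int,
    ioctl_fn: Callable[[int, int, int], object] = fcntl.ioctl,
) -> None:
    """Ask the kernel driver behind fd to reset the system control board.

    fd       -- descriptor assumed to have been opened by the memory-resident
                high availability management process before the fault occurred
    ioctl_fn -- injectable so the call can be checked without real hardware
    """
    ioctl_fn(fd, HA_IOC_RESET_BOARD, 0)
```

Because the request travels through a system call from a process that is already in memory, this path does not depend on loading a command from disk, which matches the constraint stated above.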

Once the system control board has been reset, the counterpart system's status checks over both the public network and the serial line fail. The failure is thereby recognized, and the services that the faulty system had been sharing are handed over to the counterpart system (ST116).

If no error has occurred in the input/output process, the error detection process simply continues to perform error detection (ST110).

As described in detail above, when a fault occurs in an input/output board or another board other than the system control board while the high availability system is operating, the present invention adds logic that detects errors in both the system control board and the input/output board. This prevents in advance the service interruption caused by malfunctions that arise while the status information of the counterpart system is checked over the public network and the serial line, and thereby minimizes system downtime.

Claims

An error handling method for a high availability system, comprising: a first step in which, while two systems of identical configuration operate and share services with each other through their respective input/output processes, an error that the error detection process finds in either system is reported to the high availability management process; and a second step in which, once the error information has been received, if the error occurred in the input/output process of either system, an operation to bring down the affected system is performed by means of an input/output control command, and, if no error occurred, error detection continues.

The method of claim 1, wherein the system-down operation comprises: passing the error information to a kernel function by means of an input/output control function; and resetting the system control board by the kernel function that received the error information.