KR20020065188A - Method for managing fault in computer system

Info

Publication number: KR20020065188A
Application number: KR1020010005588A
Authority: KR
Inventors: 송명호; 강남욱
Original assignee: 삼성전자 주식회사
Priority date: 2001-02-06
Filing date: 2001-02-06
Publication date: 2002-08-13

Abstract

PURPOSE: A method for managing faults in a computer system is provided to monitor the occurrence of faults in the computer system and report data on each fault to an upper system. CONSTITUTION: An application program requests a system service from an operating system and receives a return value. On checking the return value and finding that it indicates a fault, the application program calls a fault management program and passes it the type of the fault and the related value. The fault management program identifies the type of the fault (S210) and collects the fault data (S220). After completing the data collection, the fault management program configures a local database in a preset fault management storage region (S230) and reports the fault to the upper system by transmitting the fault data (S240). The fault data is stored in the local database (S250), and the fault management program carries out the preset fault recovery operation (S260).

Description

METHODS FOR MANAGING FAULT IN COMPUTER SYSTEM

The present invention relates to a computer system, and more particularly to a method for managing faults that occur in a computer system.

As is known, a computer system consists of a plurality of processor boards, each performing its own function. Each processor board has an operating system (OS) program that controls the operation of the whole board, at least one application program that runs under the operating system program and performs various functions, and additional programs.

FIG. 1 shows the program data storage area of a typical computer system. Referring to FIG. 1, the operating system program 110 executes at least one application program 120, a middleware program 130, a device driver 140, a hardware controller 150, and the like.

In a computer system configured as above, when an error occurs under an error condition predetermined externally or internally, the affected program determines that it has detected a fault and performs a predetermined fault management operation. The fault management operation of each program in the prior art is described in more detail below.

When a fault occurs in the application program 120, the operating system program 110 sets the return value passed to the application program to a value indicating the fault. If the return value received from the operating system program 110 indicates a fault, the application program 120 handles the fault on its own according to its type.

The middleware program 130 likewise checks the return value received from the operating system program 110 and, if a fault is detected, prints out the details of the fault itself. For example, if a send message cannot be transmitted because the buffer is full, it prints the error message "Tx Error(Buffer Full)" and increments the transmission error count.

Like the middleware program 130, the device driver 140 detects a fault from the return value and prints out the details of the fault itself. When the hardware controller 150 detects a fault in its hardware, it manages the fault on its own. As an exceptional case, when an interrupt occurs, the operating system program 110 saves the register values and the contents of the stack.

As described above, the prior art neither processes the faults that occur in a computer system nor records them in a database, and the fault data cannot be examined from an upper system. Consequently, no action can be taken on the various errors and faults that occur while the system is online, and problems cannot be resolved when they arise. In other words, because information on the faults occurring in the computer system is not reported to the upper system, it is difficult to identify and resolve the cause of a problem. The prior art therefore cannot store the details of a fault or record fault types systematically.

Accordingly, an object of the present invention, devised to solve the above problems of the prior art, is to provide a method of monitoring the occurrence of faults in a computer system, organizing the data on each fault into usable information, and reporting it to an upper system.

To achieve the above object, an embodiment of the present invention provides a fault management method for a computer system having an operating system program and, executed by the operating system program, an application program, a middleware program, a device driver, and a hardware controller, the method comprising:

when a fault occurs in any one of the application program, the middleware program, the device driver, and the hardware controller, calling a fault management program from the affected program and passing information on the fault to the fault management program;

identifying, in the fault management program, the type of the fault;

collecting fault data corresponding to the fault;

configuring a local database in a pre-allocated fault data storage area;

reporting the fault data to an upper system;

storing the fault data in the local database; and

performing a fault recovery operation corresponding to the fault.

FIG. 1 is a diagram showing the program data storage area of a typical computer system.

FIG. 2 is a diagram showing the program storage area of a computer system according to the present invention.

FIG. 3 is a flowchart showing the fault management operation of the software components according to the present invention.

FIG. 4 is a flowchart showing the operation of the fault management program according to the present invention.

Hereinafter, the operating principle of a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings. The same components are denoted by the same reference numerals wherever possible, even when they appear in different drawings. In the following description, detailed explanations of related well-known functions or configurations are omitted where they would unnecessarily obscure the subject matter of the present invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intentions or customs of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

FIG. 2 shows the program storage area of a computer system according to the present invention. Referring to FIG. 2, the operating system program 210 executes an application program 220, a middleware program 230, a device driver 240, a hardware controller 250, and a fault management program 260. In particular, the application program 220, the middleware program 230, the device driver 240, and the hardware controller 250 can each call the fault management program 260, and the fault management program 260 can access a fault data storage area allocated for fault management.

In a computer system configured as above, when an error occurs under an error condition predetermined externally or internally, the affected program determines that it has detected a fault, calls the fault management program, and passes it information about the fault. The fault management operation of each program according to the present invention is described in more detail below.

The software components, namely the application program 220, the middleware program 230, and the device driver 240, are programmed to check the return value received from the operating system program 210 and to call the fault management program 260 if the return value indicates a fault.

FIG. 3 is a flowchart of the fault management operation of the software components according to the present invention. Referring to FIG. 3, in step S110 the application program 220 (or the middleware program or device driver) requests a system service from the operating system program 210 and receives a return value in response. In step S120 the application program 220 (or the middleware program or device driver) checks whether the received return value indicates a fault and, if it does, calls the fault management program in step S130. At this point the application program 220 (or the middleware program or device driver) passes the type of the fault and its value to the fault management program.
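The S110 to S130 flow can be sketched roughly as follows. This is only an illustration: the names `call_with_fault_check` and `fault_manager`, and the use of `-1` as the fault-indicating return value, are assumptions made for the example and are not defined in the specification at this point.

```python
from typing import Callable, List, Tuple

FAULT_RETURN = -1  # assumed sentinel for "return value indicating a fault"

# Hypothetical stand-in for the fault management program 260: it only records calls.
reported: List[Tuple[int, int]] = []


def fault_manager(fault_type: int, fault_value: int) -> None:
    """Receive the fault type and its value from the calling program."""
    reported.append((fault_type, fault_value))


def call_with_fault_check(service: Callable[[], int], fault_type: int) -> int:
    """Request a system service and hand any fault to the fault manager."""
    ret = service()                     # S110: request the service, receive the return value
    if ret == FAULT_RETURN:             # S120: does the return value indicate a fault?
        fault_manager(fault_type, ret)  # S130: call the fault manager with type and value
    return ret


# A service that succeeds and one that fails (both hypothetical).
ok = call_with_fault_check(lambda: 0, fault_type=1)
failed = call_with_fault_check(lambda: FAULT_RETURN, fault_type=1)
```

Only the failing call reaches `fault_manager`, so the calling program no longer handles or prints the fault itself.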

The case of the middleware program 230 is as follows. When no resources are available among the system resources handled by the middleware, in particular the IPC (Inter-Processor Communication) receive and send buffers, timers, and redundancy, or when a parameter in a request from the application program 220 is invalid or external input data is abnormal, the operating system program 210 sets the return value to a value indicating a fault. For example, if a send message cannot be transmitted because the buffer is full, the return value is -1 and the fault type is set to 10. The middleware program 230 checks the return value and, if it determines that a fault has occurred, calls the fault management program 260 and passes it the fault type and the fault value.
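The buffer-full example might look like the sketch below. The return value `-1` and fault type `10` come from the text; the `SendBuffer` class (standing in for the operating system's send service), the success value `0`, and the function names are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple

FAULT_TYPE_TX_BUFFER_FULL = 10  # fault type given in the text
RET_FAULT = -1                  # return value given in the text
RET_OK = 0                      # assumed success value

fault_calls: List[Tuple[int, int]] = []  # what the fault manager received


def fault_manager(fault_type: int, fault_value: int) -> None:
    """Hypothetical stand-in for the fault management program 260."""
    fault_calls.append((fault_type, fault_value))


class SendBuffer:
    """Bounded send buffer standing in for the operating system's send service."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.queue: Deque[bytes] = deque()

    def send(self, message: bytes) -> int:
        if len(self.queue) >= self.capacity:
            return RET_FAULT  # no buffer space left
        self.queue.append(message)
        return RET_OK


def middleware_send(buf: SendBuffer, message: bytes) -> int:
    """Send a message; on a fault return value, call the fault manager."""
    ret = buf.send(message)
    if ret == RET_FAULT:
        fault_manager(FAULT_TYPE_TX_BUFFER_FULL, ret)
    return ret
```

With a buffer of capacity 1, the first `middleware_send` call succeeds and the second passes `(10, -1)` to the fault manager, replacing the prior-art behavior of printing "Tx Error(Buffer Full)" locally.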

The device driver 240, like the middleware program 230, checks the return value and calls the fault management program 260 when it detects a fault. For the operating system program 210 to be able to set return values relating to the device driver 240, fault types covering errors such as I/O errors must be added for the device driver 240.

Unlike the software components, the hardware controller 250 determines that a fault has occurred when the measured output of its hardware falls outside a predetermined range or when a read/write operation does not complete normally. It then calls the fault management program 260 and passes it the location of the fault and the difference value.
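A minimal sketch of the hardware range check is shown below. The text does not define the "difference value" precisely, so this example assumes it is the signed distance from the nearest bound of the permitted range. The function name and the `report` callback are hypothetical.

```python
from typing import Callable, List, Tuple


def check_hardware_output(
    location: str,
    measured: int,
    low: int,
    high: int,
    report: Callable[[str, int], None],
) -> bool:
    """Return True if the measurement is in range; otherwise report the fault.

    Assumption: the reported difference is the signed distance from the
    nearest bound (negative below the range, positive above it).
    """
    if measured < low:
        report(location, measured - low)
        return False
    if measured > high:
        report(location, measured - high)
        return False
    return True


# Hypothetical fault manager entry point that records (location, difference).
hw_faults: List[Tuple[str, int]] = []


def report_hw_fault(location: str, difference: int) -> None:
    hw_faults.append((location, difference))
```

For a permitted range of 0 to 10, a reading of 13 at a location named `"adc0"` would be reported as `("adc0", 3)`.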

As an exceptional case, when an interrupt occurs, the operating system program 210 saves the contents of the program counter and the fault address register and then notifies the fault management program 260 of the location.

Once called as described above, the fault management program 260 performs the fault management operation according to the present invention.

FIG. 4 is a flowchart showing the operation of the fault management program 260 according to the present invention. Referring to FIG. 4, in step S210 the fault management program 260 identifies the type of the fault from the information received from the calling program, and in step S220 it collects the corresponding fault data. Fault data consists of the items needed for fault management, predetermined by the operator, such as the location of the software or hardware where the fault occurred and the register and stack values. For example, for a fault in software, the fault data includes the task ID, program counter, and error value; for a fault in hardware, it includes the task ID, program counter, error type, error value, and CPU register values. Fault data is generated in a predetermined format and can be output by a debug command.
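One way to picture fault data "generated in a predetermined format" is a fixed record type with a single formatting routine for debug output. The field names and the text layout below are assumptions for illustration. The specification lists the kinds of fields but does not define a concrete format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class FaultRecord:
    """Hypothetical fault data record built from the fields named in the text."""

    task_id: int
    program_counter: int
    fault_type: int
    fault_value: int
    cpu_registers: Dict[str, int] = field(default_factory=dict)  # hardware faults only

    def format(self) -> str:
        """Render the record as one line, e.g. for a debug command."""
        line = (
            f"task={self.task_id} pc=0x{self.program_counter:08X} "
            f"type={self.fault_type} value={self.fault_value}"
        )
        regs = " ".join(
            f"{name}=0x{value:08X}"
            for name, value in sorted(self.cpu_registers.items())
        )
        return f"{line} {regs}" if regs else line


software_fault = FaultRecord(task_id=7, program_counter=0x1000, fault_type=10, fault_value=-1)
hardware_fault = FaultRecord(7, 0x2000, 3, 5, {"r0": 255})
```

Keeping the layout in one place means the record stored locally and the record sent to the upper system can share the same representation.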

When the collection of fault data is complete, in step S230 the fault management program 260 configures a local database in the fault management storage area pre-allocated for fault management. This storage area is allocated at system initialization and cannot be accessed by other programs.

In step S240 the fault management program 260 reports that a fault has occurred by transmitting the fault data to the upper system. The upper system stores the fault data, and the operator can analyze the fault data stored there to identify the cause of the fault promptly and resolve the problem.

In step S250 the fault data is stored in the local database configured in step S230. Then, in step S260, the fault management program 260 performs the predetermined fault recovery operation for the fault. Fault recovery may involve restarting or rebooting the system, or deleting the affected task.
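Steps S210 to S260 can be sketched end to end as below. This is a simplified model under stated assumptions: the local database is a plain in-memory list rather than a reserved storage area, the upper system is any callable that accepts a record, and the mapping from fault type to recovery action is hypothetical.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, int]
Action = Callable[[Record], str]


class FaultManager:
    """Hypothetical model of the fault management program 260."""

    def __init__(
        self,
        report_to_upper: Callable[[Record], None],
        recovery_actions: Dict[int, Action],
        default_action: Action,
    ) -> None:
        self.report_to_upper = report_to_upper
        self.recovery_actions = recovery_actions
        self.default_action = default_action
        self.local_db: Optional[List[Record]] = None  # configured in S230

    def handle(self, fault_type: int, fault_value: int, context: Record) -> str:
        # S210: identify the fault type and pick its recovery action.
        action = self.recovery_actions.get(fault_type, self.default_action)
        # S220: collect the fault data.
        record: Record = {"type": fault_type, "value": fault_value, **context}
        # S230: configure the local database if it does not exist yet.
        if self.local_db is None:
            self.local_db = []
        # S240: report the fault data to the upper system.
        self.report_to_upper(record)
        # S250: store the fault data in the local database.
        self.local_db.append(record)
        # S260: perform the predetermined recovery operation.
        return action(record)
```

A caller might construct it with `FaultManager(sent.append, {10: lambda r: "delete_task"}, lambda r: "reboot")`, where `sent` is a list standing in for the upper system. The order follows FIG. 4, so the fault is reported upstream before it is stored locally and before recovery starts.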

Although specific embodiments have been described in this detailed description, various modifications are of course possible without departing from the scope of the present invention. The scope of the present invention is therefore not limited to the described embodiments and should be determined by the claims below and their equivalents.

The effects obtained by a representative aspect of the invention disclosed above are briefly described as follows.

According to the present invention, when a system goes down or behaves abnormally during operation, the cause of the problem can be readily identified from the fault data reported to that system's upper system and the fault data stored on the affected processor board. This shortens the time needed to resolve the problem and allows it to be resolved at its root. In addition, when a fault is serious and requires immediate action, immediate action can be taken on site according to the type of fault.

Claims

A fault management method for a computer system comprising an operating system program and, executed by the operating system program, an application program, a middleware program, a device driver, and a hardware controller, the method comprising:

when a fault occurs in any one of the application program, the middleware program, the device driver, and the hardware controller, calling a fault management program from the affected program and passing information on the fault to the fault management program;

identifying, in the fault management program, the type of the fault;

collecting fault data corresponding to the fault;

configuring a local database in a pre-allocated fault data storage area;

reporting the fault data to an upper system;

storing the fault data in the local database; and

performing a fault recovery operation corresponding to the fault.

The method of claim 1, wherein the fault data includes a task ID, a program counter, a fault type, an error value, a register value, and a stack value.

The method of claim 1, wherein the fault recovery operation is a system restart, a system reboot, or deletion of the failed task.

The method of claim 1, wherein the application program, the middleware program, and the device driver check a return value received from the operating system program in response to a system service request, and determine that a fault has occurred if the return value indicates a fault.