KR100206472B1

KR100206472B1 - Error manage & recover method of switching system

Info

Publication number: KR100206472B1
Application number: KR1019960076703A
Authority: KR
Inventors: 이동준
Original assignee: 윤종용; 삼성전자주식회사
Priority date: 1996-12-30
Filing date: 1996-12-30
Publication date: 1999-07-01
Also published as: KR19980057414A

Abstract

A. Technical field to which the invention described in the claims belongs:

System failure management and recovery method in a full electronic switching system

B. Technical problem to be solved by the invention:

To enable rapid recovery from system failures, such as hardware and software faults, that may occur in a full electronic switching system.

C. Gist of the solution of the invention:

System failures such as hardware and software faults that may occur in a full electronic switching system are detected and stored; the stored failure history is then analyzed so that the failure can be recovered quickly.

D. Important use of the invention:

Maintenance of full electronic switching systems

Description

System failure management and recovery method in a full electronic switching system

The present invention relates to a full electronic switching system, and more particularly to a method for performing system failure management and recovery in a full electronic switching system.

The operating system (hereinafter, OS) of a typical full electronic switching system provides about 200 call routines for user programs. If a user programmer misuses the OS, the OS cannot carry out its normal processing, and in that case the OS reports the error to the user programmer. However, when the system goes down during operation because of a hardware or software error, the OS itself does not store information about the error.

Therefore, in the prior art, when the system restarts because of a hardware or software problem during operation of the switching system, no information about the system failure is stored, which makes the cause of the failure difficult to identify. As a result, rapid recovery from the failure is difficult.

Accordingly, an object of the present invention is to provide a method that enables rapid recovery from system failures, such as hardware and software faults, that may occur in a full electronic switching system.

Another object of the present invention is to provide a method that detects hardware and software faults that may occur in a full electronic switching system, recovers from them quickly, and notifies the operator of the system failure through a user port.

In accordance with the above objects, the present invention detects and stores system failures, such as hardware and software faults, that may occur in a full electronic switching system, and then analyzes the stored failure history so that the system failure can be recovered quickly.

FIG. 1 is a block diagram functionally illustrating system fault management and recovery in a full electronic switching system according to an embodiment of the present invention.

FIG. 2 is a block diagram of the OS managers configured for fault detection and output and for fault management and recovery according to an embodiment of the present invention.

FIG. 3 is a diagram showing the operating procedure of process manager 22 of FIG. 2 for checking a process abort fault.

FIG. 4 is a memory map for collecting and storing faults according to an embodiment of the present invention.

FIG. 5 is a memory map for recovering from faults according to an embodiment of the present invention.

FIG. 6 is a diagram showing one example of memory area allocation for fault collection, storage, and recovery.

FIG. 7 is a diagram illustrating a method of managing fault information and recovering from faults according to an embodiment of the present invention.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. Note that the same elements are denoted by the same reference numerals throughout the drawings wherever possible. Detailed descriptions of well-known functions and configurations that could unnecessarily obscure the subject matter of the present invention are omitted.

In the OS of the full electronic switching system according to an embodiment of the present invention, the functions within block 20 of FIG. 1 (the fault detection function, fault management function, fault recovery function, and fault output function) together provide various services to the user programmer. Referring to FIG. 1, when an operator program in user program unit 2 runs to drive the system and perform a specific function, any fault that occurs is sensed by the OS and reported to fault detection function unit 4. Fault detection function unit 4 detects faults that occur while the various OS managers perform their functions and passes the detected fault information to fault management function unit 6, fault recovery function unit 8, and fault output function unit 10. Fault management function unit 6 receives the fault information, analyzes it into detailed fault information, and stores it in a memory whose contents do not change even when the processor restarts. Fault recovery function unit 8 uses the detailed fault information stored in that memory: it refers to the information held in the memory of fault management function unit 6 and performs fault recovery according to the fault grade. Fault output function unit 10 outputs a message to the operator port so that the operator can see the fault information, which allows the operator to act quickly on system faults.

FIG. 2 is a block diagram of the OS managers that perform the functions described with reference to FIG. 1. In addition to process manager 22, the OS managers of FIG. 2 include eight managers: IPC manager (Inter Process Communication manager) 24, memory manager 26, I/O manager (Input/Output manager) 28, duplication manager 30, exception manager 32, time manager 34, file manager 36, and fault manager 38.

Among the managers of FIG. 2, process manager 22, IPC manager 24, memory manager 26, I/O manager 28, duplication manager 30, exception manager 32, time manager 34, and file manager 36 each perform both the function of fault detection function unit 4 and the function of fault output function unit 10 of FIG. 1. Fault manager 38 performs the functions of both fault management function unit 6 and fault recovery function unit 8 of FIG. 1.

The general operation of the managers 22 to 36 of FIG. 2, which perform the fault detection and fault output functions, is as follows. Process manager 22 performs overall management of the managers, IPC manager 24 manages internal and external communication in which message information is exchanged, and memory manager 26 manages the memories in an integrated manner. I/O manager 28 manages MMC (Man Machine Communication) processing, and duplication manager 30 manages active/standby state control of the duplicated system. Exception manager 32 performs management related to interrupt handling, and time manager 34 performs management related to the current system time. File manager 36 performs management for driving the HDD (Hard Disk Drive) and MTU (Magnetic Tape Unit).

By its nature, the OS of the present invention, implemented with the management functions of FIGS. 1 and 2, readily detects problems in the control system hardware and in user programs. It systematically collects the various faults, performs appropriate software recovery based on the collected information, and can report the result of recovery to the operator in detail.

Referring again to FIG. 2, the eight managers 22 to 36, including process manager 22, report every kind of OS fault caused by an event to fault manager 38. Fault manager 38 collects and stores the reported faults of each manager by type and, at the request of the manager, also stores and manages the fault information. Once collection and storage of the fault and fault information are complete, fault manager 38 starts the recovery processing that corresponds to the fault. Meanwhile, as soon as fault manager 38 starts recovery processing, the manager in which the fault occurred outputs this fact to the operator as Problem, Reason, Action, and Information items. Among the fault occurrences and processing results reported to the operator, those judged important for the service of the control system are handled first.

Fault detection in the manager where a fault occurs, and the fault management and recovery processing of fault manager 38, are described below for several kinds of fault.

(1) In-line check

The in-line check is performed by all of the managers 22 to 36 of FIG. 2. The OS of the TDX switching system provides about 140 primitives to users and performs its normal functions through many procedure calls inside the OS. Most of these user primitives and the OS's own procedures request and begin a service through parameter passing. Consequently, if an error in a passed parameter is not detected immediately, it can seriously affect system downtime. For this reason, parameter checks (validity check of input parameters, validity check of indexes into important data, and validity check of address pointers before use) are performed, focusing on those user primitives and OS procedures that could have a serious impact.

(2) Fault when calling a system primitive

A fault when calling a system primitive is checked in any of the managers 22 to 36 of FIG. 2. The cause is as follows. The primitives currently provided by the OS fall into three groups: user primitives, system primitives, and database primitives. Each group is further divided into individual functions, each assigned a specific number called a directive number. A user primitive is invoked by placing the directive number in the d0 register and executing trap 1; a system primitive, by placing the directive number in the d0 register and executing trap 0; and a database primitive, by placing the directive number in the d0 register and executing trap 2. Each group has a limited number of directives, and this fault occurs when a user program, through misuse, requests a primitive with a directive number that is out of range or unassigned.

The handling is as follows. Each trap handler in the OS checks whether the directive number requested by the user program is within the range currently provided. If the directive number is one that can be served, the handler executes the corresponding primitive; if it is out of range or unassigned, the handler returns an error value.
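The directive check described above can be sketched as a table lookup per trap number. This is only an illustration: the table contents, the names `TRAP_TABLES` and `trap_handler`, and the error value `E_BAD_DIRECTIVE` are assumptions, since the text does not give the actual directive numbers or error codes.

```python
# Hypothetical error value; the patent does not specify the real one.
E_BAD_DIRECTIVE = -1

# Hypothetical directive tables, one per trap number as described in the text:
# trap 0 = system primitives, trap 1 = user primitives, trap 2 = database primitives.
TRAP_TABLES = {
    0: {0: lambda: "sys_primitive_0", 1: lambda: "sys_primitive_1"},
    1: {0: lambda: "user_primitive_0", 2: lambda: "user_primitive_2"},  # directive 1 unassigned
    2: {0: lambda: "db_primitive_0"},
}


def trap_handler(trap_no, d0):
    """Run the primitive selected by directive number d0, or return an error value.

    d0 stands in for the value the caller placed in the d0 register.
    """
    table = TRAP_TABLES.get(trap_no)
    if table is None or d0 not in table:
        # Out-of-range or unassigned directive: hand an error back to the caller.
        return E_BAD_DIRECTIVE
    return table[d0]()
```

A dictionary is used here so that both out-of-range and unassigned directive numbers fall through the same check.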

(3) Parameter error fault

A parameter error fault is also checked in any of the managers 22 to 36 of FIG. 2. The cause is as follows. Some of the primitives provided by the OS must return several values to the user program after they execute. In that case the user program must allocate memory in its own area in advance and pass the address of that memory to the OS as the designated parameter of the primitive. This fault occurs when, through an error in the user program, the parameter is not set correctly and a value outside the user program's address area is passed to the OS.

The handling is as follows. When a primitive's processing routine in the OS requires a memory address of the user program among its parameters, it checks whether the parameter value lies within the address area of that user program and, if it does not, returns an error to the user program.
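A minimal sketch of such an address check follows, assuming the user program's address area is a single half-open range. The function name, the `length` argument, and the error value are illustrative assumptions and do not come from the text.

```python
# Hypothetical error value returned to the user program.
E_BAD_ADDRESS = -2


def check_user_buffer(addr, length, region_start, region_end):
    """Return 0 if [addr, addr + length) lies inside [region_start, region_end).

    Otherwise return E_BAD_ADDRESS so the primitive can reject the call
    before writing any result through the pointer.
    """
    if length <= 0 or addr < region_start or addr + length > region_end:
        return E_BAD_ADDRESS
    return 0
```

Checking the end of the buffer as well as its start covers the case where a valid start address would still let the OS write past the user program's area.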

(4) Process abort fault handling

The process abort fault check is performed by process manager 22 among the managers of FIG. 2. FIG. 3 shows the operating procedure of process manager 22 for checking a process abort fault. Referring to FIG. 3, a process abort can occur in an outmost process, which is the upper-level process, or in a child process, which is a lower-level process.

In the case of an outmost process abort, all processes of the block are aborted, the outmost process is regenerated, and this is reported to fault manager 38. In the case of a child process abort, only that child process is aborted, and the outmost process of the block is notified by a signal together with the important information. If the fault can be handled locally, the process abort is handled and then reported to fault manager 38 so that the fault history can be stored.
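The two abort cases can be sketched as follows. The block representation (a dictionary with `outmost`, `children`, and `signals`), the process identifiers, and the `fault_log` list standing in for the report to fault manager 38 are all assumptions made for illustration.

```python
import itertools

# Hypothetical source of fresh process identifiers for regenerated outmost processes.
_next_pid = itertools.count(100)


def handle_process_abort(block, pid, fault_log):
    """Handle an abort of process pid inside block and record it in fault_log."""
    if pid == block["outmost"]:
        # Outmost abort: abort every process in the block, then regenerate the outmost.
        block["children"].clear()
        block["outmost"] = next(_next_pid)
        fault_log.append(("outmost_abort", pid))
    elif pid in block["children"]:
        # Child abort: abort only that child and signal the block's outmost process.
        block["children"].discard(pid)
        block["signals"].append(pid)
        fault_log.append(("child_abort", pid))
    return block
```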

(5) Bus error fault handling

The bus error fault check is performed by exception manager 32 among the managers of FIG. 2. The cause is as follows. When a bug in a user program or in the OS code causes an access beyond the currently usable memory range, the MMU (Memory Management Unit) inside the CPU (Central Processing Unit) triggers an exception that the CPU raises to software. It can result from a software data error, a memory fault, and the like. This is the case in which a long bus error exception or a short bus error exception occurs.

The handling is as follows. The bus error handler of the OS checks whether the bus error occurred in the OS area or in the user program area. If it occurred in the OS area and the cycle was not a read-modify-write cycle, the processor switches over to the duplicated side and enters the standby state. If the cycle was a read-modify-write cycle, the bus error was caused by an MMU fault, and the error can be recovered in software by a method such as stack grow (allocating additional memory to extend the stack), the handler recovers so that normal execution can continue. Otherwise, it sends a fault message to the user program and aborts that user program.

(6) File/file system full check

A file/file system full condition means that, for example, a 300-megabyte disk has filled up and no more disk space can be allocated to the user. File manager 36 among the managers of FIG. 2 checks for this.

(7) I/O device open error check

This error occurs when a user program tries to open a device that does not exist in the system. I/O manager 28 among the managers of FIG. 2 checks for this.

(8) IPC/IPC queue full fault check

This fault occurs when no more free queues can be allocated, either because user programs have sent too many IPC messages or some other cause (for example, a bug in the OS code) has used up all of the initially allocated IPC queues, or because abnormal execution has corrupted the IPC queue. IPC manager 24 among the managers of FIG. 2 checks for this.

(9) Memory write violation check

This fault occurs when a write is made to a write-protected area of memory. Memory manager 26 among the managers of FIG. 2 checks for this.

(10) Time/time table full check

A time table full condition arises when time jobs do not time out and all of the time table entries are in use, that is, when no usable entry remains on the free linked list. Time manager 34 among the managers of FIG. 2 checks for this.
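The time table full condition can be illustrated with a fixed pool of entries handed out from a free list. The class and method names below are hypothetical, and a Python list stands in for the free linked list mentioned in the text.

```python
class TimeTable:
    """Fixed pool of timer entries allocated from a free list."""

    def __init__(self, size):
        self.free_list = list(range(size))  # indexes of unused entries
        self.active = {}                    # entry index -> time job name

    def start_job(self, name):
        """Allocate an entry for a time job, or return None when the table is full."""
        if not self.free_list:
            # No free entry left: this is the fault the time manager would report.
            return None
        slot = self.free_list.pop()
        self.active[slot] = name
        return slot

    def timeout(self, slot):
        """Release the entry of a job that has timed out."""
        del self.active[slot]
        self.free_list.append(slot)
```

If jobs never reach `timeout`, entries are never returned to the free list, which is the situation the check above is meant to catch.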

When a fault of one of the kinds described above occurs, fault manager 38 of FIG. 2, which performs the fault management and fault recovery functions, receives the fault information, analyzes it into detailed fault information, and stores it in a memory whose contents do not change even when the processor restarts. When performing fault recovery, it refers to the detailed fault information stored in that memory and performs recovery according to the fault grade.

FIG. 4 is a memory map, implemented in the memory of fault manager 38, for collecting and storing fault information. It has a buffer area 40 that stores the number of fault occurrences for each OS manager, a buffer area 50 that stores detailed fault data for each OS manager, and a buffer area 60 that stores primary and secondary recovery flag information for each OS manager.

Among the abbreviations in buffer areas 40, 50, and 60 of FIG. 4, PM denotes process manager 22, MM memory manager 26, IM IPC manager 24, TM time manager 34, FS file manager 36, IO I/O manager 28, EM exception manager 32, and DM duplication manager 30. With this in mind, consider the contents of buffer area 40, which stores the fault count for each OS manager: for example, PM_fcntr[150] means that the fault occurrence counter table of process manager 22 has 150 entries. In buffer area 50, for example, PM_finform[150][10] means that the fault information table of process manager 22 has 150 entries, each with 10 detail fields. In buffer area 60, for example, PM_rhis[150] means that the fault recovery history table of process manager 22 has 150 entries.
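The three per-manager tables can be modeled as below. Only the PM_ names and the sizes 150 and 10 appear in the text; applying the same layout to the other seven manager prefixes, the use of integer fields, and the helper names are assumptions for illustration.

```python
N_FAULT_CODES = 150  # entries per table, as in PM_fcntr[150]
N_DETAIL = 10        # detail fields per entry, as in PM_finform[150][10]
MANAGERS = ["PM", "MM", "IM", "TM", "FS", "IO", "EM", "DM"]


def make_fault_store():
    """Build per-manager tables mirroring buffer areas 40, 50, and 60."""
    return {
        m: {
            "fcntr": [0] * N_FAULT_CODES,                                  # area 40
            "finform": [[0] * N_DETAIL for _ in range(N_FAULT_CODES)],     # area 50
            "rhis": [0] * N_FAULT_CODES,                                   # area 60
        }
        for m in MANAGERS
    }


def record_fault(store, manager, code, details):
    """Count one occurrence of a fault code and keep up to N_DETAIL detail values."""
    tables = store[manager]
    tables["fcntr"][code] += 1
    kept = list(details)[:N_DETAIL]
    tables["finform"][code][:len(kept)] = kept
```

In the patent these tables live in memory that survives a processor restart; an in-process dictionary does not model that property.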

FIG. 5 is a memory map for fault recovery implemented in the memory of fault manager 38. It holds a fault recovery table frtbl[] for each of the managers 22 to 36 of FIG. 2. Each fault recovery table frtbl[] contains information such as the fault code, fault grade, primary recovery routine data, and secondary recovery routine data.
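One way to picture an entry of frtbl[] is a record with the four fields named above. The field names, the example fault code and grade, and the two recovery routines are hypothetical; the text does not list actual codes or routines.

```python
from collections import namedtuple

# One row of a fault recovery table: code, grade, and two recovery routines.
FaultRecovery = namedtuple("FaultRecovery", "fault_code grade primary secondary")


def restart_process():
    """Hypothetical primary recovery routine."""
    return "restart_process"


def restart_system():
    """Hypothetical secondary recovery routine."""
    return "restart_system"


# Hypothetical table with a single entry.
frtbl = [FaultRecovery(0x01, 1, restart_process, restart_system)]


def lookup(table, code):
    """Return the recovery entry for a fault code, or None if the code is unknown."""
    for entry in table:
        if entry.fault_code == code:
            return entry
    return None
```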

FIG. 6 shows one example of memory area allocation for fault collection, storage, and recovery, in which an address is assigned to each buffer area and table.

FIG. 7 is a diagram illustrating the method of managing fault information and recovering from faults that is performed by fault manager 38 of FIG. 2 according to an embodiment of the present invention.

When a fault of one of the kinds described above occurs in any of the managers 22 to 36 of FIG. 2, that manager notifies fault manager 38 of FIG. 2. In step 100 of FIG. 7, fault manager 38 determines whether the reported fault information corresponds to a fault code belonging to one of the managers, and proceeds to step 102 only if it does. In step 102 it receives the fault information, analyzes it, and produces the detailed fault information appropriate to each manager, namely the per-manager fault count, per-manager detailed data, and per-manager primary and secondary recovery flag information described with reference to FIG. 4. In step 104 it stores this information in the corresponding management table in buffer areas 40, 50, and 60 of FIG. 4. It also stores detailed fault recovery information, such as the fault code, fault grade, primary recovery routine, and secondary recovery routine, in the corresponding recovery table of FIG. 5.

Fault manager 38 then refers to the detailed fault recovery information stored in the fault recovery table frtbl[] of FIG. 5 and performs fault recovery according to the fault grade. At the same time as fault recovery is performed, the manager in which the fault occurred outputs the fault to the operator as Problem, Reason, Action, and Information items through an operator port such as a printer, PC screen, or CRT.

The fault recovery function is carried out in steps 106 to 112 of FIG. 7. First, in step 106, fault manager 38 refers to whether the primary recovery flag is set in the corresponding fault recovery history table rhis[150] in buffer area 60 of FIG. 4 to decide whether primary recovery applies; if the primary recovery flag is set, it proceeds to step 108. In step 108 it recovers from the fault using the primary recovery routine data in the corresponding fault recovery table frtbl[] of FIG. 5, and in step 110 it sets the secondary recovery flag in the corresponding fault recovery history table rhis[150] in buffer area 60 of FIG. 4 so that a second occurrence of the fault can be handled.

If, on the other hand, the primary flag is found not to be set in step 106, fault manager 38 confirms that the secondary flag is set and proceeds to step 112. In step 112 it recovers from the fault using the secondary recovery routine data in the corresponding fault recovery table frtbl[] of FIG. 5.
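Steps 100 and 106 to 112 can be sketched as a small state machine per fault code. This is a simplified illustration: the text does not say how the flags are encoded or whether the primary flag is cleared in step 110, so the sketch assumes a single flag value per fault code that moves from PRIMARY to SECONDARY. The storage work of steps 102 and 104 is omitted, and all names are hypothetical.

```python
PRIMARY, SECONDARY = 1, 2  # assumed encoding of the recovery flags in rhis


def handle_fault(code, known_codes, rhis, frtbl):
    """Run the recovery routine selected by the recovery flag for a fault code.

    rhis maps fault code -> current flag; frtbl maps fault code ->
    (primary_routine, secondary_routine).
    """
    # Step 100: ignore reports whose fault code belongs to no manager.
    if code not in known_codes:
        return None
    primary, secondary = frtbl[code]
    if rhis[code] == PRIMARY:      # step 106: primary flag set?
        result = primary()         # step 108: primary recovery
        rhis[code] = SECONDARY     # step 110: arm secondary recovery (assumed to replace the primary flag)
    else:
        result = secondary()       # step 112: secondary recovery
    return result
```

Under this assumption, the first occurrence of a fault runs the primary routine and a repeat occurrence escalates to the secondary routine.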

Meanwhile, as fault recovery is being performed, the operator learns of the fault through the printer, PC screen, CRT, or similar device from the report sent by the manager in which the fault occurred.

In the present invention as described above, faults are systematically analyzed and collected in a specific memory, appropriate software recovery is performed according to the type of fault, and fault-related managers capable of reporting the result to the operator in detail are implemented. This improves the control system's tolerance to faults and conveys internal processing results to the operator in detail. In addition, all fault information is stored in a specific memory that is not erased even when the processor restarts. Because this information is updated promptly and the number of occurrences of each fault can be analyzed, system problems can be recovered and acted on quickly.

Although specific embodiments have been described above, various modifications can be made without departing from the scope of the present invention. The scope of the present invention should therefore be determined not by the described embodiments but by the claims and their equivalents.

As described above, the present invention immediately detects faults that occur in the switching system and allows them to be recovered with minimal system loss before they affect the system. Through its fault collection and output functions, it also lets the operator grasp the state of the system immediately and respond to it.

Claims

1. A system failure management and recovery method in a full electronic switching system having a plurality of managers for detecting and outputting system failures and for managing and recovering from them, the method comprising:

detecting, by a corresponding detection manager, a fault caused by a software or hardware failure in the system;

analyzing the fault detected by the corresponding detection manager to produce detailed fault management and recovery information for each type, and storing the information in a memory for fault management and fault recovery; and

when the detailed information for each type has been stored in the memory, performing recovery processing with reference to the stored detailed fault recovery information and outputting the fault information to an operator.

2. The method of claim 1, wherein the detection managers comprise:

a process manager that performs overall management of the managers, an IPC manager that manages internal and external communication in which message information is exchanged, a memory manager that manages memories in an integrated manner, an I/O manager that manages MMC (Man Machine Communication) processing, a duplication manager that manages active/standby state control of the duplicated system, an exception manager that performs management related to interrupt handling, a time manager that performs management related to the current system time, and a file manager that performs management for driving an HDD (Hard Disk Drive) and an MTU (Magnetic Tape Unit).

3. The method of claim 1, wherein the memory for fault management and fault recovery comprises:

a fault management memory section including a buffer area that stores the number of fault occurrences for each manager, a buffer area that stores detailed fault data for each manager, and a buffer area that stores primary and secondary recovery flag information for each manager; and

a fault recovery memory section having a fault recovery table for all of the managers.

4. The method of claim 3, wherein the fault recovery table includes information such as a fault code, a fault grade, primary recovery routine data, and secondary recovery routine data.

5. The method of claim 3, wherein the buffer areas have tables for each manager.

6. The method of claim 5, wherein each of the tables has 150 entries.