KR970007401B1

KR970007401B1 - Disorder management method of distributed processing system using event number

Info

Publication number: KR970007401B1
Application number: KR1019930012863A
Authority: KR
Inventors: 김철수; 김한경; 김현숙
Original assignee: 조백제; 한국전기통신공사; 양승택; 재단법인 한국전자통신연구소
Priority date: 1993-07-08
Filing date: 1993-07-08
Publication date: 1997-05-08
Also published as: KR950005069A

Abstract

The present invention relates to a failure management method for a distributed processing system that uses an event number for each failed resource. The method comprises a processor failure management function, which collects (31) the failure, fault, and alarm messages generated by resources, and a system failure management function. When a failure occurs, an event number is allocated (32) by the system failure management function and each state transition of the resource is managed under it, so that the resource's failure, recovery request, and current status all carry the same event number. This makes maintenance and programming easier and improves system reliability.

Description

Fault management method using event number by fault resource in distributed processing system

FIG. 1 is a structural diagram of an asynchronous transfer mode (ATM) switching system to which the present invention is applied.

FIG. 2 is a diagram of the path along which failure, fault, and alarm messages are collected according to the present invention.

FIG. 3 is a processing flowchart according to the present invention.

* Explanation of symbols for main parts of the drawings

A: CIM, B: ASM

1: ASMP (Access Switching Maintenance Processor)

2: BCP (Broadcasting Call Processor)

3: CIMP (Central Interconnection Maintenance Processor)

4: GSP (Global Service Processor)

5: MMCP (Man Machine Control Processor)

6: NSCP (Network Synchronization Control Processor)

7: NTP (Number Translation Processor)

8: OMP (Operation and Maintenance Processor)

9: RCIP (Remote Center Interface Processor)

10: SCP (Subscriber Call Processor)

11: SH (Signalling Handler)

12: TCP (Trunk Call Processor)

The present invention relates to a failure management method that uses an event number for each failed resource in a distributed processing system.

In general, a distributed processing system such as an electronic switching system consists of tens to thousands of processors, and all but a few of them are subscriber-related processors.

Most of these subscriber-related processors have the same resources, and failures of these resources are of the same nature, so the resulting messages have the same form and differ only in where they occurred (the processor).

Therefore, to trace the action taken on a failed resource and its current status after a failure occurs in a particular processor, the operator has to search through all of the printed output. This takes so much time that effective maintenance is practically impossible.

The present invention was devised to solve this problem. In a distributed processing system such as an electronic switching system, it dynamically assigns an event number to the failure of a specific resource, so that the output messages reporting the failure occurrence, self-test result, action taken, and current status of that resource all carry the same printout event number. Its object is to provide a failure management method using per-resource event numbers that offers an algorithm for efficiently managing the self-recovery history of failed resources, thereby providing stable system service and making system management easier for the operator.

To achieve this object, the present invention is a failure management method applied to a distributed processing system, comprising the steps of: determining a natural number n according to the message output volume and period of the system, and allocating printout event numbers from 0 to 10ⁿ-1 for received events; when event (fault, failure, alarm) information for a specific resource is received and the event numbers 0 to 10ⁿ-1 have not all been assigned, dynamically assigning one of 0 to 10ⁿ-1 as the event number for the received event; when event information for a specific resource is received and the event numbers 0 to 10ⁿ-1 have all been assigned, allocating new event numbers from 0 to 10ⁿ-1 and notifying the operator; deleting the assigned event number when the action on the received event is complete; after allocating new event numbers, checking the system load to determine whether the system is busy with calls; and, if it is busy, printing the unresolved failures and requesting action from the operator, or, if it is not busy, requesting an audit of the remaining event numbers.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a structural diagram of an asynchronous transfer mode (ATM) switching system to which the present invention is applied. In the drawing, A denotes the CIM, B the ASM, 1 the ASMP, 2 the BCP, 3 the CIMP, 4 the GSP, 5 the MMCP, 6 the NSCP, 7 the NTP, 8 the OMP, 9 the RCIP, 10 the SCP, 11 the SH, and 12 the TCP.

As shown in the drawing, the ATM switching system is structured as follows.

The ASMP (Access Switching Maintenance Processor) (1) controls maintenance work within the ASM (B), such as fault collection and testing of hardware resources in the ASM (B). The BCP (Broadcasting Call Processor) (2) is the processor in charge of multi-connection (broadcast) service control; it is located in the CIM (A) and the ASM (B) and controls the overall multi-connection function during call processing at the CIM (A) and the terminating ASM (B).

The CIMP (Central Interconnection Maintenance Processor) (3) controls maintenance work within the CIM (A), such as fault collection and testing of hardware resources in the CIM (A). The GSP (Global Service Processor) (4) controls services provided system-wide, such as recorded announcements.

The MMCP (Man Machine Control Processor) (5) enables dialogue between the operator and the system, or between the operations center and the system. For operator-system dialogue, VDUs (Visual Display Units), including the system console (17), and printers for hard copies of output data are used. A local disk is used to supplement VDU visibility and to store command files and logging files. The NSCP (Network Synchronization Control Processor) (6) controls the network synchronization equipment.

The NTP (Number Translation Processor) (7) responds to number translation requests from each SCP (10), TCP (Trunk Call Processor) (12), and BCP (2), and performs ASM (B) number translation and translation of the link identifiers (IDs) accommodated in each ASM (B). The OMP (Operation and Maintenance Processor) (8) oversees the operation and maintenance functions of the system, and therefore manages the magnetic tape (MT) (15) and the disk (16) that these functions need as auxiliary storage.

The MT (15) holds charging records, statistics, and maintenance and operations management information; the disk (16) holds the generic program and data.

The RCIP (Remote Center Interface Processor) (9) controls communication with a remote operations center or the TMN (Telecommunication Management Network), and performs protocol conversion where necessary.

The SCP (Subscriber Call Processor) (10) performs call processing for ordinary subscribers using the UNI (User Network Interface) protocol. Together with the subscriber interface circuit, it performs overall traffic control such as call admission control, UPC (Usage Parameter Control), priority control, and congestion control. The SH (Signalling Handler) (11) terminates signalling information cells on the UNI/NNI (Network Node Interface) protocol and processes signalling information together with the SCP (10), TCP (12), or BCP (2).

The TCP (12) performs call processing with networks that use the NNI protocol. Together with the trunk interface circuit, it performs call/connection control for incoming and outgoing calls and handles all functions needed for interfacing with the network.

Conventionally, to trace the action taken on failed resources and their current status after a failure in a particular processor, the operator either searched through the printer paper described above page by page or used the history recording function.

In the printer-paper method, each output message has one OMD (Output Message Description) file, and OMDs are distinguished from one another by a unique integer value (the output message reference number). An OMD is a document that explains its output message to the operator in detail and records matters such as the action to be taken on the system.

In general, the sequence of failure occurrence, action, and current status for a particular resource takes anywhere from a few seconds to several hours, so tracing the output messages for each step is difficult. Because the printer is a shared device, a very large number of output messages may be printed depending on system conditions, and tracing the sequence for one resource among them inevitably wastes time and labor.

A further problem arises because the output message reference number is the same for the same kind of work. For example, during processor start-up, one message is printed as the loading of each file completes; these messages all share the same output message reference number and differ only in processor location. Because several processors can be started at the same time, tracing the output messages related to one particular processor is likewise time-consuming and labor-intensive.

The following is the command that invokes the history recording function. With this command, output messages stored on the MMCP (5) disk can be retrieved by year, date, start time, end time, output message reference number, and input/output (I/O) terminal number.

1. Format

DIS-MSG-HIS:TYPE=a[,YEAR=b][,DATE=c][,STM=d][,ETM=e][,PRN=f][,PORT=g];

2. Description

2.1 Message description

- Prints the history of output messages, by I/O device and by message type.

- The class is 3, and the groups are SMA, NM, TRKT, and SUBT.

- This message applies to the main system; it can be entered from PC, CRT, or TTY input devices.

2.2 포맷 설명2.2 Format Description

Variable  Type   Maximum arguments  Default  (Min : Max)  Enum list
a         ENUM   1                  -        -            SYSM, ALM, FLT, STS, MMC
b         year   1
c         date   1
d         time   1
e         time   1
f         SHORT  1                           (0 : 9999)
g         SHORT  1                           (0 : 19)

Here, variable a is the output message type. b is the year (YYYY); if omitted, it defaults to the current year.

c is the date (mmdd); if omitted, it defaults to today.

d is the start time (HHMMSS); if omitted, it defaults to one hour before the current time.

e is the end time (HHMMSS); if omitted, it defaults to the current time.

f is the output message reference number; if no value is given, all messages are selected.

g is the I/O device number; if no value is given, it is the I/O device from which the command was entered.

The related message is M0300.
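The command format above can be illustrated with a short Python sketch that assembles a DIS-MSG-HIS command string. The function name `build_dis_msg_his` and the decision to pass field values as already-formatted strings are assumptions made for this example; the source gives only the parameter names, the TYPE list, and the ranges of PRN and PORT.

```python
# Illustrative only: builds the command text; it does not talk to any switch.
MESSAGE_TYPES = ("SYSM", "ALM", "FLT", "STS", "MMC")  # enum list of TYPE


def build_dis_msg_his(msg_type, year=None, date=None, stm=None, etm=None,
                      prn=None, port=None):
    """Return a DIS-MSG-HIS command; omitted options fall back to defaults."""
    if msg_type not in MESSAGE_TYPES:
        raise ValueError(f"TYPE must be one of {MESSAGE_TYPES}")
    if prn is not None and not 0 <= prn <= 9999:
        raise ValueError("PRN must be in 0..9999")
    if port is not None and not 0 <= port <= 19:
        raise ValueError("PORT must be in 0..19")

    parts = [f"TYPE={msg_type}"]
    # Optional parameters are appended in the order given by the format line.
    for key, value in (("YEAR", year), ("DATE", date), ("STM", stm),
                       ("ETM", etm), ("PRN", prn), ("PORT", port)):
        if value is not None:
            parts.append(f"{key}={value}")
    return "DIS-MSG-HIS:" + ",".join(parts) + ";"
```

For instance, `build_dis_msg_his("FLT", date="0708", prn=300)` yields `DIS-MSG-HIS:TYPE=FLT,DATE=0708,PRN=300;`. Whether the real command expects zero-padded PRN values is not stated in the source, so the sketch passes integers through unchanged.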

The history recording function keeps the status messages, fault messages, alarm messages, MMC messages, and so on that are output during operation on disk for several days. To print specific messages with this function, the operator enters a command and can select output by status message, MMC output message, alarm message, or fault message.

Conventionally, most messages can be retrieved by printout reference number. However, in most cases failures of the same nature produce messages of the same form that differ only in where they occurred (the processor), so tracing the output messages related to one particular processor is again time-consuming and labor-intensive.

The following is the output message produced by the history recording command. Its printout reference number is 0300.

1. Format of M0300

<Display message history when M0300 is normal>

Type  Year  Date  STM     ETM     PORT

×××   ××××  ××××  ××××××  ××××××  ××

Result = normal

<Display message history when M0300 is abnormal>

Type  Year  Date  STM     ETM     PORT

Result = abnormal

Reason = b

2. Description

2.1 Message description: prints the history of output messages, by I/O device and by message type.

2.2 Format description

aaaa: the stored output message

Reason: given when the result is abnormal

No data

Invalid date

Start time too late

Interval out of range

The related message is C0300.

FIG. 2 shows the path along which failure, fault, and alarm messages are collected according to the present invention. Alarm messages generated in each processor within an ASM module are collected by the ASMP (1), and the ASMP (1) reports the collected fault and alarm messages to the OMP (8), which performs system fault management.

When fault information is received from an alarm source, the changed state is reflected in the state table, and it is determined whether this fault information is being output repeatedly within a certain time. If it is a duplicate within that time, message output is suppressed; if it is not a duplicate, the message is output.
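The duplicate check described above can be sketched in Python as follows. The class name `DuplicateSuppressor`, the use of a (resource, state) pair as the duplicate key, and the caller-supplied timestamps are all assumptions for illustration; the source states only that the state table is updated and that repeats within a certain time are not printed.

```python
class DuplicateSuppressor:
    """Updates a state table and decides whether a fault report is printed."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.state_table = {}     # resource -> latest reported state
        self.last_printed = {}    # (resource, state) -> time of last printout

    def receive(self, resource, state, now):
        """Record the report; return True if a message should be printed."""
        self.state_table[resource] = state          # always reflect the change
        key = (resource, state)
        last = self.last_printed.get(key)
        if last is not None and now - last < self.window_seconds:
            return False                            # duplicate inside window
        self.last_printed[key] = now
        return True
```

With a 10-second window, a report at t=0 is printed, the same report at t=5 is suppressed, and the same report at t=12 is printed again because more than 10 seconds have passed since the last printout.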

FIG. 3 is a processing flowchart according to the present invention.

First, fault, failure, and alarm information for a specific resource is received (31). For failures that occur within a certain time, a printout event number is generated dynamically for each failed resource, so that the failure occurrence, self-test result, action taken, and current status of that resource can be identified.

When the message is output, a printout event number is assigned to the event (32). The event number is an n-digit integer, giving event numbers from 0 to 10ⁿ-1. If all event numbers have been assigned, the operator is notified and new event numbers are allocated from 0 to 10ⁿ-1; if they have not all been assigned, the event number is incremented by one (33, 34, 38). The integer n is a variable that depends on the message output volume and period of the system.

When an event occurs, it is assigned an event number, and the output messages for its test status and the action taken carry the same event number. When the failure or fault is recovered, the recovery message closes the event and the event is removed from the event list.

When all event numbers have been assigned and new event numbers are allocated, the operator is notified (33, 34), and the current system load is checked (35) to determine whether the system is busy with calls (36). If it is busy, the unresolved failures are printed, the operator is asked to take action, and processing ends (37). If it is not busy, an audit of the remaining event numbers is requested and processing ends (39); this covers the case where an event number remains even though the resource has recovered, because of a software defect or a lost signal.
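The flow of steps 31 to 39 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the patented implementation: the class and method names, the `is_busy` callback standing in for the load check, the operator log kept as a list of strings, and the rule of skipping numbers still held by unresolved events after a restart are all introduced here for the example.

```python
class EventNumberManager:
    """Hands out printout event numbers 0 .. 10**n - 1, one per open event."""

    def __init__(self, n, is_busy):
        self.limit = 10 ** n        # numbers run from 0 to 10**n - 1
        self.next_number = 0
        self.open_events = {}       # resource -> event number (unresolved)
        self.is_busy = is_busy      # hypothetical load probe (steps 35, 36)
        self.operator_log = []      # stands in for operator printouts

    def receive(self, resource):
        """Steps 31-32: return the event number for a report on `resource`."""
        if resource in self.open_events:        # follow-up message, same number
            return self.open_events[resource]
        in_use = set(self.open_events.values())
        for _ in range(self.limit):
            if self.next_number == self.limit:  # step 33: all numbers used
                self._restart_numbering()       # steps 34-37 and 39
            candidate = self.next_number
            self.next_number += 1               # step 38
            if candidate not in in_use:         # assumption: skip open numbers
                self.open_events[resource] = candidate
                return candidate
        raise RuntimeError("every event number is held by an unresolved event")

    def resolve(self, resource):
        """Remove the event from the list once the recovery message is out."""
        self.open_events.pop(resource, None)

    def _restart_numbering(self):
        self.operator_log.append("event numbers exhausted; restarting at 0")
        self.next_number = 0
        if self.is_busy():
            unresolved = sorted(self.open_events.items(), key=lambda kv: kv[1])
            self.operator_log.append(f"unresolved events, action required: {unresolved}")
        else:
            self.operator_log.append("audit requested for remaining event numbers")
```

With `n = 1`, ten resources receive numbers 0 to 9; a later report for one of them returns its existing number, and an eleventh resource triggers the restart branch, which either lists the unresolved events or requests an audit depending on what `is_busy()` returns.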

In addition, when information is received from an alarm source, it must be possible to confirm the reported state, to start a test routine that determines the details of the fault, or to send the information to the operator in visible and audible form, and it must be possible to transfer the state to the TMN.

System messages are classified as follows.

MMC message: the output that results from a command the operator entered to check the state of the system.

Status message: a message from a software (S/W) block that, while a function is running, informs the operator of the current progress of that function.

Fault message: a message that informs the operator of the occurrence of a fault detected by a fault source itself or by the software (S/W) block that manages the fault source; it refers to a transient fault.

When a fault is detected, the fault location is output in as much detail as possible (line and channel, card, device and shelf, rack, row).

The row, rack, shelf, and card positions in each system do not have symbolic names; they are given as numeric constants, as shown below.

Processor (equipment) name  Row  Rack  Shelf  Card (device)

×××                         1    3     3      2

The frequency of detected faults is measured for each piece of equipment, and when it exceeds a threshold (which may differ according to the characteristics of the equipment), the fault is escalated to an alarm message.
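As a rough Python sketch of this escalation rule, the class below counts faults per piece of equipment and reports "ALARM" once the count exceeds that equipment's threshold. The names, the simple running count (with no time window), and the default threshold of 3 are assumptions for the example; the source does not say how frequency is measured or what the thresholds are.

```python
from collections import Counter


class FaultEscalator:
    """Counts faults per equipment and escalates to an alarm past a threshold."""

    def __init__(self, thresholds, default_threshold=3):
        self.thresholds = thresholds              # equipment -> threshold
        self.default_threshold = default_threshold
        self.counts = Counter()

    def record(self, equipment):
        """Register one fault; return 'ALARM' if the threshold is exceeded."""
        self.counts[equipment] += 1
        limit = self.thresholds.get(equipment, self.default_threshold)
        return "ALARM" if self.counts[equipment] > limit else "FAULT"
```

With a threshold of 2 for a given piece of equipment, its first two faults are reported as "FAULT" and the third as "ALARM".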

Alarm message: a message with visible and audible indication, informing the operator that the fault source is in a state that requires immediate action. An alarm message arises either when a repeated fault message is escalated or directly from the fault source itself, and it is classified by severity as Minor, Major, or Critical. The output message is structured as follows.

Exxxxxx  arbitrary message  (××××)

Here "xxxxxx" after the leading E is the event number, and "××××" in parentheses is the output message reference number.
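The layout above can be illustrated with a small Python sketch that formats and parses such a line. The six-digit event number width and the four-digit reference number width are read off the placeholder shown and are assumptions, as are the function names and the single-space separators.

```python
import re

# E<event number> <free text> (<output message reference number>)
_MESSAGE_PATTERN = re.compile(r"^E(\d+) (.*) \((\d+)\)$")


def format_output_message(event_number, text, reference_number, event_digits=6):
    """Render one output line with a zero-padded event number."""
    return f"E{event_number:0{event_digits}d} {text} ({reference_number:04d})"


def parse_output_message(line):
    """Return (event_number, text, reference_number) from an output line."""
    match = _MESSAGE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not an event-numbered message: {line!r}")
    return int(match.group(1)), match.group(2), int(match.group(3))
```

Because every message about one failure carries the same event number, an operator tool could group parsed lines by the first element of the returned tuple to reconstruct the history of a single resource.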

As described above, the present invention consists of a processor fault management function, which collects the failure, fault, and alarm messages generated by resources, and a system fault management function, which is the central management module. When a failure occurs, an event number is allocated by the system fault management function and each state transition of the resource is managed under it, so that the resource's failure, self-recovery request, and current status are all managed under the same event number. This makes maintenance easier for the operator and programming easier for the developer, improves system reliability, and is easy to implement.

Claims

A failure management method using an event number for each failed resource, applied to a distributed processing system, the method comprising the steps of: determining a natural number n according to the message output volume and period of the system, and allocating output message event numbers from 0 to 10ⁿ-1 for received events; when event (fault, failure, alarm) information for a specific resource is received and the event numbers 0 to 10ⁿ-1 have not all been assigned, dynamically assigning one of 0 to 10ⁿ-1 as the event number for the received event of the specific resource; when event information for a specific resource is received and the event numbers 0 to 10ⁿ-1 have all been assigned, allocating new event numbers from 0 to 10ⁿ-1 and notifying the operator; deleting the assigned event number when the action on the received event of the specific resource is complete; after allocating the new event numbers, checking the load of the system to determine whether it is busy with calls; and, if it is busy, outputting the unresolved failures and requesting action from the operator, and, if it is not busy, requesting an audit of the remaining event numbers.