KR100939352B1

KR100939352B1 - Service failure monitoring device and method

Info

Publication number: KR100939352B1
Application number: KR1020070104331A
Authority: KR
Inventors: 김범수; 황찬규; 유재형
Original assignee: 주식회사 케이티
Priority date: 2007-10-17
Filing date: 2007-10-17
Publication date: 2010-01-29
Anticipated expiration: 2027-10-17
Also published as: KR20090038982A

Abstract

The present invention relates to a service failure monitoring apparatus and method.

The present invention provides the steps of: (a) selecting failure management objects related to service operation from a database and registering them in a topology map; (b) setting the failure impact attributes (a failure impact determination criterion and a failure impact search range) of each failure management object; (c) selecting and setting a failure impact transmission path and a failure impact reception path for transmitting and receiving the failure impact determination results that the failure management objects generate from the failure impact attributes; (d) generating the failure impact of the failure management objects related to service operation when a failure occurs in a unit object; (e) outputting the failure location and failure impact of the failure management objects related to service operation on the topology map; and (f) searching for the failure reception paths and failure transmission paths of the servers and unit objects involved in a failure of the service.

By making it possible to clearly identify the status of failures occurring in server components, together with their scope of propagation and their impact, the present invention can be expected to allow service failure management tasks to be handled accurately and quickly.

Service management, failure management, failure monitoring, failure impact, server management

Description

Service failure monitoring device and method {Method and Apparatus for Monitoring Service Fault}

The present invention relates to a service failure monitoring apparatus and method, and more particularly to a service failure monitoring apparatus and method based on the definition of failure propagation paths.

In conventional server failure management, failures are managed on a per-server basis: the local failure location and cause are monitored through each server's failure events and error logs, and the failures occurring on each server are analyzed and recovered individually.

With this conventional approach, when a failure occurs in a component of a particular server, it is difficult to determine which services or servers are affected by that failure.

It is also difficult to determine how strongly a failure in an individual server component affects those services or servers, and whether a failure of an individual service or server originated from an internal or an external cause.

As a result, conventional server failure management leaves server errors that are critical to service operation unattended and wastes considerable time analyzing the location and cause of the server components that triggered a service failure. It is also difficult to determine the scope of propagation and the impact of a failure in a server component, and to identify how to respond to and remedy it.

To solve these problems, the present invention provides a service failure monitoring apparatus and method based on defining the failure propagation paths of failure management objects.

A service failure monitoring method according to a feature of the present invention for achieving this technical object includes: (a) selecting failure management objects related to service operation (the failure management objects including a service, a server, and a unit object) from a database storing the failure management objects and registering them in a topology map; (b) setting a failure impact attribute (the failure impact attribute including a failure impact determination criterion and a failure impact search range) for each of the service, the server, and the unit object; and (c) selecting and setting a failure impact transmission path and a failure impact reception path for transmitting and receiving, between the unit object and the server and between the server and the service, the failure impact determination result of each of the service, the server, and the unit object generated using the failure impact attribute.

A service failure monitoring method according to a feature of the present invention includes: (a) registering in a topology map a service and a plurality of servers related to the operation of the service, where each server compares the status values of the unit objects that raised a failure with preset thresholds, generates a server failure impact determination result based on a first failure determination criterion, and reflects it as the failure status of that server; and (b) generating a service failure impact determination result from the failure impact determination results of the plurality of servers, based on a second failure determination criterion and a failure impact search range, whereby a failure propagation path is established among the service, the plurality of servers, and the unit objects of each server.

A service failure monitoring apparatus according to a feature of the present invention includes: a failure management object registration unit that selects failure management objects related to service operation (the failure management objects including a service, a server, and a unit object) from a database storing the failure management objects and registers them in a topology map; an attribute setting unit that sets a failure impact attribute of the registered failure management objects (the failure impact attribute including a first failure impact determination criterion, a second failure impact determination criterion, and a failure impact search range); and a failure impact path selection unit that selects and sets a failure impact transmission path and a failure impact reception path for transmitting and receiving, between the unit object and the server and between the server and the service, the failure impact determination results generated by the failure management objects using the failure impact attribute.

With the above configuration, the present invention can be expected to prevent and remedy in advance server errors that are critical to service operation.

The present invention can also be expected to make it simple to analyze the location and cause of the server components that caused a service failure.

The present invention can further be expected to improve the accuracy of the response to, and remediation of, failures occurring in server components.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here. In the drawings, parts unrelated to the description are omitted in order to describe the invention clearly, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them. In addition, terms such as "... unit", "... device", "module", and "block" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

The failure management objects related to service operation according to an embodiment of the present invention are classified into services, server groups, servers, unit object groups, and unit objects.

Services include Internet Protocol Television (hereinafter 'IPTV'), Voice over Internet Protocol (hereinafter 'VoIP'), Dynamic Host Configuration Protocol (hereinafter 'DHCP'), Domain Name Service (hereinafter 'DNS'), and the like, and may be organized into upper-level services and lower-level services.

Server groups include an IPTV server farm, a VoIP server farm, a DHCP server farm, a DNS server farm, and the like.

Servers include Server Management Servers (SMS), DB Management Servers (DBMS), Network Management Servers (NMS), Backup Management Servers (BMS), Storage Management Servers (STMS), and the like.

Servers may also be divided into basic servers, which are related to the operation of the service, and external servers, which are associated with other services or systems.

A unit object group encompasses unit objects such as a Central Processing Unit (CPU), memory, disk, Network Interface Card (NIC), and application process. A unit object group may denote a set of different unit objects, but it may also denote a group of unit objects of the same kind (for example, CPU 1, CPU 2, CPU 3, and so on).

FIG. 1 is a block diagram schematically showing the internal configuration of a service failure monitoring apparatus 100 according to an embodiment of the present invention.

The service failure monitoring apparatus 100 according to an embodiment of the present invention includes a failure management object registration unit 110, an attribute setting unit 120, a failure impact path selection unit 130, a failure impact generation unit 140, a failure impact transmission unit 150, a failure impact output unit 160, and a failure impact propagation path search unit 170.

The failure management object registration unit 110 selects the failure management objects related to service operation (services, server groups, servers, unit object groups, and unit objects) from a database (not shown) that stores the failure management objects, and registers them in the topology map. For example, a server that manages the Daejeon e-mail service is looked up in the database as a failure management object, dragged with the mouse, and placed on the topology map.

The attribute setting unit 120 sets the failure impact attributes of the failure management objects registered in the topology map. The failure impact attributes include a failure impact determination criterion, a failure impact determination type, thresholds against which the status value of each unit object and each server is compared, and a failure impact search range.

That is, the attribute setting unit 120 sets the failure impact determination criterion, failure impact determination type, thresholds, and failure impact search range for services, servers, and unit object groups. The failure impact determination types include 'fatal', 'urgent', 'danger', 'caution', 'harmless', 'undetermined', and 'normal', and a subset of them may be selected based on the status, events, and attributes of the failure management object. The failure impact determination criterion and the failure impact search range are described in detail below.

The thresholds and failure impact determination types may be set differently depending on the form in which a failure management object raises a failure, such as utilization rate, usage rate, or on/off state.

For convenience of explanation, the failure impact determination types in this embodiment are limited to 'fatal', 'urgent', 'danger', 'caution', and 'normal'. Further types may of course be added as needed.

The failure impact path selection unit 130 selects and sets the failure impact transmission paths and reception paths used to transmit and receive the failure impact determination results generated by the failure management objects, between unit objects and servers and between servers and services. This allows the embodiment to search for failure propagation paths between failure management objects.

The failure impact generation unit 140 receives the status value of each unit object in a unit object group and generates a first failure impact by comparing each unit object's status value against the thresholds set for that unit object.
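The threshold comparison can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `classify` is hypothetical, the thresholds 90/80/70/60 are borrowed from the example values used later in the description, and mapping the unspecified band between 60 and 70 to 'caution' is an assumption.

```python
# Example thresholds (lower bound, determination type), taken from the
# 90 (fatal) / 80 (urgent) / 70 (danger) / 60 or less (normal) example.
THRESHOLDS = [(90, "fatal"), (80, "urgent"), (70, "danger")]
NORMAL_MAX = 60  # status values at or below this are 'normal'


def classify(status_value):
    """Map a numeric status value (e.g. a utilization rate) to a determination type."""
    for lower_bound, label in THRESHOLDS:
        if status_value >= lower_bound:
            return label
    # ASSUMPTION: the description does not say how values between 60 and 70
    # are labeled; this sketch treats them as 'caution'.
    return "normal" if status_value <= NORMAL_MAX else "caution"
```

A status value of 90 therefore yields 'fatal' and a value of 60 yields 'normal', which agrees with the determinations given for the servers in FIG. 2.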

The failure impact generation unit 140 also generates a server failure impact determination result from the first failure impacts generated for the unit objects, based on a first failure determination criterion, and generates a service failure impact determination result from the server failure impact determination results of the plurality of servers, based on a second failure determination criterion and the failure impact search range.

The failure impact transmission unit 150 transmits the failure impact determination result generated by a lower-level failure management object to the upper-level failure management object (unit object group => server, server => service).

The failure impact output unit 160 outputs, through the topology map, the failure status, failure impact, and failure impact determination results of the service, the servers related to service operation, and each unit object group.

The failure impact propagation path search unit 170 searches for the failure propagation path among services, servers, and unit objects when a failure management object on a failure impact transmission/reception path is selected.
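One way to picture the path search is as a traversal over the configured transmission and reception paths. The sketch below is illustrative only: the class `PropagationMap`, its method names, and the object names are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


class PropagationMap:
    """Hypothetical store of failure impact transmission/reception paths."""

    def __init__(self):
        self.sends_to = defaultdict(list)       # lower-level object -> upper-level objects
        self.receives_from = defaultdict(list)  # upper-level object -> lower-level objects

    def add_path(self, lower, upper):
        """Register a transmission path lower -> upper (and the matching reception path)."""
        self.sends_to[lower].append(upper)
        self.receives_from[upper].append(lower)

    @staticmethod
    def _walk(edges, start):
        # Breadth-first traversal that returns every object reachable from start.
        seen, queue = [], list(edges.get(start, []))
        while queue:
            node = queue.pop(0)
            if node not in seen:
                seen.append(node)
                queue.extend(edges.get(node, []))
        return seen

    def upstream(self, obj):
        """Objects that receive the failure impact of obj, directly or indirectly."""
        return self._walk(self.sends_to, obj)

    def downstream(self, obj):
        """Objects whose failure impact reaches obj, directly or indirectly."""
        return self._walk(self.receives_from, obj)


topology = PropagationMap()
topology.add_path("server1.cpu", "server1")
topology.add_path("server1", "service")
topology.add_path("server2", "service")
```

Selecting `"server1.cpu"` and calling `topology.upstream(...)` follows the transmission paths toward the service, while `topology.downstream("service")` follows the reception paths back toward the unit objects.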

Among the failure impact attributes described above, the failure impact determination criterion is one of 'average', 'quorum', 'maximum', and 'condition'.

Under the 'average' criterion, the status values of the failure management objects are compared with the preset thresholds, and the average of the status values of the objects that raised a failure determines the failure status of the failure management object concerned.

Under the 'quorum' criterion, the status values of the failure management objects are compared with the preset thresholds, and the failure management object concerned is judged to be in a failure state when the number of objects that raised a failure exceeds a defined quorum.

Under the 'maximum' criterion, the status values of the failure management objects are compared with the preset thresholds, and the maximum of the status values of the objects that raised a failure determines the failure status of the failure management object concerned.

Under the 'condition' criterion, the status values of the failure management objects are compared with the preset thresholds, and the failure management object concerned is judged to be in a failure state when the status values of the objects that raised a failure exceed a condition value.
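The four criteria can be sketched as a single aggregation function. The description leaves several details open, so the following are assumptions of this sketch and not statements of the patent: an object counts as having raised a failure when its status value reaches the lowest example threshold (70), the 'quorum' and 'condition' criteria report the worst failing value once their test passes, and 'condition' passes when any failing value exceeds the condition value. The names `classify` and `judge` are hypothetical.

```python
from statistics import mean


def classify(value):
    """Example thresholds: 90 fatal, 80 urgent, 70 danger, otherwise normal."""
    if value >= 90:
        return "fatal"
    if value >= 80:
        return "urgent"
    if value >= 70:
        return "danger"
    return "normal"


def judge(status_values, criterion, quorum=0, condition=0):
    """Aggregate lower-level status values into one determination result."""
    # Only objects that raised a failure take part in the aggregation.
    failing = [v for v in status_values if classify(v) != "normal"]
    if not failing:
        return "normal"
    if criterion == "average":
        return classify(mean(failing))
    if criterion == "maximum":
        return classify(max(failing))
    if criterion == "quorum":
        # ASSUMPTION: report the worst failing value once the quorum is exceeded.
        return classify(max(failing)) if len(failing) > quorum else "normal"
    if criterion == "condition":
        # ASSUMPTION: the test passes when any failing value exceeds the condition value.
        return classify(max(failing)) if max(failing) > condition else "normal"
    raise ValueError(f"unknown criterion: {criterion}")
```

For the status values 90, 80, 50, 50, the 'maximum' criterion gives 'fatal', the 'average' criterion gives 'urgent' (the mean of the two failing values is 85), and the 'quorum' criterion with a quorum of 2 gives 'normal' because only two objects failed.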

Among the failure impact attributes described above, the failure impact search range is one of internal failure impact first, external failure impact first, and mixed failure impact.

Internal failure impact first is a determination method in which the failure impact determination results generated by the internal components that raised a failure take priority in deciding the failure status of the upper-level failure management object. Here, internal components are failure management objects related to the operation of the service, namely basic servers or the unit objects of basic servers.

External failure impact first is a determination method in which the failure impact determination results generated by the external components that raised a failure take priority in deciding the failure status of the upper-level failure management object. Here, external components are external servers of other services that are not directly related to the operation of the service, or the unit objects of such external servers.

Mixed failure impact is a determination method in which the failure impact determination results generated by both the internal and the external components that raised a failure are combined to decide the failure status of the upper-level failure management object.
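The three search ranges can be read as a filter applied before the determination criterion. The sketch below is one possible interpretation and rests on assumptions: "priority" is taken to mean that the preferred side is used whenever it contains a failure and the other side is consulted only otherwise, and the names `select_results` and `worst` are hypothetical.

```python
# Determination types ordered from least to most severe.
SEVERITY_ORDER = ["normal", "caution", "danger", "urgent", "fatal"]


def select_results(results, search_range):
    """Pick the lower-level results that feed the upper-level determination.

    results is a list of (determination, origin) pairs, where origin is
    'internal' (basic server) or 'external' (external server).
    """
    failing = [(sev, origin) for sev, origin in results if sev != "normal"]
    internal = [sev for sev, origin in failing if origin == "internal"]
    external = [sev for sev, origin in failing if origin == "external"]
    if search_range == "internal_first":
        # ASSUMPTION: fall back to external results only if no internal failure exists.
        return internal or external
    if search_range == "external_first":
        return external or internal
    if search_range == "mixed":
        return internal + external
    raise ValueError(f"unknown search range: {search_range}")


def worst(severities):
    """Most severe determination in the list, or 'normal' if the list is empty."""
    return max(severities, key=SEVERITY_ORDER.index, default="normal")
```

With two failing basic servers ('fatal', 'urgent') and one failing external server ('danger'), `internal_first` keeps only the two internal results, `external_first` keeps only 'danger', and `mixed` keeps all three.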

FIG. 2 illustrates the process of setting the service failure impact propagation path for the servers related to service operation when a failure occurs in a unit object, according to a first embodiment of the present invention.

The failure impact generation unit 140 receives the status value of each unit object from server 1, server 2, and server 3, and determines the failure impact by comparing each received status value against the preset thresholds of the corresponding unit object.

As shown in FIG. 2, the CPU, memory, disk, network interface card, and application process of server 1 have status values of 90, 60, 50, 50, and Normal, respectively. The thresholds of each unit object are set to 90 (fatal) / 80 (urgent) / 70 (danger) / 60 or less (normal), and the thresholds of the application process are set to Down / Wait.

Accordingly, the failure impact generation unit 140 sets the failure impact determination result of server 1 to 'fatal' under the 'maximum' failure impact determination criterion.

The CPU, memory, disk, network interface card, and application process of server 2 have status values of 50, 80, 30, 20, and Normal, respectively. The thresholds of each unit object are set to 90 (fatal) / 80 (urgent) / 70 (danger) / 60 or less (normal), and the thresholds of the application process are set to Down / Wait.

Accordingly, the failure impact generation unit 140 sets the failure impact determination result of server 2 to 'urgent' under the 'maximum' failure impact determination criterion.

The CPU, memory, disk, network interface card, and application process of server 3 have status values of 10, 30, 20, 70, and Normal, respectively. The thresholds of each unit object are set to 90 (fatal) / 80 (urgent) / 70 (danger) / 60 or less (normal), and the thresholds of the application process are set to Down / Wait.

Accordingly, the failure impact generation unit 140 sets the failure impact determination result of server 3 to 'danger' under the 'maximum' failure impact determination criterion.

The failure impact transmission unit 150 transmits the failure impact determination results generated for server 1, server 2, and server 3 ('fatal', 'urgent', and 'danger') to the service.

The failure impact generation unit 140 evaluates the failure impact determination results of server 1, server 2, and server 3 received by the service, together with the preset threshold of the service ('caution'), based on the failure impact determination criterion ('maximum') and the failure impact search range ('internal failure impact first'), and generates the failure impact determination result 'fatal'. As a result, the service ('fatal'), server 1 ('fatal'), the CPU of server 1 ('fatal'), server 2 ('urgent'), the memory of server 2 ('urgent'), server 3 ('danger'), and the network interface card of server 3 ('danger') are set, which establishes the failure propagation path among the service, the servers, and the unit objects.
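The numbers of the first embodiment can be replayed with a short sketch. The status values and thresholds are those given above; the data layout and the names `classify`, `server_result`, and `service_result` are hypothetical, the application process (status Normal) is left out, and values below 70 are simply treated as 'normal' because no example value falls between 60 and 70.

```python
THRESHOLDS = [(90, "fatal"), (80, "urgent"), (70, "danger")]
ORDER = ["normal", "danger", "urgent", "fatal"]  # least to most severe

# Status values of the unit objects in FIG. 2.
SERVERS = {
    "server1": {"cpu": 90, "memory": 60, "disk": 50, "nic": 50},
    "server2": {"cpu": 50, "memory": 80, "disk": 30, "nic": 20},
    "server3": {"cpu": 10, "memory": 30, "disk": 20, "nic": 70},
}


def classify(value):
    for lower_bound, label in THRESHOLDS:
        if value >= lower_bound:
            return label
    return "normal"


def server_result(units):
    """'maximum' criterion: the unit object with the highest status value decides."""
    unit, value = max(units.items(), key=lambda item: item[1])
    return classify(value), unit


def service_result(servers):
    """'maximum' criterion over the server results (all servers are internal here)."""
    per_server = {name: server_result(units) for name, units in servers.items()}
    service = max((sev for sev, _ in per_server.values()), key=ORDER.index)
    return service, per_server


service, per_server = service_result(SERVERS)
```

Here `service` is 'fatal', and `per_server` records the unit object behind each server result: the CPU for server 1 ('fatal'), the memory for server 2 ('urgent'), and the network interface card for server 3 ('danger'). These are the same objects that the description lists on the propagation path.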

FIG. 3 illustrates the process of setting the service failure impact propagation path for the servers related to service operation when a failure occurs in a unit object, according to a second embodiment of the present invention.

As shown in FIG. 3, the failure impact determination criteria and failure impact determination results of server 1, server 2, and server 3 are the same as in the first embodiment of FIG. 2, so the description is not repeated.

In FIG. 3, server 1, server 2, and server 3 are basic servers related to the operation of the service, and server 4 is an external server associated with another service or system that is not directly related to the service.

The CPU, memory, disk, network interface card, and application process of server 4 have status values of 10, 30, 20, 20, and Normal, respectively. The thresholds of each unit object are set to 90 (fatal) / 80 (urgent) / 70 (danger) / 60 or less (normal), and the thresholds of the application process are set to Down / Wait.

Accordingly, the failure impact generation unit 140 sets the failure impact determination result of server 4 to 'normal' under the 'maximum' failure impact determination criterion.

The failure impact transmission unit 150 transmits the failure impact determination results generated for server 1, server 2, server 3, and server 4 ('fatal', 'urgent', 'danger', and 'normal') to the service.

The failure impact generation unit 140 evaluates the failure impact determination results of server 1, server 2, server 3, and server 4 received by the service, together with the preset threshold of the service ('normal'), based on the 'quorum' failure impact determination criterion and the failure impact search range ('mixed failure impact'), and generates the failure impact determination result 'normal'. Accordingly, the service ('fatal'), server 1 ('fatal'), the CPU of server 1 ('fatal'), server 2 ('urgent'), the memory of server 2 ('urgent'), server 3 ('danger'), the network interface card of server 3 ('danger'), and server 4 ('normal') are set, which establishes the failure propagation path among the service, the servers, and the unit objects.

도 4는 본 발명의 제3 실시예에 따른 단위 객체에 장애 발생시 서비스 운용에 관련된 서버들의 서비스 장애 영향도 전파 경로를 설정하는 과정을 설명하기 위한 도면이다.FIG. 4 is a diagram illustrating a process of setting a service failure impact propagation path of servers related to service operation when a failure occurs in a unit object according to a third embodiment of the present invention.

중앙 처리 장치 그룹의 중앙 처리 장치 1, 중앙 처리 장치 2, 중앙 처리 장치 3, 중앙 처리 장치 4, 중앙 처리 장치 5는 상태값이 90, 60, 50, 50, 50이고, 각 단위 객체의 임계값이 90(치명)/80(긴급)/70(위험)/60 이하(정상)로 설정한다.The central processing unit 1, central processing unit 2, central processing unit 3, central processing unit 4, and central processing unit 5 of the central processing unit group have status values of 90, 60, 50, 50, and 50, and the threshold values of each unit object. Set to 90 (fatal) / 80 (urgent) / 70 (danger) / 60 or less (normal).

따라서, 장애 영향도 생성부(140)는 중앙 처리 장치 그룹의 장애 영향도 판단 기준('정족수')에 따라 장애 영향도 판단 결과를 '정상'으로 설정한다.Accordingly, the failure impact generation unit 140 sets the failure impact determination result to 'normal' according to the failure impact determination criterion ('quorum') of the CPU group.

Memories 1 to 5 of the memory group have state values of 50, 80, 30, 20, and 20, respectively, and the thresholds of each unit object are set to 90 ('fatal') / 80 ('urgent') / 70 ('danger') / 60 or below ('normal').

Accordingly, the failure impact generation unit 140 sets the failure impact determination result of the memory group to 'normal' in accordance with the group's failure impact determination criterion ('quorum').

Hard disks 1 to 5 of the hard disk group have state values of 10, 30, 20, 70, and 40, respectively, and the thresholds of each unit object are set to 90 ('fatal') / 80 ('urgent') / 70 ('danger') / 60 or below ('normal').

Accordingly, the failure impact generation unit 140 sets the failure impact determination result of the hard disk group to 'normal' in accordance with the group's failure impact determination criterion ('quorum').

Network interface cards 1 to 5 of the network interface group have state values of 10, 30, 20, 20, and 40, respectively, and the thresholds of each unit object are set to 90 ('fatal') / 80 ('urgent') / 70 ('danger') / 60 or below ('normal').

Accordingly, the failure impact generation unit 140 sets the failure impact determination result of the network interface group to 'normal' in accordance with the group's failure impact determination criterion ('quorum').
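
The four group-level determinations above can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names are hypothetical, and three points are assumptions because this passage does not specify them: state values between 60 and 70 are treated as 'normal', the quorum is taken to be half the group size, and when the quorum is exceeded the most severe failing level is reported.

```python
# Severity labels in ascending order, as used in the embodiment.
SEVERITIES = ["normal", "danger", "urgent", "fatal"]


def classify(state_value):
    """Map a unit object's state value to a severity label.

    Thresholds follow the text: 90 fatal / 80 urgent / 70 danger / 60 or below normal.
    Assumption: values between 60 and 70 also count as 'normal'.
    """
    if state_value >= 90:
        return "fatal"
    if state_value >= 80:
        return "urgent"
    if state_value >= 70:
        return "danger"
    return "normal"


def quorum_result(state_values, quorum=None):
    """Hypothetical quorum criterion for one group of unit objects.

    A failure is reflected in the group only when the number of failing
    unit objects exceeds the quorum (assumed default: half the group size).
    """
    levels = [classify(v) for v in state_values]
    failing = [level for level in levels if level != "normal"]
    if quorum is None:
        quorum = len(levels) / 2
    if len(failing) > quorum:
        # Assumption: report the most severe level among the failing objects.
        return max(failing, key=SEVERITIES.index)
    return "normal"


# State values taken from the third embodiment.
groups = {
    "cpu_group": [90, 60, 50, 50, 50],
    "memory_group": [50, 80, 30, 20, 20],
    "hard_disk_group": [10, 30, 20, 70, 40],
    "network_interface_group": [10, 30, 20, 20, 40],
}
group_results = {name: quorum_result(values) for name, values in groups.items()}
print(group_results)
```

Under these assumptions every group evaluates to 'normal', because at most one of the five unit objects in any group is in a failing state.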

The failure impact transmission unit 150 transmits the failure impact determination results generated for the central processing unit group, the memory group, the hard disk group, and the network interface group ('normal', 'normal', 'normal', and 'normal') to the service.

The failure impact generation unit 140 evaluates the failure impact determination results of the central processing unit group, the memory group, the hard disk group, and the network interface group received by the service, together with the preset threshold of the service ('normal'), on the basis of the failure impact determination criterion ('maximum value') and the failure impact search range ('reflect internal failure impact first'), and generates a failure impact determination result ('normal').
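
As a rough illustration of the 'maximum value' criterion at the service level, the following Python sketch takes the most severe of the received group results. The names are hypothetical, and the sketch deliberately leaves out the service threshold and the search range, whose exact roles this passage does not spell out.

```python
# Rank of each severity label; a higher rank means a more severe state.
SEVERITY_RANK = {"normal": 0, "danger": 1, "urgent": 2, "fatal": 3}


def max_value_criterion(results):
    """Return the most severe label among the received determination results."""
    return max(results, key=SEVERITY_RANK.__getitem__)


# Group results received by the service in the third embodiment.
received = {
    "cpu_group": "normal",
    "memory_group": "normal",
    "hard_disk_group": "normal",
    "network_interface_group": "normal",
}
service_result = max_value_criterion(received.values())
print(service_result)  # normal
```

Because all four groups report 'normal', the maximum is also 'normal', which matches the service-level result stated above.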

Accordingly, the service ('normal'), the central processing unit group ('normal'), central processing unit 1 ('fatal'), the memory group ('normal'), memory 2 ('urgent'), the hard disk group ('normal'), hard disk 4 ('danger'), and the network interface card group ('normal') are set, which establishes the failure propagation path among the service, the servers, and the unit objects.
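
One possible way to represent the resulting propagation path is a set of child-to-parent links with a state attached to each object. The Python sketch below is only an illustration under that assumption. The identifiers and the data layout are hypothetical, and the states are the ones listed for the third embodiment.

```python
# Child -> parent links between failure management objects (hypothetical names).
PARENT = {
    "cpu_1": "cpu_group",
    "memory_2": "memory_group",
    "hard_disk_4": "hard_disk_group",
    "cpu_group": "service",
    "memory_group": "service",
    "hard_disk_group": "service",
    "nic_group": "service",
}

# Failure state of each object, as listed in the third embodiment.
STATE = {
    "service": "normal",
    "cpu_group": "normal",
    "cpu_1": "fatal",
    "memory_group": "normal",
    "memory_2": "urgent",
    "hard_disk_group": "normal",
    "hard_disk_4": "danger",
    "nic_group": "normal",
}


def propagation_path(node):
    """Walk from a failure management object up to the service,
    collecting (object, state) pairs along the way."""
    path = [(node, STATE[node])]
    while node in PARENT:
        node = PARENT[node]
        path.append((node, STATE[node]))
    return path


for unit in ("cpu_1", "memory_2", "hard_disk_4"):
    print(" -> ".join(f"{name}({state})" for name, state in propagation_path(unit)))
```

Each printed line traces one failing unit object through its group up to the service, showing that the unit-level failure is absorbed at the group level in this embodiment.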

The embodiments of the present invention described above are not implemented only through an apparatus and/or a method. They may also be implemented through a program that realizes functions corresponding to the configuration of the embodiments, or through a recording medium on which such a program is recorded. A person skilled in the art to which the present invention pertains can readily produce such an implementation from the description of the embodiments above.

Although the embodiments of the present invention have been described in detail above, the scope of rights of the present invention is not limited to them. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of rights of the present invention.

FIG. 1 is a block diagram schematically showing the internal configuration of a service failure monitoring apparatus according to an embodiment of the present invention.

Claims

A service failure monitoring method comprising:

(a) selecting failure management objects related to service operation, the failure management objects including a service, a server, and a unit object;

(b) setting a failure impact attribute for each of the service, the server, and the unit object, the failure impact attribute including a failure impact determination criterion and a failure impact search range; and

(c) setting a failure impact propagation path among the service, a plurality of servers related to the operation of the service, and the unit objects of each server by generating a failure impact determination result for each of the service, the server, and the unit object on the basis of the failure impact attribute.

The service failure monitoring method of claim 1, further comprising, after step (c):

(d) comparing the state values of the failure management object in which the failure occurred with a preset threshold, and then generating a failure impact determination result for each of the service, the server, and the unit object on the basis of the failure impact determination criterion and the failure impact search range; and

(e) setting a failure propagation path by transmitting the generated failure impact determination results of the service, the server, and the unit object from the unit object to the server and from the server to the service through the failure impact transmission path and the failure impact reception path.

The service failure monitoring method of claim 2, further comprising, after step (e):

outputting, through the topology map, the failure state, failure location, and failure impact of the failure management object, together with the generated failure impact determination results of the service, the server, and the unit object; and

searching for the failure propagation path among the service, the server, and the unit object by selecting a failure management object on the failure impact transmission path and the failure impact reception path.

The service failure monitoring method of any one of claims 1 to 3, wherein the failure impact determination criterion is any one selected from the following criteria, applied after the state values of the failure management object are compared with the preset threshold:

an average value impact determination criterion that reflects the average of the state values of the failure management objects in which the failure occurred as the failure state of the corresponding failure management object;

a quorum impact determination criterion that reflects the failure in the failure state of the corresponding failure management object when the failure management objects in which the failure occurred exceed a defined quorum;

a maximum value impact determination criterion that reflects the maximum of the state values of the failure management objects in which the failure occurred as the failure state of the corresponding failure management object; and

a condition value impact determination criterion that reflects the failure in the failure state of the corresponding failure management object when the state values of the failure management objects in which the failure occurred exceed a condition value.

The service failure monitoring method of any one of claims 1 to 3, wherein the failure impact search range is any one of:

'reflect internal failure impact first', which determines the failure state of the higher-level failure management object by giving priority to the failure impact of the internal component in which the failure occurred, the internal component being a failure management object related to the operation of the service and including a plurality of primary servers or the respective unit objects of the plurality of primary servers;

'reflect external failure impact first', which determines the failure state of the higher-level failure management object by giving priority to the failure impact of the external component in which the failure occurred, the external component being a failure management object of an external service not directly related to the service and including a plurality of external servers or the respective unit objects of the plurality of external servers; and

'reflect mixed failure impact', which determines the failure impact generated by the internal components and the external components as the failure state of the higher-level failure management object.

A service failure monitoring method comprising:

(a) for a plurality of servers related to the operation of a service, comparing, at each server, the state values of the unit objects in which a failure occurred with preset thresholds, generating a server failure impact determination result on the basis of a first failure impact determination criterion, and reflecting the result in the failure state of that server for registration in a topology map; and

(b) setting a failure propagation path among the service, the servers, and the unit objects by generating a service failure impact determination result from the failure impact determination results of the plurality of servers on the basis of a second failure impact determination criterion and a failure impact search range.

A service failure monitoring apparatus comprising:

a failure management object registration unit that selects failure management objects related to service operation, the failure management objects including a service, a server, and a unit object, from a database storing the failure management objects and registers them in a topology map;

an attribute setting unit that sets a failure impact attribute of the registered failure management objects, the failure impact attribute including a first failure impact determination criterion, a second failure impact determination criterion, and a failure impact search range; and

a failure impact path selection unit that sets a failure impact propagation path among the service, a plurality of servers related to the operation of the service, and the unit objects of each server by generating failure impact determination results using the failure impact attribute of the failure management objects.

The service failure monitoring apparatus of claim 7, further comprising:

a failure impact generation unit that compares the state values of the unit objects in which the failure occurred with preset thresholds, generates a server failure impact determination result on the basis of the first failure impact determination criterion, and generates a service failure impact determination result from the server failure impact determination results of the plurality of servers on the basis of the second failure impact determination criterion and the failure impact search range.

The service failure monitoring apparatus of claim 7 or 8, further comprising:

a failure impact output unit that outputs, through the topology map, the failure state and failure impact of the failure management object and the generated failure impact determination result.

The service failure monitoring apparatus of claim 7 or 8, further comprising:

a failure impact propagation path search unit that searches for the failure propagation path among the service, the server, and the unit object by selecting a failure management object on the failure impact transmission path and the failure impact reception path.