KR100887874B1

KR100887874B1 - System for managing fault of internet and method thereof

Info

Publication number: KR100887874B1
Application number: KR1020020036891A
Authority: KR
Inventors: 홍원규; 윤동식
Original assignee: 주식회사 케이티
Priority date: 2002-06-28
Filing date: 2002-06-28
Publication date: 2009-03-06
Also published as: KR20040001627A

Abstract

The present invention relates to a fault management system for an Internet network, and a method thereof, that analyzes the correlation among events occurring in the network to find the cause of a fault and to ensure that the necessary action is taken. The system receives every event generated by the devices that make up the Internet network, analyzes the correlation among the received events to identify the exact cause, and takes immediate action when the result indicates a situation that would disrupt Internet service. This allows the network operator to provide stable Internet service and, through that stable service, to raise service quality.

Description

System for managing fault of internet network and method thereof

FIG. 1 is a block diagram of a fault management system for an Internet network according to the present invention.

FIG. 2 is a detailed block diagram of the event management unit of FIG. 1.

The present invention relates to a fault management system for an Internet network and a method thereof, and more particularly to a fault management system and method that analyze the correlation among events occurring in the Internet network to find the cause of a fault and to ensure that the necessary action is taken.

In general, an Internet network consists of anywhere from a few dozen to several hundred routers and switches.

To guarantee stable Internet service over such a network, the routers and switches must be monitored for faults, and immediate action must be taken whenever monitoring detects an abnormal condition.

Such a network is managed as follows: when the SNMP (Simple Network Management Protocol) agent in a device connected to the network detects an abnormal condition in the device, it sends a trap message to the network management system.

This is a passive method, in which the network management system learns of abnormal conditions only through the trap messages sent by the SNMP agent on each device. The active method is for the network management system to poll the state of each device periodically over SNMP.

The former method, however, has a problem: SNMP runs over UDP (User Datagram Protocol), which does not guarantee reliable delivery, so there is no guarantee that a trap message sent by a device actually reaches the network management system. This makes accurate fault monitoring by the network management system difficult.

The latter method requires the network management system to poll the state of a large number of devices periodically. This places a heavy load on the network management system, and the polling interval is long, so it is difficult to obtain an accurate picture of the network's fault state.

In addition, when the network management system receives a large number of events or trap messages generated simultaneously by many routers and switches, it is hard for an operator to identify the exact cause without an accurate diagnostic function.

In other words, because SNMP runs over UDP, which does not guarantee reliable data delivery, the SNMP agent on a device may detect an abnormal condition and send a trap message to notify the network management system, yet in practice the trap message is often lost for a variety of reasons.

Moreover, it is practically impossible for a small number of operators to access hundreds of routers and other devices directly to check their status, so the loss of trap messages is a very serious matter for the network management system and for the network administrator.

Fault management by a network management system is therefore essential, but because the SNMP trap messages sent by the devices that make up the Internet network are frequently lost, accurate fault management is rarely achieved.

To solve the problems described above, an object of the present invention is to analyze and process all events occurring in the various devices of an Internet network so that faults in the network can be managed in an integrated manner.

To this end, a fault management system for an Internet network according to the present invention, which manages faults of devices in the Internet network, comprises: a fault management policy storage unit that holds processing rules for the fault information and events of the devices; a trap message management unit that forwards trap messages from the devices according to the processing rules; an equipment state management unit that collects the state of the devices according to the processing rules; a PING management unit that, according to the processing rules, periodically uses ping to manage the reachability of arbitrary sections of the Internet network and arbitrary devices; a system log management unit that collects system log data from the devices according to the processing rules; and an event management unit that receives event messages from the trap message management unit, the equipment state management unit, the PING management unit, and the system log management unit and analyzes the correlation among the event messages.

A fault management method for an Internet network according to the present invention, which uses event messages about faults of devices in the Internet network, comprises the steps of: receiving, as event messages, and storing information on the fault state of a device, the state of the device, the system log messages of the device, and the reachability of the device as determined by ping; performing at least one of filtering out duplicate events and filtering out predefined unnecessary events from the event messages; analyzing the correlation among the filtered event messages; and reporting the result of the analysis.

Embodiments of the present invention are described in more detail below with reference to the accompanying drawings.

FIG. 1 is a block diagram of a fault management system for an Internet network according to the present invention.

As shown, the fault management system 100 comprises the following units. A fault management policy database 300 stores the rules and principles governing all aspects of fault management. A trap message management unit 500 receives and processes, in real time, trap messages sent by the SNMP agents on the devices 810 to 830, which are connected to the fault management system 100 over the Internet. An equipment state management unit 400 looks up the fault-monitoring targets from the network configuration information and polls the state of the devices 810 to 830 directly over SNMP. A PING management unit 700 periodically uses PING to determine and manage the reachability of arbitrary sections or arbitrary ports in the Internet network. A system log management unit 600 collects, analyzes, and manages system log data, which records in detail the configuration changes and operations performed on the devices 810 to 830. An event management unit 200 collects the event messages provided by the equipment state management unit 400, the trap message management unit 500, the system log management unit 600, and the PING management unit 700, analyzes the correlation among them to identify the root cause of a fault, assigns a severity according to the impact of each cause, and notifies the operator.

The fault management policy database 300 stores the rules by which the equipment state management unit 400, the trap message management unit 500, the system log management unit 600, and the PING management unit 700 each process fault information or events from the devices 810 to 830.

For example, the fault management policy for the PING management unit 700 contains the list of targets whose reachability is to be checked by PING (routers, router ports, and so on) and the interval at which reachability is to be checked (for example, 5 minutes or 10 minutes). The PING management unit 700 checks reachability once per specified interval, and only for the devices 810 to 830 specified in the fault management policy database 300.
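
A PING policy of this kind might be represented as in the following minimal sketch. The `PingPolicy` class, its field names, and the sample addresses are assumptions made for illustration; the patent does not specify a data format.

```python
from dataclasses import dataclass, field


@dataclass
class PingPolicy:
    """Hypothetical policy record for the PING management unit."""
    targets: list = field(default_factory=list)  # routers or router ports to probe
    interval_minutes: int = 5                    # probing interval for every target

    def is_monitored(self, target: str) -> bool:
        # Only targets listed in the policy are ever probed.
        return target in self.targets

    def due(self, minutes_since_last_check: int) -> bool:
        # A target is probed once per interval, never more often.
        return minutes_since_last_check >= self.interval_minutes


# Sample values only (documentation address range).
policy = PingPolicy(targets=["192.0.2.1", "192.0.2.2"], interval_minutes=5)
```

Keeping the target list and the interval in one record mirrors the text: the unit consults the policy both to decide what to probe and when.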

FIG. 2 shows the detailed configuration of the event management unit 200 of FIG. 1.

The event management unit 200 comprises four queues: a trap queue 220 that stores trap messages, a network state queue 222 that stores state information of network devices, a system log queue 224 that stores system log messages, and a PING state queue 226 that stores PING state messages. It also comprises four receivers: a trap event receiver 210 that stores trap messages received from the trap message management unit 500 in the trap queue 220; a state receiver 212 that stores device state information received from the equipment state management unit 400 in the network state queue 222; a system log receiver 214 that stores system log messages received from the system log management unit 600 in the system log queue 224; and a PING state receiver 216 that stores device reachability information received from the PING management unit 700 in the PING state queue 226. The remaining components are: an event deduplication unit 230 that reads the event messages stored in the trap queue 220, the network state queue 222, the system log queue 224, and the PING state queue 226, determines whether each event is a duplicate, processes only the first occurrence, and ignores subsequent occurrences of the same event; an event log 232 that stores the events checked for duplication by the event deduplication unit 230; an event filter policy database 242 that stores the filtering rules; an event filter processing unit 240 that filters the event messages passed on by the event deduplication unit 230 with reference to the event filter policy database 242; an event correlation database 252 that stores the rules for event correlation analysis; an Internet network configuration information database 254 that stores the network configuration information of the Internet; an event correlation analysis unit 250 that, with reference to the event correlation database 252 and the Internet network configuration information database 254, analyzes the correlation among the events received from the event filter processing unit 240 and finds the root cause of the events; and an event notification unit 260 that reports the results of the event correlation analysis unit 250 to the relevant systems.

The trap event receiver 210 receives the trap messages that the trap message management unit 500 has collected from the devices of the Internet network and stores them in the trap queue 220.

The state receiver 212 receives the device state information that the equipment state management unit 400 has collected from the network devices and stores it in the network state queue 222.

Likewise, the system log receiver 214 receives the system log messages that the system log management unit 600 has collected from the network devices and stores them in the system log queue 224, and the PING state receiver 216 receives the reachability information that the PING management unit 700 has collected by probing the network devices and stores it in the PING state queue 226.

The trap queue 220, the network state queue 222, the system log queue 224, and the PING state queue 226 act as buffers that prevent event loss, because network devices generate events at a rate that exceeds what the event management unit 200 can process. The queues are first-in, first-out: the event that entered a queue first is processed first.

The trap event receiver 210, the state receiver 212, the system log receiver 214, and the PING state receiver 216 do nothing more than receive event messages and store them in the corresponding queue.
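
The division of labor between receivers and queues can be sketched as follows. This is an illustration only: the queue names, the `receive` and `next_event` helpers, and the event dictionaries are assumptions, not structures defined in the patent.

```python
from collections import deque

# One first-in, first-out buffer per event source (names are illustrative).
queues = {
    "trap": deque(),
    "state": deque(),
    "syslog": deque(),
    "ping": deque(),
}


def receive(source: str, event: dict) -> None:
    """Receiver role: append the event to its queue and do nothing else."""
    queues[source].append(event)


def next_event(source: str):
    """Consumer role: take the oldest event first, or None if the queue is empty."""
    q = queues[source]
    return q.popleft() if q else None


receive("trap", {"id": 1, "type": "port_down"})
receive("trap", {"id": 2, "type": "port_up"})
first = next_event("trap")  # the event with id 1, because it arrived first
```

Because the receivers only append, a burst of incoming events costs little at arrival time; the slower analysis stages drain the queues at their own pace.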

The event deduplication unit 230 then reads the event messages from the queues and determines whether each event is a duplicate. For duplicated events, it processes only the first occurrence and ignores subsequent occurrences of the same event.

This improves the performance of the event management unit 200 of the fault management system 100.

The event deduplication unit 230 reads a message from a queue, stores it in the event log 232, and removes it from the queue.

If the next event message read from the queue is already stored in the event log 232, the message is not stored in the event log 232 again and is simply deleted from the queue.

The event deduplication unit 230 then sends the event to the event filter processing unit 240.
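
The deduplication steps above can be sketched as follows. The patent does not say what makes two events "the same", so the `event_key` function below (device, port, and event type) is an assumption, as are the function names and sample events.

```python
from collections import deque


def event_key(event: dict) -> tuple:
    # Assumption: two events are "the same" when device, port and type match.
    return (event["device"], event.get("port"), event["type"])


def deduplicate(queue: deque, event_log: dict) -> list:
    """Drain the queue, log first occurrences, and return them for filtering."""
    forwarded = []
    while queue:
        event = queue.popleft()      # every message is removed from the queue
        key = event_key(event)
        if key in event_log:
            continue                 # repeat: dropped without being logged again
        event_log[key] = event       # first occurrence: stored in the event log
        forwarded.append(event)      # and passed on to the filter stage
    return forwarded


q = deque([
    {"device": "r1", "port": "p1", "type": "port_down"},
    {"device": "r1", "port": "p1", "type": "port_down"},
    {"device": "r2", "port": None, "type": "node_down"},
])
log = {}
unique = deduplicate(q, log)
```

A deployed system would presumably also need a way to retire log entries once a fault clears, so that a later recurrence is not mistaken for a duplicate; the source does not describe such a mechanism.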

The event filter processing unit 240 filters the event messages passed on by the event deduplication unit 230.

Filtering improves event-processing performance by ignoring events that the fault management system 100 does not need, and makes diagnosis of the network's fault state more efficient by highlighting important events and ignoring unimportant ones.

The filtering rules may be defined when the fault management system 100 is set up, or changed by the operator at any time while the fault management system 100 is in operation. The rules are stored in the event filter policy database 242.

The event filter processing unit 240 receives the deduplicated events from the event deduplication unit 230 and, according to the rules stored in the event filter policy database 242, ignores the events that should be ignored and passes the important events that need processing to the event correlation analysis unit 250.
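
A policy-driven filter might look like the following sketch. The patent does not define the form of the filtering rules; representing them as a set of event types to ignore is an assumption, and the event type names are hypothetical.

```python
def make_filter(ignored_types: set):
    """Build a predicate from a (hypothetical) set of event types to ignore."""
    def keep(event: dict) -> bool:
        return event["type"] not in ignored_types
    return keep


# Assumed policy contents: the operator has chosen to ignore these types.
policy_ignored = {"config_saved", "login_success"}
keep = make_filter(policy_ignored)

events = [
    {"device": "r1", "type": "port_down"},
    {"device": "r1", "type": "login_success"},
]
# Only events that pass the filter go on to correlation analysis.
to_correlate = [e for e in events if keep(e)]
```

Building the predicate from data rather than hard-coding it reflects the statement that operators can change the rules while the system is running.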

The event correlation analysis unit 250 receives the event messages originating from the network devices from the event filter processing unit 240, identifies the relationships among the events, and finds the root cause of the events.

The event correlation analysis unit 250 assigns a severity to each received event and notifies the operator, so that the operator can take immediate action appropriate to the severity.

Severity is managed in five levels: critical, major, warning, general, and cleared.

Critical is assigned when the impact of the event is that two or more Internet subscribers will lose Internet service. Major is assigned when an individual Internet subscriber cannot use the Internet service. Warning is assigned to an event that does not affect current use of the Internet service but is likely to disrupt service before long if the current state continues. General is assigned to an event that does not interfere with use of the Internet service but that the operator should be aware of. Cleared is assigned when the cause of an event previously rated critical, major, or warning has been removed, so that the fault has been recovered (where the previous severity was critical or major) or the possibility of a fault has gone away (where the previous severity was warning).

The severity therefore lets the operator recognize the impact of an event at a glance and take action on it readily.

The severity of an event is determined in one of two ways: in some cases it can be determined from the individual event alone, and in others it is determined by analyzing the trend of the events that have occurred so far.

When the severity can be determined from the individual event alone, it is assigned according to the following rules.

(Rule 1) If a port-down trap message occurs for a port that affects a single subscriber, and the previous state of that port was normal, assign "major".

(Rule 2) If a port-up trap message occurs for a port that affects a single subscriber, and the previous state of that port was down, assign "cleared".

(Rule 3) If a performance-degradation message occurs for a port that affects a single subscriber, and the previous state of that port was normal, assign "warning".

(Rule 4) If a performance-degradation recovery message occurs for a port that affects a single subscriber, and the previous state of that port was degraded, assign "cleared".

(Rule 5) If a node-down message occurs for a router or switch, and the previous state of that node was normal, assign "critical".

(Rule 6) If a node-up message occurs for a router or switch, and the previous state of that node was down, assign "cleared".
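
Rules 1 to 6 amount to a lookup from an event and the previous state of its port or node to a severity. The sketch below encodes them as a table; the event type strings, state strings, and the `classify` function are assumptions chosen for illustration.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    MAJOR = "major"
    WARNING = "warning"
    GENERAL = "general"
    CLEARED = "cleared"


# (event type, previous state) -> severity, following Rules 1 to 6.
# The port rules apply to ports that affect a single subscriber.
SINGLE_EVENT_RULES = {
    ("port_down", "normal"): Severity.MAJOR,            # Rule 1
    ("port_up", "down"): Severity.CLEARED,              # Rule 2
    ("port_degraded", "normal"): Severity.WARNING,      # Rule 3
    ("port_recovered", "degraded"): Severity.CLEARED,   # Rule 4
    ("node_down", "normal"): Severity.CRITICAL,         # Rule 5
    ("node_up", "down"): Severity.CLEARED,              # Rule 6
}


def classify(event_type: str, previous_state: str):
    """Return the severity for a single event, or None if no rule matches."""
    return SINGLE_EVENT_RULES.get((event_type, previous_state))
```

For example, `classify("port_down", "normal")` yields `Severity.MAJOR`, while a second port-down report for a port already known to be down matches no rule and yields `None`.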

To determine the impact of an individual event, however, the severity must take into account where the event occurred.

This requires correlating the location of the event with the network configuration information of the Internet; only then can the true impact of the event be judged.

The Internet network configuration information database 254 stores the network configuration information of the Internet. It consists of the routers, the list of ports installed in each router, and the connection topology between routers.

Event correlation analysis based on this network configuration information follows the rules below.

(Rule 7) If a port-up trap message occurs for a port that affects one or more subscribers, and the previous state of that port was down, assign "cleared".

(Rule 8) If a port-down trap message occurs for a port that affects one or more subscribers, and the previous state of that port was normal, assign "critical".

(Rule 9) If two or more port-down events occur for ports on the same node, and the state of that node is normal, assign "critical" to each individual port-down event.

(Rule 10) If two or more port-down events occur for ports on the same node, and the state of that node is down, ignore the individual port-down events, generate a new "node down" event in their place, and assign "critical" to it.

(Rule 11) If two or more port-up events occur for ports on the same node, and the state of that node is normal, assign "cleared" to each individual port-up event.

(Rule 12) If two or more port-up events occur for ports on the same node, and the state of that node is faulty, ignore the individual port-up events.

(Rule 13) If a port-down or node-down event occurs, but the network configuration information shows that the port or node is not connected to any other port or node, ignore the event.
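
The topology-aware rules for port-down events (Rules 9, 10, and 13) can be sketched as below. The function name, the tuple-based event representation, and the `links` set standing in for the network configuration information are all assumptions; cases with a single port per node are left to Rules 1 and 8 and are not handled here.

```python
from collections import defaultdict


def correlate_port_downs(port_events, node_state, links):
    """Apply Rules 9, 10 and 13 to a batch of port-down events.

    port_events: list of (node, port) tuples
    node_state:  dict mapping node -> "normal" or "down"
    links:       set of (node, port) pairs that are connected to something
    Returns a list of (event_name, subject, severity) tuples.
    """
    # Rule 13: ignore ports that are not connected to anything.
    connected = [(n, p) for n, p in port_events if (n, p) in links]

    by_node = defaultdict(list)
    for node, port in connected:
        by_node[node].append(port)

    results = []
    for node, ports in by_node.items():
        if len(ports) < 2:
            continue  # Rules 9 and 10 need two or more ports on the same node
        if node_state.get(node) == "down":
            # Rule 10: replace the port events with one node-down event.
            results.append(("node_down", node, "critical"))
        elif node_state.get(node) == "normal":
            # Rule 9: the node is up, so each port failure is reported.
            results.extend(("port_down", (node, p), "critical") for p in ports)
    return results


events = [("r1", "p1"), ("r1", "p2"), ("r2", "p1"), ("r2", "p2"), ("r3", "p9")]
state = {"r1": "down", "r2": "normal", "r3": "normal"}
links = {("r1", "p1"), ("r1", "p2"), ("r2", "p1"), ("r2", "p2")}
out = correlate_port_downs(events, state, links)
```

In this sample, the two port failures on `r1` collapse into a single node-down event, the two on `r2` are reported individually, and the unconnected port on `r3` is dropped.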

Event messages collected through SNMP traps and through system log analysis are each processed immediately as a single event. By contrast, the reachability of a device or port collected through PING, and the state of a device or port collected through SNMP-Get, are gathered periodically.

Here, collecting state information through SNMP-Get means that the equipment state management unit 400 periodically collects the state of the network directly using SNMP.

Because this information is not an event message generated in real time by a network device, it does not reflect the latest state of the network; it reflects a device state that may be as old as the collection interval (for example, 5 or 10 minutes).

Such information is therefore processed according to the following rules.

(Rule 14) When a PING reachability event is received and the reachability is "no", the event correlation analysis unit 250 proceeds as follows. If the previous state was also "no", it ignores the event. If the previous state was "yes", it asks the equipment state management unit 400 for the current state of the device; if that state is down, it generates a device (node or port) down event, and if that state is normal, it ignores the event.

(Rule 15) When the state of a device (node or port) collected by SNMP-Get is down, the event correlation analysis unit 250 asks the PING management unit 700 to check by PING whether the port or node is reachable. If it is reachable, the down event message is ignored; if it is unreachable, a down message is sent.
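
Rules 14 and 15 describe a cross-check: a suspicious result from one polling source is confirmed against the other before an alarm is raised. A minimal sketch follows; the function names and the callables standing in for requests to the other units are assumptions.

```python
def on_ping_result(reachable, previously_reachable, query_state):
    """Rule 14: handle a PING reachability event.

    query_state is a callable standing in for a request to the equipment
    state management unit; it returns "down" or "normal".
    """
    if reachable:
        return None                  # Rule 14 only covers unreachable results
    if not previously_reachable:
        return None                  # still unreachable: already known, ignore
    # Reachable before, unreachable now: confirm over SNMP before alarming.
    return "down_event" if query_state() == "down" else None


def on_snmp_down(ping_check):
    """Rule 15: handle a polled state of "down".

    ping_check is a callable standing in for a request to the PING
    management unit; it returns True if the target is reachable.
    """
    return None if ping_check() else "down_event"
```

Requiring agreement between two independent probes reduces false alarms from stale polling data, at the cost of one extra query per suspected fault.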

Following these rules, the event correlation analysis unit 250 analyzes the correlation among events, assigns a severity to each event, finds the root cause, generates a new event that the operator can easily identify, and sends it to the event notification unit 260.

The event notification unit 260 then reports each generated event to the relevant systems.

The event notification unit 260 keeps track of the location and name of each system so that, when an event occurs, it can deliver the event accurately to the systems that need it.

In addition to these other systems, the management and operations unit of the Internet network is sent every event that occurs.
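
The notification behavior can be sketched as a small registry keyed by system name. The `EventNotifier` class, the idea that each system registers the event types it wants, and the sample names are assumptions; the patent states only that the unit manages each system's location and name and that the operations unit receives every event.

```python
class EventNotifier:
    """Sketch of a notifier that tracks recipient systems by name and location."""

    def __init__(self):
        self.systems = {}     # system name -> (location, set of wanted event types)
        self.delivered = []   # (destination, event) pairs, recorded instead of sent

    def register(self, name, location, event_types):
        self.systems[name] = (location, set(event_types))

    def notify(self, event):
        # The network management operations unit receives every event.
        self.delivered.append(("operations", event))
        # Other systems receive only the event types they registered for.
        for name, (location, wanted) in self.systems.items():
            if event["type"] in wanted:
                self.delivered.append((f"{name}@{location}", event))


notifier = EventNotifier()
notifier.register("billing", "10.0.0.5", ["node_down"])  # hypothetical system
notifier.notify({"type": "node_down", "node": "r1"})
notifier.notify({"type": "port_down", "node": "r1"})
destinations = [dest for dest, _ in notifier.delivered]
```

Here the node-down event goes to both the operations unit and the registered system, while the port-down event goes to the operations unit alone.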

상기와 같은 장애 관리 시스템(100)을 이용하여 대규모 인터넷 망의 장애 관리가 가능하고, 이를 통해 안정적인 인터넷 서비스 제공이 가능하다.By using the failure management system 100 as described above it is possible to manage the failure of a large-scale Internet network, thereby providing a stable Internet service.

As described above, the present invention receives every event generated by the devices without omission, analyzes the correlations among the received events to identify the exact cause, and takes immediate action when the result shows a situation that disrupts the provision of Internet service. From the network operator's point of view this makes stable Internet service possible, and that stable Internet communication service in turn raises the quality of service.

Claims

A failure management system of an Internet network for managing failures of devices configured in the Internet network, the system comprising:

a fault management policy storage unit holding fault information of the device and a processing rule for an event;

a trap message manager for transmitting a trap message of the device according to the processing rule;

an equipment state manager for collecting a state of the device according to the processing rule;

a ping manager that periodically manages, using a ping, the reachability of any section and any device belonging to the Internet network according to the processing rule;

a system log manager for collecting system log data of the device according to the processing rule; and

an event manager configured to receive event messages from the trap message manager, the equipment state manager, the ping manager, and the system log manager, respectively, and to analyze correlations between the event messages.

The system of claim 1,

wherein the device is equipped with an SNMP agent that provides the trap message, and the state of the device is provided using the SNMP protocol.

The system of claim 1,

wherein the event manager processes each event message through a queue.

The system of claim 1,

wherein the event manager performs at least one of filtering of duplicate events and filtering of predefined unnecessary events.

The system of claim 1,

wherein the event manager generates a severity according to a predefined multi-level scheme for at least one event message.

delete

The system of claim 1,

wherein the event manager comprises:

storage means for receiving the event messages provided by the trap message manager, the equipment state manager, the ping manager, and the system log manager, and storing them in individual queues corresponding to each manager;

event duplication processing means for reading the event messages stored in the storage means and performing primary filtering of duplicate events;

event filtering means for performing secondary filtering of predefined unnecessary events among the primarily filtered events;

event correlation analysis means for analyzing correlations among the secondarily filtered events; and

event notification means for notifying the corresponding system of the result of the event correlation analysis means.

delete

A failure management method of an Internet network using event messages about failures of devices configured in the Internet network, the method comprising:

receiving and storing, each as an event message, information about a failure state of the device, a state of the device, a system log message of the device, and a reachability of the device determined using a ping;

performing at least one of filtering of duplicate events among the event messages and filtering of predefined unnecessary events;

analyzing correlations of the filtered event messages; and

notifying of the analyzed result.

The method of claim 11,

wherein the filtering of duplicate events and the filtering of predefined unnecessary events are performed sequentially.

The method of claim 11,

further comprising generating a severity according to a predefined multi-level scheme in response to at least one of the event messages.

delete