KR100964392B1

KR100964392B1 - System and method for managing network failure

Info

Publication number: KR100964392B1
Application number: KR1020030043632A
Authority: KR
Inventors: 성종규; 황찬규; 김기응; 박호석
Original assignee: 주식회사 케이티
Priority date: 2003-06-30
Filing date: 2003-06-30
Publication date: 2010-06-17
Also published as: KR20050002263A

Abstract

The present invention relates to a failure management system and method in network management for managing failures in large-scale networks.

The failure management system and method perform secondary filtering on failure events that have already been primary-filtered and transmitted by a regional gateway device. Among these failure events, those related to network equipment, ports, and links are processed as failure alarms, and those related to large-scale failures are handled by tracking the location of the failure using object tree IDs.

In this way, when a failure occurs in a network resource, the location of the failure is tracked and managed using dual filtering and the object tree IDs of the network elements. The correlations among network elements are therefore easy to identify, failures can be handled accurately and quickly, and, above all, large-scale failures can be handled without having to analyze the correlations among complex failure events.

Network management system, failure management server, regional gateway device, primary filter, secondary filter, object tree ID

Description

Fault management system and its method in network management {SYSTEM AND METHOD FOR MANAGING NETWORK FAILURE}

FIG. 1 illustrates the configuration of a network management system to which the present invention is applied.

FIGS. 2 and 3 illustrate the configuration of a failure management system in network management according to an embodiment of the present invention.

FIG. 4 illustrates the structure of an object tree ID applied to the present invention.

FIGS. 5A to 5H illustrate the process of assigning an object tree ID applied to the present invention.

FIG. 6 illustrates the structure of a hierarchy ID applied to the present invention.

FIG. 7 is a flowchart of a failure management method in network management according to an embodiment of the present invention.

Most Internet services are delivered over IP, and to provide them, network operators typically build networks out of routers and switches to accommodate new services.

With the recent successful launch of high-speed Internet service, the number of network elements built from routers and switches has grown rapidly, and they already form large-scale networks.

In most Internet environments in Korea, subscriber traffic is accepted by an access network that uses various access technologies such as Metro Ethernet, leased lines, and wireless LAN, and is delivered to a backbone network composed of switches and routers. From the backbone network, the traffic reaches servers, connects to other ISPs, or goes out over international lines.

A network management system that maintains a network of such IP-based devices must quickly and accurately detect, and respond to, failures in the switches and routers that accommodate subscribers and in the backbone devices that carry their traffic to server farms, other ISPs, or international lines.

Accordingly, a network management system for a large-scale IP network cannot rely on a failure-handling approach designed for managing a handful of devices. It must be able to handle failures across devices totaling tens of thousands to hundreds of thousands of ports simultaneously and to monitor the situation nationwide.

To this end, the network management system must quickly extract, process, and analyze tens of thousands to hundreds of thousands of failure data items in a large-scale IP network and pinpoint the device or port that is actually causing the failure.

However, conventional network management systems lack the capability to handle failures across tens of thousands to hundreds of thousands of ports simultaneously and to monitor the situation nationwide. Network operators likewise have been unable to accommodate complex services in an integrated manner, in particular the operation and management functions of failure management and performance management.

The number of switches and routers serving as network elements, and of their ports, is growing exponentially as more services are accommodated. Introducing a separate network management system for each service to manage these network elements imposes an excessive cost burden on the network operator.

The technical problem to be solved by the present invention is to provide a failure management system and method in network management that can track the exact location of a failure when it occurs in a large-scale IP network and can quickly handle large-scale failures.

To solve this problem, the present invention handles failures in a large-scale network accurately and quickly by means of object tree IDs and dual filtering.

A failure management system in network management according to a first aspect of the present invention includes: a regional gateway device that collects failure event information from network equipment, performs primary filtering on the failure event information, and groups the failure events occurring within the same period into a single group for transmission; and a failure management server that performs secondary filtering on the failure events transmitted from the regional gateway device, processes failure events related to the network equipment, ports, and links as failure alarms, and, for failure events related to a large-scale failure, tracks the location of the failure and handles it for the failed object together with the objects correlated with it.

Preferably, the regional gateway device includes: a ping tester that periodically performs an Internet Control Message Protocol (ICMP) ping test against the network equipment and transmits the test result; an SNMP processor that connects to the network equipment using the Simple Network Management Protocol (SNMP), periodically collects the port state and trap information of each piece of equipment, and transmits the collected result; a log information processor that converts the remote log information sent by the network equipment into the data format used in the system and transmits the resulting event information; and a primary filter that analyzes and primary-filters the information transmitted from the ping tester, the SNMP processor, or the log information processor and, when a failure condition related to the network equipment is detected, groups the failure events of all ports belonging to that equipment into a single group and transmits it to the failure management server.

Preferably, the SNMP processor includes: an operation state monitoring unit that reads the operation state of the network equipment in each configured period and collects information on whether each port of each piece of network equipment is up or down; and a trap parsing unit that parses the trap information sent by the network equipment, converts it into the data format used in the system, and transmits the converted event information to the primary filter.

Preferably, the failure management server includes: a resource object manager that manages the network equipment as a tree of physical and logical objects and assigns an ID to each object to manage object tree IDs; a buffer that stores the failure events transmitted from the regional gateway device; an alarm processor that monitors the failure events in the buffer, extracts them by failure type, and processes them as alarms; a link failure processor that processes those failure events in the buffer that belong to a link as link failure alarms; a secondary filter that examines whether an event forwarded by the alarm processor or the link failure processor after alarm processing constitutes a large-scale failure and tracks the location of the object with the large-scale failure using the object tree IDs of the resource object manager; and a large-scale failure manager that, based on the result of the secondary filter, performs large-scale failure processing on all object tree IDs beneath the object with the large-scale failure.

Preferably, the object tree ID of the resource object manager is divided into an object ID, which represents the physical and logical correlations of the object, and a hierarchy ID, which represents the object's upper-level relationships.

Preferably, the object ID is generated by assigning, in order, a network identification number, an equipment identification number, a link identification number, a chassis identification number, a module identification number, a card identification number, a port identification number, and a logical port identification number.

Preferably, the object ID of a lower-level object is defined so that it includes the object ID of the upper-level object to which it belongs.

Preferably, when the primary filter detects, using the object tree IDs of the resource object manager, a failure of an object tree ID at a particular level, it ignores the failure conditions of the object tree IDs at the levels beneath that object tree ID.

A failure management method in network management according to a second aspect of the present invention includes the steps of: A) collecting failure event information from network equipment and classifying it by failure type; B) extracting, from the result of step A), the failure event information for failures related to the network equipment and ports and processing it as alarms, and extracting the failure event information for link-related failures and processing it as alarms; and C) for the failure events from step A) that relate to a large-scale failure, tracking the location of the failed object and performing large-scale failure processing on the failed object together with the objects correlated with it.

Preferably, in collecting the failure data in step A), a regional gateway device that periodically monitors the network equipment for failures performs primary filtering on the failure data and, when a failure condition related to the network equipment is detected, groups the failure events of all ports belonging to that equipment into a single group for transmission.

Preferably, in classifying the failure events in step A), a unique ID is assigned to each object of the network equipment to generate physical and logical object tree IDs, and the failure events are analyzed using the object tree IDs to identify the type and location of the failure.

Preferably, in the primary filtering of the failure events, when a failure of an object tree ID at a particular level is detected using the object tree IDs, the failure conditions of the object tree IDs at the levels beneath that object tree ID are ignored.

Preferably, handling the large-scale failure in step C) includes the steps of: a) receiving the events resulting from the alarm processing of step B), performing secondary filtering on them to examine whether a large-scale failure has occurred, and tracking the location of the object with the large-scale failure using the object tree IDs; and b) based on the secondary filtering result of step a), performing large-scale failure processing on all object tree IDs beneath the object with the large-scale failure.

Preferably, the object tree ID is divided into an object ID, which represents the physical and logical correlations of the object, and a hierarchy ID, which represents the object's upper-level relationships. Preferably, the object ID of a lower-level object is defined so that it includes the object ID of the upper-level object to which it belongs.

Preferably, the object ID is generated by assigning, in order, a network identification number, an equipment identification number, a link identification number, a chassis identification number, a module identification number, a card identification number, a port identification number, and a logical port identification number.

Preferably, within the object ID: the network identification number is generated using a unique ID assigned to each service domain; the equipment identification number is generated by concatenating the regional headquarters, local office number, equipment type, and serial number of the equipment; the link identification number is defined as a logical object belonging between pieces of equipment; the chassis identification number is defined as a physical object belonging to each piece of equipment; the module identification number is defined as a physical object belonging to each chassis; the card identification number is defined as a physical object belonging to each module; the port identification number is defined as a physical object belonging to the equipment; and the logical port identification number is defined as a logical object belonging to each port.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice it. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the present invention clearly, and similar parts are designated by the same reference numerals throughout the specification.

As shown in FIG. 1, the network management system includes an interworking management device 10 for interworking with other systems, a GUI (Graphical User Interface) management device 20, a permission and account management device 30, a server management device 40, a failure management device 50, a performance management device 60, a configuration management device 70, and a database 80.

The failure management device 50 detects and isolates failures and detects abnormal operation; the performance management device 60 evaluates whether communication can be carried out efficiently; and the configuration management device 70 manages the configuration of the network equipment.

For such a network management system, there are broadly three approaches to collecting the data used to detect failures.

The first approach is to build a dedicated management network (Data Communication Network) for failure detection. This is costly, and a failure in the management network itself goes undetected.

The second approach is to divide the network management system into a central server and regional gateway systems: the regional gateway systems generate all raw failure data, and the central server collects the results and processes them into failure data.

The third approach is for the central server to collect and process all failure-related data. This requires a very high-performance server, and it is difficult to find a server capable of managing a large-scale network.

The failure management system in network management according to an embodiment of the present invention applies the second approach described above to perform failure management for large-scale networks, including general IP networks.

Referring to FIGS. 2 and 3, the failure management system in network management according to an embodiment of the present invention broadly includes a regional gateway device 100 and a failure management server 200.

The regional gateway device 100 collects failure event information from the network equipment, performs primary filtering on it, and groups the failure events occurring within the same period into a single group for transmission.

The regional gateway device 100 uses the following four methods to check whether a piece of network equipment, or a port belonging to it, has failed.

1. PING (packet internet groper): an ICMP ping test is performed on the equipment, and a missing echo reply or a timeout is recognized as a failure condition.

2. SNMP polling: the ifOperStatus entry of the IF table is polled periodically via SNMP to check whether the ports of a device have failed. If the device cannot even be bound to via SNMP, this is recognized as a failure of the device itself.

3. SNMP trap (TRAP): the regional gateway device 100 for the area is designated as the trap host, so that the SNMP traps generated by the equipment are delivered to it and collected. Of the standard traps, link up and link down are used to determine port failures, and cold start and warm start are used to determine device failures.

4. Remote log (SYSLOG): this uses the equipment's ability to send the logs it maintains to a remote host. The regional gateway device 100 for the area is designated as the equipment's SYSLOG host and thus receives the log information maintained by the equipment. From this log information, the regional gateway device 100 extracts entries related to port activation and deactivation for use as port failure information, and entries related to device abnormalities for use as device failure information.
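The four detection methods above can be viewed as feeding one normalized event stream. The following is a minimal sketch under stated assumptions: the `FailureEvent` record and the function names are hypothetical and do not appear in the specification. Only the rules themselves (no echo or timeout means a device failure; link up/down traps concern ports; cold/warm start traps concern devices) are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureEvent:
    """Hypothetical normalized record; not defined in the specification."""
    source: str                 # "ping", "snmp_poll", "trap", or "syslog"
    device: str                 # any string key identifying the equipment
    scope: str                  # "device" or "port"
    port: Optional[str] = None  # set only for port-scoped events


# Standard trap names mapped to the scope the text assigns to them.
TRAP_SCOPE = {
    "linkUp": "port",
    "linkDown": "port",
    "coldStart": "device",
    "warmStart": "device",
}


def event_from_ping(device: str, echo_received: bool) -> Optional[FailureEvent]:
    """A missing echo reply or a timeout is treated as a device failure."""
    if echo_received:
        return None
    return FailureEvent("ping", device, "device")


def event_from_trap(device: str, trap_name: str,
                    port: Optional[str] = None) -> Optional[FailureEvent]:
    """Map a standard trap to a port- or device-scoped event; ignore others."""
    scope = TRAP_SCOPE.get(trap_name)
    if scope is None:
        return None
    return FailureEvent("trap", device, scope, port if scope == "port" else None)
```

Polling and syslog results would be normalized into the same record in the same way, so that the primary filter only has to deal with one event shape.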

Accordingly, the regional gateway device 100 includes, but is not limited to, a ping tester 110, an SNMP processor 120, a log information (SYSLOG) processor 130, and a primary filter 140.

The ping tester 110 periodically performs an Internet Control Message Protocol (ICMP) ping test against the network equipment and transmits the test result to the primary filter 140.

The SNMP processor 120 is divided into an operation state monitoring unit 121 and a trap parsing unit 122. During each configured period (T), the operation state monitoring unit 121 connects to each piece of equipment via the Simple Network Management Protocol (SNMP), using the equipment's IP address and its configured community (COMMUNITY) information.

The operation state monitoring unit 121 then reads ifOperStatus (SNMP OID (Object ID): 1.3.6.1.2.1.2.2.1.8) to collect information on whether each port of each piece of equipment is up or down, and transmits this information to the primary filter 140.
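As an illustration of how a polled ifOperStatus result might be interpreted, the sketch below takes an already-collected poll result and derives the down ports. It deliberately performs no SNMP I/O. The function name, the input shape (a mapping from interface index to status value, or `None` when the device could not be reached), and the decision to count only the value `down(2)` as a port failure are assumptions made for the example; the numeric status values are those defined by the standard IF-MIB.

```python
from typing import Dict, List, Optional, TypedDict

# OID of the ifOperStatus column, as given in the text.
IF_OPER_STATUS_OID = "1.3.6.1.2.1.2.2.1.8"

# ifOperStatus values as defined by the standard IF-MIB.
IF_OPER_STATUS = {
    1: "up",
    2: "down",
    3: "testing",
    4: "unknown",
    5: "dormant",
    6: "notPresent",
    7: "lowerLayerDown",
}


class PollSummary(TypedDict):
    device_failed: bool
    down_ports: List[int]


def summarize_poll(poll_result: Optional[Dict[int, int]]) -> PollSummary:
    """Interpret one polling cycle for one device.

    poll_result maps ifIndex -> ifOperStatus value, or is None when the
    device could not be reached via SNMP at all, which the text treats
    as a failure of the device itself.
    """
    if poll_result is None:
        return {"device_failed": True, "down_ports": []}
    # Assumption for this sketch: only down(2) counts as a port failure.
    down = sorted(idx for idx, status in poll_result.items() if status == 2)
    return {"device_failed": False, "down_ports": down}
```

For example, `summarize_poll({1: 1, 2: 2, 3: 2})` reports ports 2 and 3 as down, while `summarize_poll(None)` reports a device-level failure.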

Of the many traps collected from the network equipment, the trap parsing unit 122 parses those whose types are registered for use in the network management system according to their type definitions, converts them into the format used by the network management system to turn them into events, and transmits the event information to the primary filter 140.

The SYSLOG processor 130 has the remote log information sent by the equipment stored in a specific file format, according to the configuration of the log server daemon on the regional gateway device 100. It then parses the stored remote log information, converts it into the format used by the network equipment to turn it into events, and transmits the event information to the primary filter 140.

The primary filter 140 manages the failure profiles of all objects managed by its regional gateway device 100 and performs the primary filtering.

That is, the primary filter 140 recognizes all ports belonging to a piece of equipment as down when a trap is raised because the network equipment is down, when several ping tests against the equipment's representative IP address detect that it is down, or when the remotely transmitted log information of the equipment shows a failure condition caused by a physical error in the equipment.

The primary filter 140 then groups the port failure events belonging to the failed network equipment, together with all failure events falling within the same period, into a single group and transmits it to the failure management server 200.

Because the primary filter 140 handles the failure events within the same period as a single group, the communication burden between the regional gateway device 100 and the failure management server 200, and the load on the system, are reduced.
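The grouping behaviour of the primary filter can be sketched as follows. This is an illustrative sketch only: the event tuple layout, the function name, and the use of fixed-length time windows to represent "the same period" are assumptions, not details given in the specification. Within each period, a device-level failure causes the port-level events of the same device to be dropped, and whatever remains is sent as one group.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# (timestamp in seconds, device id, port id or None for a device-level failure)
Event = Tuple[float, str, Optional[str]]


def primary_filter(events: Iterable[Event],
                   period_seconds: float) -> Dict[int, List[Event]]:
    """Group events by period and suppress port events of failed devices."""
    by_period: Dict[int, List[Event]] = defaultdict(list)
    for ts, device, port in events:
        by_period[int(ts // period_seconds)].append((ts, device, port))

    groups: Dict[int, List[Event]] = {}
    for period, evs in by_period.items():
        # Devices reported as down in this period.
        failed_devices = {device for _, device, port in evs if port is None}
        # Keep device-level events, plus port events of devices still alive.
        groups[period] = [
            (ts, device, port)
            for ts, device, port in evs
            if port is None or device not in failed_devices
        ]
    return groups
```

Each value in the returned mapping corresponds to one message to the failure management server, which is where the reduction in communication load comes from.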

Meanwhile, the failure management server 200 performs secondary filtering on the failure events transmitted from the regional gateway device 100. It processes failure events related to the network equipment, ports, and links as failure alarms and, for failure events related to a large-scale failure, tracks the location of the failure and handles it for the failed object together with the objects correlated with it.

As shown in FIG. 3, the failure management server includes, but is not limited to, a buffer 210, a resource object manager 220, an alarm processor 230, a link failure processor 250, and a large-scale failure manager 260.

The buffer 210 stores the failure events transmitted from the regional gateway device. The buffer 210 may hold the failure event information as a table in the database 80, in system memory, or as a file. The most suitable option is chosen for the buffer 210 in view of the operating environment of each system.

The resource object manager manages the network equipment in the form of physical and logical object tree IDs.

An object tree ID can be divided into an object ID, which represents the physical and logical correlations of the object, and a hierarchy ID, which represents the object's upper-level relationships.

FIG. 4 illustrates the structure of an object tree ID applied to the present invention, and FIGS. 5A to 5H illustrate the process of assigning an object tree ID applied to the present invention.

As shown in FIG. 4, the resource object manager 220 manages physical and logical objects in the form of a tree and assigns a unique key value to every object.

In the object ID, a root exists at the top level, and beneath it a network identification number, an equipment identification number, a link identification number, a chassis identification number, a module identification number, a card identification number, a port identification number, and a logical port identification number are assigned in order.

The resource object manager 220 first divides the network resources by service domain as needed, for example into wireless LAN, Metro service, ntopia, backbone, and access layer. This division of network resources is also useful in the network management system for failure and performance management and for generating the various per-service reports.

The network identification number is assigned as an ID per domain. For example, Nespot is 1, the Metro Ethernet access service is 2, the ntopia service is 3, the backbone is 4, and the access network is 5 (FIG. 5A).

Equipment is assigned the object identifier 2, and the unique ID of a piece of equipment is generated as a unique object ID by concatenating its regional headquarters, local office number, equipment type, and serial number (FIG. 5B).

That is, the equipment identification number consists of domain (1 digit) + object identifier (1 digit) + regional headquarters (2 digits) + local office number (3 digits) + equipment type (2 digits) + serial number (3 digits), for example 4.24243512005.
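The field layout of the equipment identification number can be checked with a short sketch. The function and parameter names are hypothetical; the field widths, the object identifier 2, and the placement of the dot after the domain digit follow the example 4.24243512005 given in the text.

```python
def equipment_id(domain: int, region: int, local_office: int,
                 equipment_type: int, serial: int, object_code: int = 2) -> str:
    """Compose an equipment ID from fixed-width decimal fields.

    Layout: domain (1) . object identifier (1) + regional headquarters (2)
            + local office number (3) + equipment type (2) + serial number (3)
    """
    fields = [
        (domain, 1),
        (object_code, 1),
        (region, 2),
        (local_office, 3),
        (equipment_type, 2),
        (serial, 3),
    ]
    for value, width in fields:
        if not 0 <= value < 10 ** width:
            raise ValueError(f"{value} does not fit in {width} digit(s)")
    # Everything after the domain digit is zero-padded and concatenated.
    digits = "".join(f"{value:0{width}d}" for value, width in fields[1:])
    return f"{domain}.{digits}"
```

With domain 4 (backbone), regional headquarters 42, local office number 435, equipment type 12, and serial number 5, the function returns `4.24243512005`, matching the example in the text.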

A link is a logical object that appears to belong to each piece of equipment, but it has no meaning for a single device on its own. The link is therefore defined as a logical object lying between devices and is classified at the same level as equipment (FIG. 5C).

For example, the link identification number is 3, and a link ID consists of domain (1 digit) + object identifier (1 digit) + serial number (8 digits), for example 4.300000032.

A link is meaningful only when its peer (far-end) ports are known, so the link either includes a combined value of the peer ports or the value is stored in each table record of the database 80.

A chassis is a physical object belonging to a device, so it is placed at the level below equipment. The chassis identification number is 4, and the chassis index (chas) is expressed in 2 digits (FIG. 5D). Example: 4.24243512005.401

A module is a physical object belonging to a chassis, so it is placed at the level below the chassis. The module identification number is 5 (FIG. 5E). Example: 4.24243512005.401.502

A card is a physical object belonging to a module, so it is placed at the level below the module. The card identification number is 6 (FIG. 5F).

The card ID consists of module object tree ID + object identifier (1 digit) + index (2 digits), for example 4.24243512005.401.501.612 (FIG. 5G).

A port is a physical object belonging to the equipment, and the port identification number is 7. The port ID consists of equipment object tree ID + object identifier (1 digit) + port type (2 digits) + sequence (4 digits) (FIG. 5H). Example: 4.24243512005.7010005

A logical port is a logical object belonging to a port, so it is placed at the level below the port. The logical port identification number is 8. The logical port ID consists of port object tree ID + object identifier (1 digit) + ifnumber (5 digits), where ifnumber is the interface number. Example: 4.24243512005.7010005.800001
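The per-level rules described so far (link, chassis, module, card, port, logical port) all follow one pattern: take a parent ID and append a dot, the object identifier digit, and fixed-width fields. A minimal Python sketch under that reading follows; every function name is hypothetical, and only the digit widths and identifier digits come from the text.

```python
def fixed(value: int, width: int) -> str:
    """Zero-pad value to exactly `width` digits, rejecting values that do not fit."""
    if not 0 <= value < 10 ** width:
        raise ValueError(f"{value} does not fit in {width} digits")
    return f"{value:0{width}d}"


def link_id(domain: int, serial: int) -> str:
    return f"{fixed(domain, 1)}.3{fixed(serial, 8)}"        # kind 3, 8-digit serial


def chassis_id(equipment: str, index: int) -> str:
    return f"{equipment}.4{fixed(index, 2)}"                # kind 4, 2-digit index


def module_id(chassis: str, index: int) -> str:
    return f"{chassis}.5{fixed(index, 2)}"                  # kind 5, 2-digit index


def card_id(module: str, index: int) -> str:
    return f"{module}.6{fixed(index, 2)}"                   # kind 6, 2-digit index


def port_id(equipment: str, port_type: int, seq: int) -> str:
    return f"{equipment}.7{fixed(port_type, 2)}{fixed(seq, 4)}"  # kind 7


def logical_port_id(port: str, ifnumber: int) -> str:
    return f"{port}.8{fixed(ifnumber, 5)}"                  # kind 8, 5-digit ifnumber


equipment = "4.24243512005"            # equipment ID from the example in the text
port = port_id(equipment, 1, 5)
print(link_id(4, 32))                  # 4.300000032
print(chassis_id(equipment, 1))        # 4.24243512005.401
print(port)                            # 4.24243512005.7010005
print(logical_port_id(port, 1))        # 4.24243512005.7010005.800001
```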

Logical ports include sub ports and virtual LANs (VLANs).

In this way, the object ID of each lower-level object is defined so that it contains the object ID of the higher-level object to which it belongs. The object ID therefore shows directly where each object sits in the tree of FIG. 4 and which containment relationships it has.
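Because each ID embeds its parent's ID, containment can be checked with plain string operations. The sketch below is illustrative only: the function names are hypothetical, and the dot-separated layout is inferred from the example IDs in the text.

```python
# Object identifier digits taken from the description (2 = equipment ... 8 = logical port)
KIND_NAMES = {
    "2": "equipment", "3": "link", "4": "chassis", "5": "module",
    "6": "card", "7": "port", "8": "logical port",
}


def levels(object_id: str):
    """Split an object tree ID into (level name, field digits) pairs."""
    domain, *segments = object_id.split(".")
    return [("network", domain)] + [(KIND_NAMES[s[0]], s[1:]) for s in segments]


def is_ancestor(ancestor: str, descendant: str) -> bool:
    """True if `descendant` lies strictly below `ancestor` in the object tree."""
    # The trailing dot prevents a match in the middle of a segment.
    return descendant.startswith(ancestor + ".")


print(levels("4.24243512005.7010005"))
# [('network', '4'), ('equipment', '4243512005'), ('port', '010005')]
print(is_ancestor("4.24243512005", "4.24243512005.7010005.800001"))  # True
```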

The object ID alone, however, does not clearly reveal the correlation between devices and ports.

FIG. 6 shows the structure of the hierarchy ID applied to the present invention. As shown in FIG. 6, the resource object manager 220 uses a hierarchy ID in which each piece of equipment is given an ID for each hierarchy level. The hierarchy ID is used to determine the correlation between the devices attached to a link.

The resource object manager 220 stores the object IDs and hierarchy IDs of the peer equipment in the profile that holds the information on each link, and the primary filter 140 or the secondary filter 240 reads the object ID and hierarchy ID from the resource object manager 220 and uses them for filtering.

That is, using a value such as 4.300000032 (link tree ID) + upper equipment hierarchy ID + lower equipment hierarchy ID, the primary filter 140 or the secondary filter 240 can tell directly, for every link, where the link sits in the network and which two pieces of equipment it connects.

Using the object tree ID of the resource object manager 220, when the primary filter 140 detects a failure of an object at a higher level, failures at the lower levels of that object ID carry no additional information, so the lower-level failures can be filtered out very easily.
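The suppression rule can be sketched as follows, assuming failure events are represented only by the object tree ID of the failed object. This is an illustrative reading of the primary filter, not the patented implementation; `ancestors` and `primary_filter` are hypothetical names.

```python
def ancestors(object_id: str):
    """Return the IDs of all objects above `object_id`, topmost first."""
    parts = object_id.split(".")
    # The first two parts (domain + equipment or link segment) form the topmost object.
    return [".".join(parts[:n]) for n in range(2, len(parts))]


def primary_filter(failed_ids):
    """Drop every failure whose ancestor has also been reported as failed."""
    failed = set(failed_ids)
    return [oid for oid in failed_ids
            if not any(a in failed for a in ancestors(oid))]


events = [
    "4.24243512005",                  # equipment down
    "4.24243512005.7010005",          # port on that equipment (suppressed)
    "4.24243512005.7010005.800001",   # logical port on that port (suppressed)
    "4.24243512006.7010001",          # port on another, healthy equipment (kept)
]
print(primary_filter(events))
# ['4.24243512005', '4.24243512006.7010001']
```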

If the primary filter 140 does not use the object tree ID, events for ports belonging to equipment can still be processed, but it becomes difficult to determine the correlation between logical ports and the equipment and ports to which they belong.

The alarm processor 230 constantly monitors the buffer 210, extracts failure events, performs alarm processing, and then sends the events to the secondary filter 240.

The alarms extracted by the alarm processor 230 mainly concern links. Link information does not exist in the local gateway device 100 and is managed only by the failure management server 200.

Accordingly, the link failure handler 250 uses the object tree IDs of the failure events delivered from the local gateway device 100 to extract the failure events that belong to a link, raises an alarm for the link as failed, and passes the event to the secondary filter 240.

The link failure handler 250 raises the alarm for a link failure immediately because each individual link failure must be handled quickly.

The secondary filter 240 holds large-failure profile information and uses it to track the exact location of a failure, and the large failure manager 260 handles the large failure based on the location tracked by the secondary filter 240.

When an alarm processing request arises for a failure event that belongs to a large-failure profile, the network operator handles it with priority over all other alarms.

The large-failure profile holds information such as the object ID, hierarchy ID, type, and time of occurrence.

Here, the type indicates whether the links in the lower-hierarchy direction from the equipment are redundant. Most network operators run redundant links in the backbone network to increase network survivability, so that when one link fails, traffic is automatically rerouted over another path as a safeguard against large failures.

If the link is not redundant, all objects with a lower hierarchy ID are treated as failed. Without redundancy, a failure at this point is a large failure in which no object in the lower hierarchy can send data to other ISPs or overseas lines.

If a link goes down and redundancy exists, the port object tree IDs are used to check the upstream ports of the equipment below the link, and only if all of those upstream ports are down are all objects in the lower hierarchy treated as failed. This is because Internet routing automatically reroutes around a single failed path, which prevents a network-wide failure.

Therefore, when one of two redundant links fails, an alarm is raised only for the failure of that port and link. Recovery from a failure proceeds in the same order.
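The decision rule for redundant and non-redundant links can be summarized in a short Python sketch. The record layout and the names `LinkFailure` and `classify` are assumptions made for illustration; the patent only states that the profile type records whether the link is redundant.

```python
from dataclasses import dataclass, field


@dataclass
class LinkFailure:
    link_id: str
    redundant: bool                       # the "type" field of the large-failure profile
    # Hypothetical field: one flag per upstream port of the equipment below the link,
    # True meaning that port is down.
    uplinks_down: list = field(default_factory=list)


def classify(failure: LinkFailure) -> str:
    """Return 'large_failure' or 'link_alarm_only' following the rule in the text."""
    if not failure.redundant:
        return "large_failure"            # no alternate path: everything below is cut off
    if failure.uplinks_down and all(failure.uplinks_down):
        return "large_failure"            # redundant, but every upstream port is down
    return "link_alarm_only"              # traffic reroutes over the surviving link


print(classify(LinkFailure("4.300000032", redundant=False)))                # large_failure
print(classify(LinkFailure("4.300000032", True, [True, False])))            # link_alarm_only
print(classify(LinkFailure("4.300000032", True, [True, True])))             # large_failure
```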

Next, the operation of the failure management system in network management according to an embodiment of the present invention is described in detail with reference to FIG. 7.

As shown in FIG. 7, when a failure occurs in a network resource, the local gateway device 100 collects failure events using the ping tester 110, the SNMP processor 120, and the SYSLOG processor 130 (S1).

The primary filter 140 performs primary filtering of the failure events using the object tree ID. In the case of an equipment failure it filters out the failure events of all ports belonging to that equipment, and it handles the events that fall within the same period as one group (S2, S3).
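Grouping the events of one period can be sketched by bucketing on the collection timestamp. The patent does not say how a period is delimited, so the fixed-length window, the `(timestamp, object ID)` event shape, and the name `group_by_period` are all assumptions.

```python
from collections import defaultdict


def group_by_period(events, period_seconds: int):
    """Bucket (timestamp_seconds, object_id) events into fixed-length periods."""
    groups = defaultdict(list)
    for timestamp, object_id in events:
        groups[int(timestamp // period_seconds)].append(object_id)
    return dict(groups)


events = [
    (0, "4.24243512005"),
    (59, "4.24243512006.7010001"),
    (60, "4.24243512007"),
]
print(group_by_period(events, 60))
# {0: ['4.24243512005', '4.24243512006.7010001'], 1: ['4.24243512007']}
```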

Failure events not filtered out by the primary filter 140 are transmitted to the failure management server 200 and stored in the buffer 210 (S4).

The alarm processor 230 continuously monitors the buffer 210, reads each failure event, raises alarms for equipment and port failure events, and processes them into a form usable by the GUI management device 20 or other interworking systems (S5).

The secondary filter 240 performs secondary filtering of the failure events and checks whether a failure event for a port corresponds to a link (S6). If it is not a link failure, the event is discarded (S7); if it is a link failure, the link failure handler 250 raises an alarm (S8).

The secondary filter 240 checks whether the failure event affects a large failure. In the case of a large failure, the large failure manager 260 uses the object tree ID to treat as failed every object tree ID at or below the current hierarchy ID and passes the information to the GUI management device 20 and other systems (S9, S10).
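Steps S9 and S10 can be pictured as a downward walk over the link profiles followed by a prefix match on object tree IDs. The sketch below assumes non-redundant links, represents each link profile as a hypothetical (upper equipment ID, lower equipment ID) pair, and uses an in-memory inventory list; the patent does not spell out the hierarchy ID format, so it is not modeled here.

```python
def affected_objects(failed_equipment: str, links, inventory):
    """Collect every object at or below the failed equipment.

    links     -- (upper_equipment_id, lower_equipment_id) pairs (assumed profile form)
    inventory -- all known object tree IDs
    """
    down = {failed_equipment}
    frontier = [failed_equipment]
    while frontier:                        # walk down the hierarchy link by link
        upper = frontier.pop()
        for link_upper, link_lower in links:
            if link_upper == upper and link_lower not in down:
                down.add(link_lower)
                frontier.append(link_lower)
    # An object is affected if it is a failed equipment or sits below one in the tree.
    return sorted(
        oid for oid in inventory
        if any(oid == eq or oid.startswith(eq + ".") for eq in down)
    )


core, agg, acc, other = ("4.24243512001", "4.24243512002",
                         "4.24243512003", "4.24243512009")
links = [(core, agg), (agg, acc)]
inventory = [core, agg, agg + ".7010001", acc, acc + ".7010001.800001", other]
print(affected_objects(agg, links, inventory))
# ['4.24243512002', '4.24243512002.7010001',
#  '4.24243512003', '4.24243512003.7010001.800001']
```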

For failures alarmed as above, the corresponding objects are marked as failed in the failure profile, and the information is provided to the GUI management device 20 or to the interworking management device 10 for interworking with other systems, so that the failure status is propagated.

As described above, the embodiment of the present invention combines object IDs and hierarchy IDs to locate failures accurately, simply, and with good performance in a large-scale IP network without a complex event correlation system, and can serve as a means of coping with large failures.

Although the preferred embodiment of the present invention has been described in detail above, the present invention is not limited thereto, and various other changes and modifications are possible.

As described above, the failure management system and method in network management according to the present invention track and manage the location of a failure in a network resource using dual filtering and object tree IDs for the network elements. As a result, the correlation between network elements is easily identified, failures can be handled accurately and quickly, and large failures can be handled without analyzing the correlation of complex failure events.

In addition, the failure management system and method in network management according to the present invention enable a network operator to handle failures accurately and quickly and thus operate the network stably, thereby earning the trust of subscribers.

Claims

A failure management system in network management, comprising:

a local gateway device that collects failure event information from network equipment, performs primary filtering on the failure event information, and transmits the failure events occurring within the same period as a single group; and

a failure management server that performs secondary filtering on the failure events transmitted from the local gateway device, raises failure alarms for the failure events related to the network equipment, ports, and links, and, for the failure events related to a large failure, tracks the location of the failure and handles the failure, including the objects correlated with the object in which the failure occurred,

wherein the failure management server comprises:

a resource object manager that manages the network equipment in the form of physical and logical object trees and manages an object tree ID by assigning an ID to each object;

a buffer that stores the failure events transmitted from the local gateway device;

an alarm processor that monitors the failure events in the buffer and extracts the failure events by type of failure;

a link failure handler that processes, as a link failure, each failure event belonging to a link among the failure events stored in the buffer;

a secondary filter that checks whether an event transmitted after alarm processing by the alarm processor or the link failure handler relates to a large failure, and tracks the location of the large-failure object using the object tree ID of the resource object manager; and

a large failure manager that handles the large failure for all object tree IDs below the large-failure object identified by the secondary filter.

The failure management system in network management of claim 1,

wherein the local gateway device comprises:

a ping tester that periodically performs an Internet Control Message Protocol (ICMP) ping test on the network equipment and transmits the test result;

an SNMP processor that accesses the network equipment using the Simple Network Management Protocol (SNMP), periodically collects the port status and trap information of each piece of equipment, and transmits the collection result;

a log information processor that converts the remote log information transmitted from the network equipment into the data form used in the system and transmits the resulting event information; and

a primary filter that analyzes and performs primary filtering on the information transmitted from the ping tester, the SNMP processor, or the log information processor and, when a failure condition related to the network equipment is detected, groups the failure events of all ports belonging to that equipment into one group and sends the group to the failure management server.

The failure management system in network management of claim 2,

wherein the SNMP processor comprises:

an operation state monitoring unit that reads the operation state of the network equipment at a set period and collects information on whether each port of each piece of network equipment is up or down; and

a trap parsing unit that parses the trap information transmitted from the network equipment, converts it into the data form used in the system, and transmits the converted event information to the primary filter.

delete

The failure management system in network management of claim 1,

wherein the object tree ID of the resource object manager

comprises an object ID indicating the physical and logical correlation of the object and a hierarchy ID indicating the upper/lower-level relationship of the object.

The failure management system in network management of claim 5,

wherein the object ID

is generated by assigning a network identification number, an equipment identification number, a link identification number, a chassis identification number, a module identification number, a card identification number, a port identification number, and a logical port identification number.

The failure management system in network management of claim 5,

wherein the object ID of each lower-level object is defined so as to include the object ID of the higher-level object to which it belongs.

The failure management system in network management of claim 2,

wherein the primary filter,

upon detecting a failure of an object tree ID belonging to a specific level by using the object tree ID of the resource object manager, ignores the failure state of the object tree IDs belonging to the lower levels of that object tree ID.

A failure management method in network management, comprising:

A) collecting failure event information from network equipment and classifying the failure event information according to the type of failure;

B) extracting, from the failure event information of step A), the events in which a failure related to the network equipment or a port has occurred, and raising an alarm; and

C) for the failure events of step A) that relate to a large failure, tracking the location of the failed object and handling the large failure, including the objects correlated with the failed object,

wherein, in collecting the failure data in step A),

a local gateway device that periodically monitors the network equipment for failures performs primary filtering on the failure data and, when a failure condition related to the network equipment is detected, transmits the failure events of all ports belonging to that equipment as a single group,

and wherein handling the large failure in step C) comprises:

a) receiving the event resulting from the alarm processing of step B), performing secondary filtering to check whether a large failure exists, and tracking the location of the large-failure object using the object tree ID; and

b) handling the large failure for all object tree IDs below the large-failure object identified by the secondary filtering of step a).

delete

The method of claim 9,

wherein classifying the failure event in step A) comprises

assigning a unique ID to each object of the network equipment to generate physical and logical object tree IDs, and analyzing the failure event using the object tree ID to identify the type and location of the failure.

The method of claim 11,

wherein the primary filtering of the failure event comprises,

upon detecting a failure of an object tree ID belonging to a specific level by using the object tree ID, ignoring the failure state of the object tree IDs belonging to the lower levels of that object tree ID.

delete

The method of claim 11,

The object tree ID is,

The method of claim 14,

wherein the object ID of each lower-level object is defined so as to include the object ID of the upper-level object to which it belongs.

The method of claim 14,

wherein the object ID

is generated by assigning a network identification number, an equipment identification number, a link identification number, a chassis identification number, a module identification number, a card identification number, a port identification number, and a logical port identification number.

The method of claim 16,

The object ID is,

The network identification number is generated using a unique ID assigned to each service domain,

The equipment identification number is generated by attaching the regional headquarters and local office number to which the equipment belongs, the type of equipment, and the serial number,

The link identification number is defined as a logical object belonging to each device,

The chassis identification number is defined as a physical object belonging to each device,

The module identification number is defined as a physical object belonging to each chassis,

The card identification number is defined as a physical object belonging to each module,

The port identification number is defined as a physical object belonging to the device,

And the logical port identification number is defined as a logical object belonging to each port.