KR20170127876A

KR20170127876A - System and method for dealing with troubles through fault analysis of log

Info

Publication number: KR20170127876A
Application number: KR1020160058657A
Authority: KR
Inventors: 김영호; 임은지; 차규일; 배승조; 김원영
Original assignee: 한국전자통신연구원
Priority date: 2016-05-13
Filing date: 2016-05-13
Publication date: 2017-11-22

Abstract

The present invention relates to a system and a method for analyzing a fault in a cluster-shaped high-performance computing system, in which a large-scale computing system node formed of various resources is connected through a network, and responding to a trouble. According to the present invention, the trouble responding system based on a log fault analysis comprises: a resource log collector for collecting an event or log information from a configured node; a fault log analysis device transmitting fault related information when it is recognized by analyzing the log information that the system has a trouble, analyzing associated resources of a resource with a fault, and analyzing the severity of the fault; a resource fault processing device resolving a faulty state of the resource when a trouble is recognized; a fault log reporting device reporting fault related information to a user; and a policy and filter management device setting and storing an event or a log collection policy, information on a fault analysis filter, resource association information, and a failure notification policy.

Description

System and method for failure response based on log fault analysis

The present invention relates to a system and method for analyzing faults in, and responding to failures of, a cluster-type high-performance computing system in which large-scale computing nodes composed of various resources are connected by a network.

In a high-performance computing system composed of a large number of nodes, many managed nodes operate in conjunction with one another. A failure of the system or a service is therefore recognized and the service system is diagnosed to determine whether the failure affects the entire system or only part of it.

When an error or log is generated, the node has difficulty continuing the job or service it was running. According to the related art, the error is simply forwarded to an administrator, who then responds to it, so the availability and safety of the overall system or service are degraded.

The present invention has been proposed to solve the above problems. Its object is to provide a log fault analysis-based failure response system and method that, in a high-performance computing cluster system composed of computing nodes with many resources connected by a high-speed, low-latency network, analyzes faults and errors based on logs collected from hardware resources or software and thereby enables a preemptive response to faults and errors.

The log fault analysis-based failure response system according to the present invention comprises: a resource log collector that collects event or log information from constituent nodes; a fault log analyzer that analyzes the log information and, when it recognizes that a system failure has occurred, transmits fault-related information, and that analyzes the resources associated with the faulty resource and the severity of the fault; a resource fault handler that resolves the faulty state of the resource when a failure is recognized; a fault log reporter that reports the fault-related information to a user; and a policy and filter manager that sets and stores an event or log collection policy, information on fault analysis filters, resource association information, and a failure notification policy.

The log fault analysis-based failure response method according to the present invention comprises: setting a collection policy for events or logs, fault filter information, and resource association information; collecting event logs of hardware resources and software log information from computing nodes; analyzing whether a fault has occurred using the fault filters and the resource association information; and, when the analysis determines that a fault has occurred, storing the fault log analysis information in a normalized form, giving notification of the fault, and executing a fault handling routine.

In a high-performance computing cluster environment composed of computing nodes with many resources, the log fault analysis-based failure response system and method according to the present invention analyze whether the hardware resources or software of a node are faulty and how severe the fault is, preemptively execute a failure handling routine, and notify the administrator. This makes it possible to respond to system errors in advance, to recover a faulty system quickly, and to improve the stability and management efficiency of the system.

In addition, pre-registered failure and fault handling routines shorten the time from failure recognition to recovery, improving the availability and reliability of the system and its services.

In addition, statistical data on fault occurrences make it possible to efficiently understand and manage the failure and operating status of the system.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a block diagram showing an overall system to which a log fault analysis-based failure response system according to an embodiment of the present invention is applied.
FIG. 2 is a detailed block diagram showing an overall system to which a log fault analysis-based failure response system according to an embodiment of the present invention is applied.
FIG. 3 is an exemplary diagram showing resource associations according to an embodiment of the present invention.
FIG. 4 is a diagram showing fault severity according to an embodiment of the present invention.
FIG. 5 is a flowchart showing a log fault analysis-based failure response method according to an embodiment of the present invention.
FIG. 6 is a flowchart showing log-based fault analysis and fault handling operations according to an embodiment of the present invention.

The above and other objects, advantages, and features of the present invention, and methods of achieving them, will become apparent from the embodiments described in detail below with reference to the accompanying drawings.

However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments below are provided only to readily convey the objects, configuration, and effects of the invention to those of ordinary skill in the art to which the invention pertains, and the scope of the invention is defined by the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the invention. In this specification, the singular form includes the plural unless the context specifically states otherwise. As used herein, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those stated.

Before describing the preferred embodiments of the present invention, the background against which the invention was proposed is first described to aid the understanding of those skilled in the art.

In high-performance computing built from large numbers of nodes, systems on the scale of tens of petaflops (PFlops) consisting of thousands of nodes or more have actually been built and are in service, and exascale systems are expected to appear within a few years.

In an exascale system with a very large number of nodes, low-probability errors and failures that were not observed in existing lower-scale complex computing systems can become routinely occurring events.

An exascale system contains a very large number of computing resources that together form a large-scale network, and these resources are managed by software. Causes of resiliency problems therefore exist across multiple domains and layers.

The various hardware resources that make up a node, and the software running on its operating system, generate events and log messages in error situations whose formats and contents vary widely and have no consistent form.

Research has been conducted on standardizing fault-related messages, but implementation and development must conform to the interface it provides, so it is difficult to apply to the many devices and software products that have already been developed.

Industry-standard forms of log information are in use, such as the System Event Log of hardware devices and the system log (Syslog) of daemon software, but in practice some devices and software do not support them, so it is difficult to process everything as standardized log messages or information.

According to the related art, when an error or log is generated and the node can no longer readily continue its job or service, the error is simply forwarded to an administrator, who then responds to it. The availability and stability of the overall system or service are degraded as a result.

Also according to the related art, some commercial management systems let a threshold or condition be set for an error situation and automatically run an error handling routine when that situation occurs. When the cause of the error is the resource itself or a transient error, the resource can be restored by, for example, re-running it; when the error is caused by hardware or another resource, however, the underlying problem is not resolved.

The present invention has been proposed to solve the above problems. For a high-performance computing cluster system composed of computing nodes with many resources connected by a high-speed, low-latency network, it proposes a log fault analysis-based failure response system and method that analyzes faults and errors based on logs collected from hardware resources or software, responds to them preemptively, and gives notification of the fault.

According to the present invention, log information generated during the operation of hardware and software resources is collected; filters are applied to analyze and store whether faults or errors have occurred; a configured fault handling routine is executed to recover from the fault; and fault analysis and monitoring are supported by periodically notifying the administrator according to a defined policy.

In addition, resource association information is used to check resources associated with the faulty resource for faults and normal operation, and an error handling routine is executed for any problem found. A fault severity concept is also introduced: when the same problem occurs repeatedly, the system does not stop at notification or an error handling routine but suspends the use and operation of the resource and sends the administrator a separate notification of a serious failure.

According to the present invention, fault-related filters provided by the system by default and user-defined filters are applied to analyze whether a fault or failure has occurred. The correlations between the hardware resources or software subject to log collection are linked to the filters, so that faults in resources associated with the faulty resource are analyzed and a fault handling routine is executed to restore normal operation.

FIG. 1 is a block diagram showing an overall system to which a log fault analysis-based failure response system according to an embodiment of the present invention is applied.

The log fault analysis-based failure response system (100) according to an embodiment of the present invention periodically collects event or log information from all constituent nodes (200), in accordance with a previously established log collection policy.

In each node (210), which consists of hardware resources (230) and software resources (220), events and log information are generated in a variety of situations.

The log fault analysis-based failure response system (100) analyzes the collected information by applying filters for fault analysis, stores the analyzed information and fault information in a database (110), and notifies the user of any fault through a user tool (300).

The user checks fault log notifications through the fault notification receiver (320) of the user tool (300) and views detailed fault information through the fault log viewer (310).

The log fault analysis-based failure response system (100) does not merely give notification of a fault; it also calls a fault handling routine registered by the user to recover from the fault or failure.

In addition, the log fault analysis-based failure response system (100) uses association mapping information between resources to check whether resources associated with the faulty resource are themselves faulty, and responds to them in turn.

FIG. 2 is a detailed block diagram showing an overall system to which a log fault analysis-based failure response system according to an embodiment of the present invention is applied.

The log fault analysis-based failure response system according to an embodiment of the present invention comprises: a resource log collector (120) that collects event or log information from constituent nodes; a fault log analyzer (130) that analyzes the log information and, when it recognizes that a system failure has occurred, transmits fault-related information, and that analyzes the resources associated with the faulty resource and the severity of the fault; a resource fault handler (140) that resolves the faulty state of the resource when the fault log analyzer (130) recognizes a failure; a fault log reporter (150) that reports the fault-related information to a user; and a policy and filter manager (160) that sets and stores an event or log collection policy, information on fault analysis filters, resource association information, and a failure notification policy.

The policy and filter manager (160) according to an embodiment of the present invention stores information on the fault analysis filters, which include fault-related filters provided by the system by default and user-defined filters.
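The combination of default and user-defined filters can be illustrated with a minimal sketch. The text does not say how a filter is represented, so the sketch assumes each filter is a named regular expression matched against the log message; the filter names, the patterns, and the function `match_filters` are all hypothetical.

```python
import re

# Hypothetical filter sets: the patent does not define a filter format,
# so each filter is assumed to be a named regular expression.
builtin_filters = {"hw_error": re.compile(r"error", re.IGNORECASE)}
user_filters = {"disk_full": re.compile(r"no space left on device", re.IGNORECASE)}


def match_filters(msg, *filter_sets):
    """Return the names of all filters, across every filter set, that match msg."""
    return [
        name
        for filters in filter_sets
        for name, pattern in filters.items()
        if pattern.search(msg)
    ]


hits = match_filters("write failed: No space left on device",
                     builtin_filters, user_filters)
print(hits)
```

Keeping the two sets separate and merging them only at match time is one way to let user-defined filters be added or removed without touching the defaults.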

The resource log collector (120) according to an embodiment of the present invention collects events and log information generated by the hardware and software resources of a node, and does so periodically according to the log collection policy set by the policy and filter manager (160).

The resource log collector (120) parses hardware event logs and software logs and stores them in a normalized format.

The normalized format for storing log information of hardware resources is shown in [Table 1] below.

Here, filter is a field indicating the error level; seq is a number field; host is a field indicating the host name; date is a field indicating the date on which the event occurred; time is a field indicating the time at which the event occurred; msg is a field storing the content of the event; tag is a field indicating tag information; and mail_yn is a field indicating whether mail is sent for the event.

The normalized format for storing log information of software is shown in [Table 2] below.

Here, host is a field indicating the host name; facility is a field indicating where the event occurred, such as daemon, authpriv, or kern; priority is a field indicating the error level (used as the filter); tag is a field indicating tag information; date is a field indicating the date on which the event occurred; time is a field indicating the time at which the event occurred; program is a field indicating the program in which the event occurred; msg is a field storing the content of the event; seq is a field indicating a number; and mail_yn is a field indicating whether mail is sent for the event.
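The two normalized formats described above can be sketched as plain record types. The field names come from the text; the Python types, the "Y"/"N" encoding of `mail_yn`, and the sample values are assumptions made only for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass
class HardwareLogRecord:
    """Normalized hardware event record; field names follow the Table 1 description."""
    filter: str    # error level
    seq: int       # number field (integer type is an assumption)
    host: str      # host name
    date: str      # date the event occurred
    time: str      # time the event occurred
    msg: str       # content of the event
    tag: str       # tag information
    mail_yn: str   # whether mail is sent for the event ("Y"/"N" is an assumption)


@dataclass
class SoftwareLogRecord:
    """Normalized software log record; field names follow the Table 2 description."""
    host: str      # host name
    facility: str  # where the event occurred, e.g. daemon, authpriv, kern
    priority: str  # error level (used as the filter)
    tag: str       # tag information
    date: str      # date the event occurred
    time: str      # time the event occurred
    program: str   # program in which the event occurred
    msg: str       # content of the event
    seq: int       # number field (integer type is an assumption)
    mail_yn: str   # whether mail is sent for the event ("Y"/"N" is an assumption)


# Sample values are invented for illustration only.
record = SoftwareLogRecord(
    host="node01", facility="daemon", priority="err", tag="sshd",
    date="2016-05-13", time="10:15:00", program="sshd",
    msg="example message", seq=1, mail_yn="Y",
)
print(asdict(record)["priority"])
```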

The fault log analyzer (130) according to an embodiment of the present invention analyzes the log information by applying the fault analysis filters set by the policy and filter manager (160), and extracts and classifies the collected information to infer the root cause of the failure.

The fault log analyzer (130) uses resource association information, that is, association mapping information between resources, to check for faults in resources associated with the faulty resource.

FIG. 3 is an exemplary diagram showing resource associations according to an embodiment of the present invention, in which processes A, B, C, and D, which are software resources, require processes F, G, and H in order to operate.

When fault analysis confirms a fault in process F, the fault log analyzer (130) performs fault analysis on, or checks the operating state of, processes A and B, which are associated with process F.

Likewise, when a problem is confirmed in process G, the fault log analyzer (130) performs fault analysis and checks on processes B and C.
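The lookup described for FIG. 3 can be sketched as a small mapping from a faulty resource to the resources that must be checked next. Only the F and G associations are stated in the text, so process H is left out of the mapping; the dictionary layout and the function name `associated_resources` are assumptions.

```python
# Associations stated in the text for FIG. 3: a fault in F leads to checks
# on A and B, and a fault in G leads to checks on B and C. The associations
# of process H are not given in the text, so H is omitted here.
RESOURCE_ASSOCIATIONS = {
    "F": ["A", "B"],
    "G": ["B", "C"],
}


def associated_resources(faulty, associations=RESOURCE_ASSOCIATIONS):
    """Return the resources to check when `faulty` has a confirmed fault."""
    return list(associations.get(faulty, []))


print(associated_resources("F"))
print(associated_resources("G"))
```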

According to the present invention, the system does not stop at analyzing or responding to the fault in the faulty resource itself. Through the mapping of associated resources, it also analyzes and responds to the other resources affected by the faulty resource, thereby supporting improved resource fault management.

When analysis by the fault log analyzer (130) confirms a failure or fault in a resource, the resource fault handler (140) according to an embodiment of the present invention calls a fault handling routine previously registered in the database (110), which stores fault logs, fault analysis information, filter information, and statistical information, to resolve the faulty state of the resource and return it to normal operation.

As described above, the resource fault handler (140) also resolves the faulty state of associated resources and returns them to normal operation, according to the results of analyzing the resources associated with the faulty resource.

The fault log reporter (150) according to an embodiment of the present invention displays fault log information on screen at the user's request, or sends fault-related information by mail or message according to a failure notification policy preset for the notification interval.

The user can view detailed fault information through the fault log viewer (310) shown in FIG. 1 and can check mail or messages received from the fault log reporter through the fault notification receiver (320).

The fault log analyzer (130) according to an embodiment of the present invention analyzes fault severity using the number of fault occurrences, the occurrence interval, and fault log information.

FIG. 4 is a diagram showing fault severity according to an embodiment of the present invention.

Fault severity is determined by the fault level, the fault occurrence interval, and the number of fault occurrences, which are extracted from fault log analysis information and fault statistics.

The set values for the fault occurrence interval and the number of occurrences can be configured separately by the administrator according to the system or resource situation.

As shown in FIG. 4, the fault level determines the failure response mechanism.

Levels 3 and 4 are the levels at which a failure handling routine must be executed: at level 3 the operating state of the resource is checked and the failure handling routine is then called, whereas at level 4 the failure handling routine is called immediately.
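The difference between level 3 and level 4 can be sketched as a small dispatch function. The callbacks `check_state` and `run_routine` and the function name are hypothetical, and the text does not describe what happens below level 3, so the sketch simply takes no action there.

```python
def respond_to_level(level, resource, check_state, run_routine):
    """Apply the level 3 / level 4 behaviour and return the steps taken, in order."""
    steps = []
    if level == 3:
        # Level 3: check the operating state first, then call the routine.
        check_state(resource)
        steps.append("check_state")
        run_routine(resource)
        steps.append("run_routine")
    elif level == 4:
        # Level 4: call the failure handling routine immediately.
        run_routine(resource)
        steps.append("run_routine")
    # Other levels: not described in the text, so nothing is done here.
    return steps


calls = []
steps_l3 = respond_to_level(3, "proc-F",
                            lambda r: calls.append(("check", r)),
                            lambda r: calls.append(("run", r)))
steps_l4 = respond_to_level(4, "proc-F",
                            lambda r: calls.append(("check", r)),
                            lambda r: calls.append(("run", r)))
print(steps_l3, steps_l4)
```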

The fault occurrence interval and the number of occurrences are factors that are evaluated together: even if the number of occurrences is small, fault severity is calculated as high when the occurrence interval is trending shorter.

Likewise, even within the same range of occurrence counts, fault severity differs depending on the occurrence interval and on how that interval is changing.
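The interaction between occurrence count and interval trend can be illustrated with a toy score. The text gives no formula, so everything here is an assumption: the score is the occurrence count, multiplied by a hypothetical `trend_weight` when every interval between occurrences is shorter than the one before it.

```python
def intervals(timestamps):
    """Gaps between consecutive occurrence times (seconds assumed)."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def is_accelerating(timestamps):
    """True when each interval is strictly shorter than the previous one."""
    gaps = intervals(timestamps)
    return len(gaps) >= 2 and all(b < a for a, b in zip(gaps, gaps[1:]))


def severity_score(timestamps, trend_weight=2.0):
    """Hypothetical score: occurrence count, weighted up if intervals are shrinking."""
    count = len(timestamps)
    return count * trend_weight if is_accelerating(timestamps) else float(count)


steady = [0, 100, 200, 300, 400]   # five faults at a constant interval
speeding_up = [0, 100, 150, 170]   # four faults, intervals 100, 50, 20
print(severity_score(steady), severity_score(speeding_up))
```

With these invented inputs the resource with fewer faults scores higher because its faults are arriving faster, which is the behaviour the passage describes; a real implementation would use the administrator-configured interval and count settings instead of a fixed weight.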

When fault severity exceeds a threshold, the resource fault handler (140) suspends use of the faulty resource, and the fault log reporter (150) notifies the user that the resource has been suspended because of the fault severity.

By introducing the concept of fault severity, the present invention manages statistics on faults and errors so that, when the same problem occurs repeatedly, the system does not stop at notification or a fault handling routine but suspends the use and operation of the resource and sends the user (administrator) a separate notification of a serious failure.

FIG. 5 is a flowchart showing a log fault analysis-based failure response method according to an embodiment of the present invention.

The log fault analysis-based failure response method according to an embodiment of the present invention comprises: setting a collection policy for events or logs, fault filter information, and resource association information (S100); collecting event logs of hardware resources and software log information from computing nodes (S200); analyzing whether a fault has occurred using the fault filters and the resource association information (S300); determining, from the analysis, whether a fault log has occurred (S400); and, when it is determined that a fault has occurred, storing the fault log analysis information in a normalized form, giving notification of the fault, and executing a fault handling routine (S500).

Step S100 sets the collection policy, which is the collection interval for events or logs stored in the policy and filter manager; the fault filter information, which is provided by the system or the user and stored in the policy and filter manager; and the resource association information, which is association mapping information between resources.

Step S200 parses the event logs of hardware resources and the software logs and stores them in a normalized format.

Step S300 analyzes the collected log information for faults and failures based on the fault filters and the resource association information.

When a fault or failure is confirmed in a particular resource in step S400, step S500 notifies the user (administrator) of the log fault analysis result through the fault log reporter (150), processes requests for detailed information made by the user (administrator) through the fault log viewer (310), and returns the results.

In step S500, the designated user (administrator) is notified of the fault by mail or message according to a preset notification policy (interval, number of repetitions, and so on).

도 6은 본 발명의 실시예에 따른 S500 단계, 즉 로그 기반 결함 분석 및 결함 처리 연산을 나타내는 순서도이다. FIG. 6 is a flowchart illustrating steps S500 according to an embodiment of the present invention, that is, a log-based defect analysis and a defect processing operation.

When analysis of the log information with the defect filter applied confirms that a defect has occurred (S510), it is checked whether the defect is recurring (S520).

If the defect is determined to be recurring, the defect severity is analyzed using the number of occurrences, the occurrence period, and the defect log information (S530).
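The specification names the inputs to the severity analysis but gives no formula. The sketch below uses an assumed score, occurrences per hour multiplied by a weight, where the weight is a hypothetical stand-in for whatever is derived from the defect log information (for example, the log level).

```python
from datetime import datetime


def defect_severity(timestamps, weight=1.0):
    """Assumed score: occurrences per hour over the observed span, times a weight.

    A single occurrence has no period, so it scores 0.0 (not recurring).
    """
    if len(timestamps) < 2:
        return 0.0
    span_h = (max(timestamps) - min(timestamps)).total_seconds() / 3600
    if span_h == 0:
        return float("inf")  # several defects logged at the same instant
    return weight * len(timestamps) / span_h


occurrences = [
    datetime(2016, 5, 13, 10, 0),
    datetime(2016, 5, 13, 10, 30),
    datetime(2016, 5, 13, 11, 0),
]
severity = defect_severity(occurrences)  # 3 occurrences over 1 hour
```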

The defect severity is compared with a threshold (S540). If the severity exceeds the threshold, the defective resource is judged to be operating abnormally or to be difficult to keep in service, and use of the defective resource is suspended (S550).

Then, after use of the defective resource has been suspended, the user (administrator) is notified of the defect and of the suspension through the defect log reporter, and the related information is stored (S560).

If step S520 finds that the defect is not recurring, or if step S540 finds that the defect severity is at or below the threshold, the occurrence of the defect is reported to the user (S570).

Step S570 stores the defect occurrence notification and the defect analysis information, and step S580 calls and executes a pre-registered defect response routine.
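The branching among steps S520 through S580 can be summarized in a short sketch. The function name and action strings are hypothetical. The sketch also assumes that the response routine of S580 runs after both branches, which the text does not state explicitly.

```python
def handle_defect(is_recurring, severity, threshold):
    """Return the actions taken for one defect, labelled by flowchart step."""
    actions = []
    if is_recurring and severity > threshold:  # S520, S530, S540
        actions.append("S550: suspend resource")
        actions.append("S560: notify suspension and store")
    else:
        actions.append("S570: report defect and store analysis")
    # Assumption: S580 follows both branches.
    actions.append("S580: run registered response routine")
    return actions


severe = handle_defect(is_recurring=True, severity=5.0, threshold=3.0)
mild = handle_defect(is_recurring=True, severity=3.0, threshold=3.0)
```

A severity equal to the threshold takes the S570 branch, matching the "at or below the threshold" condition of step S540.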

Next, step S590 checks whether any resource associated with the defective resource exists.

If no associated resource exists, defect processing is complete. If associated resources exist, the defect analysis step S510 through the defect response step S580 are repeated until they have been performed for every associated resource.
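The repetition over associated resources can be sketched as a traversal of the resource association map. Following associations transitively and tracking visited resources to avoid processing one resource twice are assumptions of this example; the text only states that the steps are repeated for all associated resources. The resource names are hypothetical.

```python
from collections import deque


def process_with_associated(start, associations, handle):
    """Run `handle` on `start` and on every resource reachable through the map."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        resource = queue.popleft()
        handle(resource)  # stands in for S510 through S580
        order.append(resource)
        for other in associations.get(resource, []):  # S590
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return order


associations = {
    "node1/mem0": ["node1/cpu0", "node1/os"],
    "node1/os": ["node1/mem0"],  # a cycle back to the starting resource
}
handled = []
order = process_with_associated("node1/mem0", associations, handled.append)
```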

According to the present invention, when the defect state of a resource is severe, the resource is isolated by suspending its use and the operation of associated resources is stopped, which reduces the load on system resource operation and services.

The present invention has been described above with reference to its embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered from an illustrative rather than a restrictive point of view. The scope of the present invention is set forth in the claims rather than in the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

100: Failure response system based on log defect analysis
110: Database 120: Resource log collector
130: Defect log analyzer 140: Resource defect processor
150: Defect log reporter 160: Policy and filter manager
200: Node set 210: Node
220: Software resource 230: Hardware resource
300: User tools 310: Defect log viewer
320: Defect notification receiver

Claims

1. A failure response system based on log defect analysis, comprising:
a resource log collector that collects event or log information from configured nodes;
a defect log analyzer that analyzes the log information, transmits defect-related information when it recognizes that a system failure has occurred, analyzes the resources associated with the defective resource, and analyzes the defect severity;
a resource defect processor that resolves the defect state of a resource when the defect log analyzer recognizes a failure;
a defect log reporter that reports the defect-related information to a user; and
a policy and filter manager that sets and stores the event or log collection policy, information on the defect analysis filter, resource association information, and a failure notification policy.

2. The system of claim 1,
wherein the resource log collector periodically collects the event or log information according to the log collection policy set by the policy and filter manager.

3. The system of claim 1,
wherein the resource log collector parses hardware event logs and software logs and stores them in a normalized format.

4. The system of claim 1,
wherein the defect log analyzer analyzes the log information by applying the defect analysis filter set by the policy and filter manager, extracts and classifies the collected information, and infers the root cause of the failure.

5. The system of claim 1,
wherein the defect log analyzer uses the resource association information, which is association mapping information between resources, to check the associated resource for defects.

6. The system of claim 5,
wherein the resource defect processor calls a defect processing routine pre-registered in the database to resolve the defect state of the resource and the associated resource so that they operate normally.

7. The system of claim 1,
wherein the defect log reporter outputs the defect log information on a screen at the user's request, or transmits the defect-related information by mail or message at the notification cycle defined by the preset failure notification policy.

8. The system of claim 1,
wherein the defect log analyzer analyzes the defect severity using the number of defect occurrences, the occurrence period, and the defect log information.

9. The system of claim 8,
wherein the resource defect processor suspends use of the defective resource when the analyzed defect severity exceeds a threshold.

10. The system of claim 9,
wherein the defect log reporter notifies the user of the suspension of the defective resource in accordance with the defect severity.

11. The system of claim 1,
wherein the policy and filter manager stores information on the defect analysis filter, including defect-related filters provided by default in the system and user-defined filters.

12. A failure response method based on log defect analysis, comprising:
(a) setting a collection policy, defect filter information, and resource association information for events or logs;
(b) collecting hardware resource event logs and software log information from computing nodes;
(c) analyzing whether a defect exists using the defect filter and the resource association information; and
(d) if the analysis determines that a defect has occurred, storing the defect log analysis information in a normalized form, notifying the occurrence of the defect, and performing a defect processing routine.

13. The method of claim 12,
wherein step (a) comprises setting the collection policy, which is a collection cycle for the events or logs, the defect filter information provided by the system or a user, and the resource association information, which is association mapping information between resources.

14. The method of claim 12,
wherein step (b) comprises parsing the hardware resource event logs and the software logs and storing them in a normalized format.

15. The method of claim 12,
wherein step (c) comprises analyzing the defect severity using the number of defect occurrences, the occurrence period, and the defect log information.

16. The method of claim 15,
wherein step (d) comprises suspending use of the defective resource when the defect severity exceeds a threshold, and notifying the suspension.

17. The method of claim 12,
wherein step (c) comprises analyzing, using the resource association information, whether a resource associated with the corresponding resource is defective.

18. The method of claim 12,
wherein step (d) comprises notifying the occurrence of the defect at the user's request, or notifying the occurrence of the defect by mail or message at a preset notification cycle.