KR102473637B1

KR102473637B1 - Apparatus and method for managing trouble using big data of 5G distributed cloud system

Info

Publication number: KR102473637B1
Application number: KR1020180038420A
Authority: KR
Inventors: 이진구; 김범수; 김재우; 임형묵
Original assignee: 주식회사 케이티
Priority date: 2017-06-29
Filing date: 2018-04-03
Publication date: 2022-12-02
Also published as: KR20190002280A; KR20220166760A; KR102580916B1

Abstract

The present invention relates to an apparatus and method for managing failures by analyzing big data logs in a 5G distributed cloud system. The failure management apparatus according to the present invention includes: a message conversion unit that converts big data logs generated in the infrastructure of the cloud system into failure-related messages; a failure detection unit that detects failures by applying a rule engine to the converted messages; a failure graphic providing unit that, for on-screen display of each detected failure, generates topology-based graphic information for the infrastructure object in which the failure occurred and provides it to an operator terminal; and a failure recovery unit that generates failure recovery information for the detected failure and provides it to the operator terminal.

Description

Apparatus and method for managing trouble using big data of 5G distributed cloud system

The present invention relates to failure management technology, and more particularly to an apparatus and method that use big data generated in a 5G distributed cloud system to provide failure management encompassing data collection and analysis, failure presentation, and recovery.

FIG. 1 shows the layered structure of the cloud infrastructure constituting a conventional distributed cloud system 100 for 5G service. The first layer, the hardware layer 110, consists of various types of physical equipment such as servers, storage, and network devices (switches). The second layer, the OS layer 130, consists of software (e.g., Linux) that operates and manages the underlying hardware. The third layer, the virtualization layer 150, divides physical resources into logical resources and dynamically allocates computing, networking, and storage resources on customer demand. The fourth layer, the cloud layer 170, is a platform that exercises overall control over all resources of the upper three layers 110 to 150; it executes the various commands requested by the upper layers 110 to 150 and returns responses to them. In addition, the cloud layer 170 provides cloud services to subscribers by executing applications such as the services and products defined in the service and automation layer 180. Finally, the management layer 190 on the right side performs the functions required to operate and manage the multi-layer cloud system 100 efficiently, such as resource management, configuration and control, and failure monitoring and recovery.

In the infrastructure layer structure of the cloud system 100, a failure occurring in any one layer affects the layers above and below it. That is, one must determine in which layer the failure occurred, how far up or down the layers its effects extend, and through which layer recovery should be approached; efficient operation and management therefore requires a variety of experts in both hardware and software.

FIG. 2 is a schematic configuration diagram of the cloud platform 200, the software of the cloud layer 170 in the distributed cloud system 100 of FIG. 1. The cloud platform 200, the core software of the distributed cloud system 100, uses a distributed architecture in which several independently developed subsystems 210 to 270 come together to form a single cloud. In the case of the OpenStack cloud platform 200, currently the most widely used, an architecture of seven subsystems 210 to 270 is the minimum unit that constitutes a cloud.

The cloud platform 200 consists of a portal subsystem 210 through which customers access the cloud via the web, a virtual computing subsystem 220 that provides the computing resources customers request, a virtual networking subsystem 230 that provides network resources, a virtual storage subsystem 240 that provides storage resources, an image subsystem 250 that provides image (operating system) resources, an authentication management subsystem 260 that provides security services, and an object storage subsystem 270 that provides object-based storage services.

When a failure occurs in the cloud platform 200, whose distributed architecture comprises these subsystems 210 to 270, it is necessary to trace which of the subsystems 210 to 270 the failure originated in and to analyze its effect on neighboring subsystems and on the other infrastructure layers. Consequently, without IT specialists who have long experience, monitoring the operation of the entire cloud system, analyzing the cause of a failure, and recovering from it take a long time, which can greatly reduce system reliability; improvement measures are therefore needed.

FIG. 3 is a schematic flowchart of conventional operation management processing for the distributed cloud system 100 of FIG. 1. In FIG. 3, the conventional operation management flow consists of collection 310, filtering 330, presentation 350, and failure recovery 370. In collection 310, in order to monitor the failure state of the entire cloud, the cloud system 100 collects structured data in a fixed, intuitively understandable format: numerical information such as central processing unit (CPU) load, memory utilization, and hard disk utilization, and the operating status of the important processes and services of the cloud platform 200. In filtering 330, the cloud system 100 determines whether a failure exists by comparing the collected structured data against specific thresholds or by applying the simple criterion of whether a status is running or not. In presentation 350, failure events found by filtering are shown visually on a table screen in a graphical environment such as the web, or announced audibly, to notify the operator.
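The conventional filtering step 330 can be pictured with a minimal Python sketch. The metric names, threshold values, and service names below are illustrative assumptions and do not come from the specification; the sketch only shows the kind of threshold and up/down checks the paragraph describes.

```python
# Hypothetical sketch of the conventional filtering step (330).
# Metric names and threshold values are assumptions for illustration only.
THRESHOLDS = {"cpu_load": 90.0, "memory_used": 85.0, "disk_used": 80.0}


def conventional_filter(sample: dict) -> list[str]:
    """Return failure events found in one structured-data sample."""
    events = []
    # Compare each numeric metric against its fixed threshold.
    for metric, limit in THRESHOLDS.items():
        value = sample.get(metric)
        if value is not None and value > limit:
            events.append(f"{metric} {value} exceeds {limit}")
    # Simple running / not-running check on processes and services.
    for name, running in sample.get("services", {}).items():
        if not running:
            events.append(f"service {name} is not running")
    return events


sample = {
    "cpu_load": 95.0,
    "memory_used": 40.0,
    "disk_used": 50.0,
    "services": {"portal": True, "compute": False},
}
events = conventional_filter(sample)
```

A check of this kind sees only values and states that are reported in a fixed format, so an error that leaves every metric within bounds and every service running produces no event.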

In failure recovery 370, the operator who has confirmed a failure event recovers from it by resolving the system error, relying on standard failure-handling documents or on personal expertise. This conventional cloud operation management method has the following three problems.

The first problem is that simple collection and filtering of status information cannot reveal logical errors occurring inside the system, and so cannot determine whether a failure exists in the cloud platform 200, which is composed of the various cloud infrastructure layers 110 to 190 of FIG. 1 and the subsystems 210 to 270 of FIG. 2. With a logical error, no failure information at all is displayed in the monitoring system even though a failure inquiry (VoC) has actually been received from a customer. The operator must therefore search the status, configuration files, and log files of every system across the infrastructure layers 110 to 190 and the subsystems 210 to 270 to identify and recover from the failure, and depending on the operator's skill, recovery can take anywhere from several minutes to several hours.

The second problem is that even when a failure is detected and displayed on the web screen of the operation management system, it is shown in a simple table GUI. It is therefore difficult to locate where in the complex cloud platform, such as that of FIG. 3, the failure event occurred; and even if the location is found, it is also difficult to determine what effect the failure has across the infrastructure layers 110 to 190 and the subsystems 210 to 270 as a whole and whether customers are affected.

The third problem is that the log files currently produced by cloud platforms consist mostly of program error information written from the developer's point of view, so an operator without software and IT expertise has difficulty understanding what a message means and acting on it. Solving this would require urgently training and developing experts in IT, networking, and software, but that takes a great deal of time and money.

Korean Registered Patent No. 10-1636796 (2016.06.30.)

The present invention is intended to solve the conventional problems described above. Its object is to provide an apparatus and method that collect big data logs generated in a 5G distributed cloud system and analyze them into messages, detect failures from the analyzed messages through a rule engine, inform the administrator of the location and impact of each detected failure, and carry out failure recovery.

Another object is for the apparatus to convert big data comprising various log information in the cloud system, including hardware, operating system, cloud platform, and network traffic logs, into messages for failure detection.

Another object is for the apparatus to detect physical and logical failures by applying the converted messages to the rule engine and performing analysis that compares AND/OR conditions for failure detection.

Another object is for the apparatus to display detected failures on a graphical topology based on hardware, operating system, cloud platform, and network traffic, so that the operator is immediately informed of the location and impact of a failure, and to provide the results of automatic failure recovery together with a guide for manual recovery.

According to one aspect, an apparatus for managing failures using big data of a 5G distributed cloud system includes: a message conversion unit that converts big data logs generated in the infrastructure of the system into failure-related messages; a failure detection unit that detects failures by applying a rule engine to the converted messages; a failure graphic providing unit that, for on-screen display of each detected failure, generates topology-based graphic information for the infrastructure object in which the failure occurred and provides it to an operator terminal; and a failure recovery unit that generates failure recovery information for the detected failure and provides it to the operator terminal.

The message conversion unit collects and stores logs for each object of the cloud infrastructure, which includes hardware, operating system, cloud platform, and network platform objects.

The message conversion unit converts one or more log lines in each log file into a message for failure detection.

The message conversion unit stores log lines and converted messages using a structure that includes: the server name of the server from which the log was collected; the name of the file in which the log was recorded; a line ID identifying the message into which one or more log lines of the log were analyzed and converted; a message key composed of the server name, the file name, and the line ID of the message; and a line buffer in which each log line analyzed into the message is stored.
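The structure listed above can be sketched as a small Python data class. This is only an illustration under stated assumptions: the field names, the colon-separated key format, and the sample server name, file name, and log lines are hypothetical and are not taken from the specification or its figures.

```python
from dataclasses import dataclass, field


@dataclass
class LogMessage:
    """Hypothetical message structure; field names are assumptions."""
    server_name: str                 # server the log was collected from
    file_name: str                   # file the log was recorded in
    line_id: int                     # identifies the converted message
    line_buffer: list[str] = field(default_factory=list)  # analyzed log lines

    @property
    def message_key(self) -> str:
        # The specification says the key is composed of server name, file
        # name, and line ID; the ":" separator is an assumption.
        return f"{self.server_name}:{self.file_name}:{self.line_id}"


# Made-up sample values for illustration only.
msg = LogMessage("edge-01", "platform.log", 42)
msg.line_buffer.append("ERROR request failed")
msg.line_buffer.append("  caused by: connection refused")
```

Deriving the key from the other three fields, rather than storing it separately, keeps the key consistent with the fields it is composed of.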

The failure detection unit detects a filtering key from a message using a rule engine structure that includes: a rule group name covering one or more filtering rules; the filtering rules; the filter condition processed in each filtering rule; and a filtering key that is detected from the message and to which the filter condition is applied. If the filter condition applied to the detected filtering key is satisfied in the message, the failure detection unit detects that a failure has occurred.

The failure detection unit uses a rule engine structure that includes: a rule name containing the layer structure information of the infrastructure in which the failure is detected; a rule description explaining the failure; the name of the log file in which the failure is detected; a log level defining the level of the log; a case setting defining whether uppercase and lowercase are distinguished in log keywords; one or more filtering rules, each including AND and OR filter conditions and one or more keywords to which the filter conditions are applied, for detecting the failure from the log lines corresponding to the message; recovery processing information for automatic recovery, operator manual recovery, and third-party expert recovery, executed to recover from a failure detected by satisfying the filtering rules; and a failure level defining the severity of the failure.
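A rule of this shape, and the way AND/OR keyword conditions might be evaluated against a log line, can be sketched in Python as follows. All field names and sample values are hypothetical. The sketch also assumes that every filtering rule in a rule must be satisfied for a failure to be detected, which the specification does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class FilterRule:
    condition: str        # "AND": all keywords required; "OR": any keyword
    keywords: list[str]


@dataclass
class Rule:
    """Hypothetical rule structure; field names are assumptions."""
    name: str                  # carries the infrastructure layer information
    description: str
    log_file: str
    log_level: str
    case_sensitive: bool       # the "case" setting for keywords
    filters: list[FilterRule]
    recovery: dict[str, str]   # auto / manual / expert recovery information
    severity: str              # failure level


def matches(rule: Rule, line: str) -> bool:
    """Return True if the log line satisfies every filtering rule."""
    text = line if rule.case_sensitive else line.lower()
    for flt in rule.filters:
        keys = flt.keywords if rule.case_sensitive else [k.lower() for k in flt.keywords]
        hits = [k in text for k in keys]
        satisfied = all(hits) if flt.condition == "AND" else any(hits)
        if not satisfied:
            return False
    return True


# Made-up rule for illustration only.
rule = Rule(
    name="cloud-platform/compute/queue-timeout",
    description="Compute service cannot reach the message queue",
    log_file="platform.log",
    log_level="ERROR",
    case_sensitive=False,
    filters=[FilterRule("AND", ["error", "timeout"]),
             FilterRule("OR", ["broker", "queue"])],
    recovery={"auto": "restart-service", "manual": "operator guide", "expert": "notify"},
    severity="major",
)
detected = matches(rule, "ERROR Broker connection TIMEOUT")
```

Because the conditions operate on keywords in the log text rather than on numeric status values, a rule like this can flag an error that a threshold check on CPU, memory, or service state would not see.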

The apparatus further includes a graphic DB storing, for on-screen display, the objects of the infrastructure and the hierarchical connection information between objects. The failure graphic providing unit looks up, in the graphic DB, the objects and hierarchical connection information corresponding to the infrastructure requested by the operator terminal, generates graphic information from the hierarchical connection information between the retrieved objects, and provides the generated graphic information to the operator terminal.

If a failure exists in an object included in the graphic information displayed on the operator terminal, the failure graphic providing unit transmits the failure information to the operator terminal for on-screen display.
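The lookup and the failure marking described in the two paragraphs above can be illustrated with a minimal Python sketch. The in-memory dictionary standing in for the graphic DB, the object names, and the node/edge output format are all assumptions made for illustration; the specification does not define the storage or rendering format.

```python
# Hypothetical stand-in for the graphic DB: each object maps to the child
# objects it is hierarchically connected to. Names are made up.
GRAPHIC_DB = {
    "edge-cloud-1": ["server-1", "server-2"],
    "server-1": ["os-1"],
    "os-1": ["platform-1"],
    "server-2": [],
    "platform-1": [],
}


def build_graphic_info(root: str, failed: set[str]) -> dict:
    """Collect the objects under `root`, their links, and failure flags."""
    nodes, edges = [], []
    stack = [root]
    while stack:
        obj = stack.pop()
        nodes.append({"id": obj, "failed": obj in failed})
        for child in GRAPHIC_DB.get(obj, []):
            edges.append((obj, child))
            stack.append(child)
    return {"nodes": nodes, "edges": edges}


# The operator requests the topology under server-1; os-1 has a failure.
info = build_graphic_info("server-1", failed={"os-1"})
```

Returning the links alongside the flagged objects lets a screen show not just which object failed but which connected objects sit above and below it.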

The failure recovery unit provides the failure recovery information to the operator terminal and recovers from the failure by executing automatic recovery processing, operator manual recovery processing, and third-party expert recovery processing according to the operator's request.

The failure recovery unit, either automatically or at the operator's request, transmits an automatic recovery command to the infrastructure in which the failure occurred and provides the result of processing the command to the operator terminal.

If the automatic recovery fails, the failure recovery unit transmits an operator manual recovery command, at the operator's request, to the infrastructure in which the failure occurred and provides the result of processing the command to the operator terminal; if the automatic recovery or the operator manual recovery fails, it notifies an expert terminal of the failure information for third-party expert recovery.
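The escalation order described above (automatic recovery, then operator manual recovery, then notification of an expert) can be sketched in Python. The function name and the three callables are hypothetical. The sketch also simplifies the interaction with the operator: in the specification the manual command is sent at the operator's request, whereas here it is modeled as a callable that simply reports success or failure.

```python
from typing import Callable


def recover(
    failure: str,
    auto: Callable[[str], bool],            # sends the automatic recovery command
    manual: Callable[[str], bool],          # sends the operator manual recovery command
    notify_expert: Callable[[str], None],   # forwards failure info to the expert terminal
) -> str:
    """Return which recovery stage handled the failure (simplified sketch)."""
    if auto(failure):
        return "auto"
    if manual(failure):
        return "manual"
    # Both earlier stages failed: hand the failure over to a third-party expert.
    notify_expert(failure)
    return "expert"


notified: list[str] = []
stage = recover(
    "queue-timeout",
    auto=lambda f: False,      # automatic recovery fails in this example
    manual=lambda f: False,    # manual recovery fails as well
    notify_expert=notified.append,
)
```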

According to another aspect, a method executed by an apparatus for managing failures using big data of a 5G distributed cloud system includes: converting big data logs generated in the infrastructure of the system into failure-related messages; detecting failures by applying a rule engine to the converted messages; for on-screen display of each detected failure, generating topology-based graphic information for the infrastructure object in which the failure occurred and providing it to an operator terminal; and generating failure recovery information for the detected failure and providing it to the operator terminal to recover from the failure.

According to one aspect of the present invention, physical and logical failures occurring in the layer structure of the cloud system and in the subsystems of the cloud platform can be detected through message analysis of big data logs and rule-engine-based failure analysis.

Detected failures are displayed on a graphical topology in which the objects constituting the infrastructure of the cloud system are laid out, so that the operator can immediately grasp the location and impact of a failure while navigating between objects interconnected by the layer structure of the cloud system and the subsystems of the cloud platform.

On the failure display screen, failure recovery guides for a detected failure can be provided at the same time, divided into automatic recovery, operator manual recovery, and third-party expert recovery.

The following drawings attached to this specification illustrate preferred embodiments of the present invention and, together with the detailed description below, serve to further the understanding of the technical idea of the present invention; the present invention should therefore not be construed as limited to the matters shown in the drawings.
FIG. 1 is a schematic layer structure diagram of a conventional distributed cloud system for 5G service.
FIG. 2 is a schematic configuration diagram of the cloud platform that is the software of the distributed cloud system of FIG. 1.
FIG. 3 is a schematic flowchart of conventional operation management processing for the distributed cloud system of FIG. 1.
FIG. 4 is a schematic configuration diagram of a 5G distributed cloud system that manages failures according to an embodiment of the present invention.
FIG. 5 is a schematic flowchart of the failure management service provided by the central manager of FIG. 4.
FIG. 6 is an exemplary diagram of code for log collection and storage in the central cloud of FIG. 4.
FIG. 7 is an exemplary diagram of the structure with which the central manager of FIG. 4 analyzes the log lines of collected log files and converts them into messages.
FIG. 8 is an exemplary diagram of the rule engine format with which the central manager of FIG. 4 analyzes messages to detect failures.
FIG. 9 is an exemplary diagram of the tree with which the central manager of FIG. 4 processes the filtering rules of the rule engine.
FIG. 10 is an exemplary diagram of a failure management screen generated by the central manager of FIG. 4 based on the service topology of the cloud platform.
FIG. 11 is a schematic flowchart of a method for managing failures in a 5G distributed cloud system according to an embodiment of the present invention.
FIG. 12 is a detailed flowchart of the step in FIG. 11 of analyzing logs and converting them into messages.
FIG. 13 is a detailed flowchart of the step in FIG. 11 of generating and providing graphic information for failure management.
FIG. 14 is a detailed flowchart of the step in FIG. 11 of recovering from a failure and providing the processing result.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Before that, the terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings; rather, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way, they should be interpreted with meanings and concepts consistent with the technical idea of the present invention. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical idea; it should be understood that various equivalents and modifications that could replace them may exist at the time of filing this application.

FIG. 4 is a schematic configuration diagram of a 5G distributed cloud system 400 that manages failures according to an embodiment of the present invention.

The 5G distributed cloud system 400 providing a failure management service according to an embodiment of the present invention includes a central cloud 410 and a plurality of edge clouds 430.

The central cloud 410 is the apparatus of the present invention and centrally manages the edge clouds 430 installed in the regional central offices of each region. When the central cloud 410 is built as a server device having a failure management function, it can provide the failure management service to each of the edge clouds 430.

The central manager 411 installed in the central cloud 410 provides the failure management service by performing data communication with the edge manager agent 431 of each edge cloud 430. The central cloud 410 also provides the failure management service by serving a web UI to the operator terminal. The failure management service includes analysis of log data and conversion into messages, failure detection from the converted messages, generation and provision of a topology graphic screen for detected failure events, and screen GUI-based processing.

The edge cloud 430 includes an edge manager agent 431, an edge cloud controller 432, hardware 433, an operating system 434, a cloud platform 435, an NFV zone 436, and an IT zone 437.

The edge manager agent 431 communicates with the central manager 411 and, in the course of providing cloud services to subscribers, receives the failure management service according to the present invention from the central manager 411. Within this service, the edge manager agent 431 collects the cloud's logs and transmits them to the central cloud 410; when a failure occurs, it receives a failure recovery command from the central cloud 410, executes it, and reports the execution result to the central cloud 410.

The edge cloud controller 432 provides cloud services by controlling the hardware 433, the operating system 434, and the cloud platform 435. The hardware 433 includes servers, storage, and network equipment. The operating system 434 is loaded on the hardware 433 and supports execution of the cloud platform 435.

Execution of the cloud platform 435 creates the NFV zone 436 and the IT zone 437, which are resource groups providing the infrastructure for 5G service, and the 5G cloud service is thereby provided to subscribers. In the cloud service, the NFV zone 436 provides the infrastructure for delivering network communication services to subscribers, and the IT zone 437 provides the infrastructure for delivering application services to subscribers.

FIG. 5 is a schematic flowchart of the failure management service provided by the central manager 411 of FIG. 4. FIG. 6 is an exemplary diagram of code for log collection and storage in the central cloud 410 of FIG. 4. FIG. 7 is an exemplary diagram of the structure with which the central manager 411 of FIG. 4 analyzes the log lines of collected log files and converts them into messages. FIG. 8 is an exemplary diagram of the rule engine format with which the central manager 411 of FIG. 4 analyzes messages to detect failures. FIG. 9 is an exemplary diagram of the tree with which the central manager 411 of FIG. 4 processes the filtering rules of the rule engine. FIG. 10 is an exemplary diagram of a failure management screen that the central manager 411 of FIG. 4 generates, based on the service topology of the cloud platform, to display detected failures. The following description refers to FIGS. 5 to 10.

Referring to FIG. 5, the central manager 411 includes a message conversion unit 511, a failure detection unit 513, a failure graphic providing unit 515, and a failure recovery unit 517.

Assuming that the central cloud 410 is a server device including a storage medium, a memory (e.g., RAM), and a processor, each of the components 511 to 517 is installed on the storage medium in the form of a program, and when executed, the program is loaded into the memory and processed by the processor.

The following describes the operational management flow (① to ⑧) of the failure management service that the components 511 to 517 carry out in order to solve the first to third problems described above.

The message conversion unit 511 collects the big-data logs generated by the objects of each edge cloud 430 constituting the infrastructure of the distributed cloud system 400 (①), and converts the collected logs into messages for failure analysis (②).

Here, the message conversion unit 511 collects, in real time, all log information from the hardware, operating system, and cloud platform constituting each edge cloud 430, together with all network traffic information from the network platform, and stores it in a storage-based large-capacity message queue. The central cloud 410, acting as a storage server, may store the logs that the message conversion unit 511 collects over the network.

Referring to FIG. 6, widely used open-source software may be used for data collection and storage. Taking the widely used logstash as an example, a config file that defines the input file path 610 of the log file and the output bootstrap server 630, which corresponds to the storage server, is created and then executed. Once the config file is running in each edge cloud 430, whenever a monitored input log file changes, the corresponding log is stored in the storage message queue of the central cloud 410.

The present invention uses a storage-based message queue rather than the commonly used memory-based message queue because a storage-based large-capacity message queue such as Apache Kafka guarantees adequate speed when storing and reading big data while providing sufficient scalability. A memory-based message queue offers better performance, but its data may be lost when a failure occurs, and scalability problems may arise when it is applied to large volumes of data.

Referring to FIG. 7, the message conversion unit 511 of the central manager 411 analyzes the log lines of the collected log files, converts them into messages, and stores the converted messages by means of the message structure 700.

The log file includes a plurality of log lines. A log file is generally written as plain lines, and reading and analyzing a single line in isolation rarely yields a meaningful result. A process that converts individual lines into meaningful messages is therefore required, and for this purpose the message conversion unit 511 uses the following three message conversion rules.

The first message conversion rule covers the case in which log lines and messages are converted 1:1, that is, a single line is by itself a message that is meaningful for failure analysis. The second message conversion rule covers the case in which log lines and messages are converted n:1, that is, several lines together form one message that is meaningful for failure analysis. The third message conversion rule also converts log lines and messages n:1, like the second rule, but here every individual log line carries the same message header (2017-05-15 15:48:43.768 2097 ERROR nova.compute.manager).

The message structure 700, the data structure shown in FIG. 7, includes fields for a server name (hostname), a log file name (logname), a line ID (line_id), a message key (msg_key), and a line buffer (msg_line_buf).

The server name uniquely identifies the physical server, within the infrastructure of the edge cloud 430, that generates the log. For example, in "cnode-01.dj-01-01.kt", 'kt' denotes the operator, 'dj-01-01' denotes rack 01 of cloud 01 installed at the Daejeon central office, and 'cnode-01' denotes computing server 01 installed in that rack (01). The log file name is the name of the log file being analyzed. The line ID is an identifier used, when the log lines of a log file are analyzed, to recognize lines that belong to the same failure message. It is taken from the header of the log format used by most software (date, time, level, module name), which may differ depending on the type of log file. In other words, the line ID serves as a message ID. The message key is an identifier (identification key) that combines the server name, the log file name, and the line ID (hostname+logname+line_id) in order to uniquely identify the message of the line currently being analyzed among all log files generated across all edge clouds 430. The line buffer is the collection of log lines that share the same message key; once a message is complete, it holds all the log lines of that message.
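The fields of the message structure can be pictured with a short Python sketch. It is only an illustration: the class name `LogMessage`, the helper `extract_line_id`, the `|` separator in the key, the regular expression for the header, and the file name `nova-compute.log` are assumptions rather than details taken from FIG. 7. The header pattern is modeled on the example header quoted in the description.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed header layout: date, time, process id, level, module name.
HEADER_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ \d+ [A-Z]+ \S+"
)


def extract_line_id(line: str) -> Optional[str]:
    """Return the log header used as line_id, or None for a continuation line."""
    match = HEADER_RE.match(line)
    return match.group(0) if match else None


@dataclass
class LogMessage:
    hostname: str          # physical server that produced the log
    logname: str           # log file under analysis
    line_id: str           # header shared by the lines of one message
    msg_line_buf: List[str] = field(default_factory=list)

    @property
    def msg_key(self) -> str:
        # hostname + logname + line_id; the "|" separator is an assumption
        return "|".join((self.hostname, self.logname, self.line_id))


header = "2017-05-15 15:48:43.768 2097 ERROR nova.compute.manager"
msg = LogMessage("cnode-01.dj-01-01.kt", "nova-compute.log", header)
msg.msg_line_buf.append(header + " Traceback (most recent call last):")
```

Deriving `msg_key` from the other three fields, instead of storing it separately, keeps the key consistent with the fields it is built from.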

The message conversion unit 511 then reads the log lines of each log file one at a time from the DB of the storage server, applies the first to third message conversion rules to them, and performs message conversion by storing in the message structure 700 information such as the message identified by the line ID and the log lines held in the line buffer.

In FIG. 5, the failure detection unit 513 refers to the converted message through the message structure 700, applies the rule engine to the message (③), and detects a failure through the processing of the rule engine (④).

FIG. 8 shows the format of the rule engine applied by the failure detection unit 513 of the central manager 411. The format defines the rule engine structure 800 using the hierarchy [filter_rule_group->filter_rule->filter_condition->filter_key]. In this hierarchy, filter_rule_group is the first layer and corresponds to the rule group name. filter_rule is the second layer, directly below the first, and corresponds to a filtering rule belonging to the rule group of the first layer. filter_condition is the third layer, directly below the second, and corresponds to a filter condition that is evaluated within the filtering rule of the second layer. filter_key is the fourth layer, directly below the third, and corresponds to a filtering key that is compared under the filter condition of the third layer.

The rule engine structure 800 includes fields for a rule name (name), a rule description (desc), a log file name (logfile), a log level (loglevel), a case setting (case), a filtering rule (body) containing a filter condition (cond) and keywords (keyword), recovery processing information (shooter), and a failure level (fltlevel).

The rule name (name), "cloud.openstack.system.quota_exceed", names a failure analysis rule of the rule engine; it uses a hierarchy separated by '.' so that rule names are systematic and easy to understand. The rule description (desc) briefly describes the failure detected by the rule engine so that the operator can understand it easily, and may be provided to the operator terminal 530. The log file name (logfile) is the name of the log file from which the messages subject to this rule were converted. That is, the failure detection unit 513 applies to a message the filtering rules of the rule engine structure 800 whose log file name matches the log file name of the converted message. The log level (loglevel) is a log level defined by the administrator; the highest log level marks a log as important for failure analysis. The case setting (case) indicates whether keyword searches in the message distinguish between upper and lower case. When the rule engine filters messages, these three items (log file, log level, and case) serve as the basic conditions.

In the rule engine structure 800, the rule group name of the first layer is "openstack_expert_trouble_map". This first-layer rule group has two second-layer filtering rules, named "cloud.openstack.system.quota_exceed" and "cloud.openstack.system.no_valid_host".

Here, the second-layer filtering rule "cloud.openstack.system.quota_exceed" has two third-layer filter conditions, defined as "01" and "02". The filter conditions are applied in order, and the "02" filter condition is applied only if the preceding "01" filter condition is satisfied. The "01" filter condition uses two fourth-layer filtering keys, "exception" and "quota", with the case setting "ignore", and combines them with the "and" condition. That is, if the rule engine of the failure detection unit 513 finds both the "exception" and "quota" keywords in the log message, it judges the "01" filter condition to have succeeded. If the two keywords are not both found, the "01" filter condition fails and the "02" condition is skipped. If both the "01" and "02" filter conditions succeed, a failure has been detected through the message analysis of the rule engine.
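The sequential evaluation of filter conditions described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the format shown in FIG. 8: the dictionary layout, the function names, the `logfile` and `loglevel` values, and the keywords of condition "02" are hypothetical. Only the rule name, the "01" keywords, the "and" condition, and the "ignore" case setting come from the description.

```python
# Hypothetical rule; only name, case and condition "01" follow the description.
RULE = {
    "name": "cloud.openstack.system.quota_exceed",
    "logfile": "nova-api.log",      # assumed value
    "loglevel": "ERROR",            # assumed value
    "case": "ignore",
    "body": {
        "01": {"cond": "and", "keyword": ["exception", "quota"]},
        "02": {"cond": "or", "keyword": ["instances", "cores"]},  # assumed
    },
}


def condition_holds(cond: dict, text: str, case: str) -> bool:
    """Check one filter condition ('and' / 'or') against the message text."""
    keywords = cond["keyword"]
    if case == "ignore":
        text = text.lower()
        keywords = [k.lower() for k in keywords]
    hits = [k in text for k in keywords]
    return all(hits) if cond["cond"] == "and" else any(hits)


def rule_matches(rule: dict, message: str) -> bool:
    """Apply the conditions in order; all() stops at the first one that fails."""
    return all(
        condition_holds(rule["body"][cid], message, rule["case"])
        for cid in sorted(rule["body"])
    )
```

Because `all()` consumes the generator lazily, condition "02" is never evaluated when "01" fails, which mirrors the skipping behavior described for the rule engine.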

The recovery processing information (shooter) is the address of the program that is executed to recover from the failure when the message under analysis satisfies the filter conditions of the filtering rule. If a failure occurs for which no recovery processing information exists, a third-party expert must be contacted for assistance. The failure level (fltlevel) is the failure level assigned by the administrator who defined the filtering rule. For example, failure levels may be classified as WR (Warning), ER (Error), and CR (Critical).

In analyzing a message, the failure detection unit 513 reads all log data contained in the message, such as numerical values and states, applies the various filtering rules defined by the administrator in real time to detect failure events, and stores the results in a message queue. These failure analysis filtering rules are maintained as a knowledge management system so that they can be continuously updated and extended, which makes them an important operational management asset.

FIG. 9 shows, as a tree structure, the filtering rules of the rule engine processed by the failure detection unit 513. Each time the message conversion unit 511 generates a message by analyzing a log file, the failure detection unit 513 reads the generated message from the message queue. The failure detection unit 513 loads the filtering rule group of the rule engine structure 800 whose log file name matches the log file name of the message. The loaded filtering rule group can be represented as a tree. If the filtering rule group has changed, it is read again and applied dynamically; if it has not changed, the initially loaded content can be used.
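The reload-on-change behavior can be illustrated with a small cache. The description does not say how a change to a rule group is detected, so the sketch below assumes the rule group lives in a file and compares the file modification time; the class name `RuleGroupCache` and its `load` and `stamp` parameters are hypothetical.

```python
import os
from typing import Any, Callable, Dict, Tuple


class RuleGroupCache:
    """Reload a rule group only when its backing file appears to have changed."""

    def __init__(self, load: Callable[[str], Any],
                 stamp: Callable[[str], float] = os.path.getmtime):
        self._load = load      # hypothetical parser for a rule group file
        self._stamp = stamp    # change detector; file mtime by default (assumed)
        self._cache: Dict[str, Tuple[float, Any]] = {}

    def get(self, path: str) -> Any:
        current = self._stamp(path)
        cached = self._cache.get(path)
        if cached is None or cached[0] != current:
            # First use, or the file changed: read it again.
            self._cache[path] = (current, self._load(path))
        return self._cache[path][1]
```

Passing the change detector in as a parameter keeps the sketch independent of the storage actually used for rules, for example a database version column instead of a file timestamp.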

In the tree structure, the rule group of the first layer is the root node and has the filtering rules of the second layer as its child nodes. Each filtering rule node of the second layer is in turn a parent node with the filter conditions of the third layer as its child nodes. Likewise, each filter condition node of the third layer is a parent node with the filtering keys of the fourth layer as its child nodes. Traversing the tree corresponds to the filtering process of the rule engine.

In the tree traversal, the node 901 of the filtering rule group is visited first. The filtering rule group node 901 has first and second filtering rule nodes 902 and 910. The first filtering rule node 902 is visited first, and the first filtering rule is applied to the log message. The filtering rule node 902 has first and second filter condition nodes 903 and 907, which are applied one at a time starting with the first filter condition node 903. The 'and' or 'or' condition of the first filter condition node 903 is applied to the keywords of its child nodes, the first to third filtering key nodes 904, 905, and 906, to evaluate the log message. That is, it is determined whether the filter condition over the three keywords succeeds for the log message. If filtering succeeds at the first filter condition node 903, the filter condition of the second filter condition node 907 is applied to the keywords of its child nodes, the first and second filtering key nodes 908 and 909, to evaluate the log message. If filtering succeeds at both the first and second filter condition nodes 903 and 907, a failure has been detected by the first filtering rule at the first filtering rule node 902. When filtering at the first filtering rule node 902 is complete, the second filtering rule node 910 is visited and the filtering of its nodes 911 to 918 is applied.
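The traversal can be sketched as a recursive walk over a four-level tree. The `Node` class, the function names, and all keywords are assumptions made for illustration; only the group and rule names and the order of evaluation follow the description. The match is case-sensitive here to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str    # "group", "rule", "cond" or "key"
    value: str   # group/rule name, "and"/"or", or a keyword
    children: List["Node"] = field(default_factory=list)


def visit(node: Node, message: str) -> bool:
    """Evaluate a rule, condition or key node against the message."""
    if node.kind == "key":
        return node.value in message
    if node.kind == "cond":
        results = (visit(key, message) for key in node.children)
        return all(results) if node.value == "and" else any(results)
    # A rule succeeds only if every condition succeeds, checked in order.
    return all(visit(cond, message) for cond in node.children)


def detect(group: Node, message: str) -> List[str]:
    """Visit each rule under the group and return the names of those that match."""
    return [rule.value for rule in group.children if visit(rule, message)]


tree = Node("group", "openstack_expert_trouble_map", [
    Node("rule", "cloud.openstack.system.quota_exceed", [
        Node("cond", "and", [Node("key", "exception"), Node("key", "quota")]),
        Node("cond", "or", [Node("key", "instances"), Node("key", "cores")]),
    ]),
    Node("rule", "cloud.openstack.system.no_valid_host", [
        Node("cond", "and", [Node("key", "no valid host")]),
    ]),
])
```

Calling `detect(tree, message)` visits the rule nodes from left to right, as in the traversal of nodes 902 and 910, and collects every rule whose conditions all succeed.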

In FIG. 5, the failure graphic providing unit 515 receives a request for the failure management screen from the operator terminal 530 (⑤), generates topology-based graphic information for the objects constituting the cloud infrastructure in order to display the requested screen (⑥), and provides the graphic information for displaying the failure management screen to the operator terminal 530 (⑦).

Referring to FIG. 10, the operator terminal 530 displays a failure management screen 1000 built, on the basis of the cloud infrastructure topology, from nine infrastructure objects and their connection information.

To generate the screen requested by the operator, the failure graphic providing unit 515 refers to a graphic DB that stores the objects of the cloud infrastructure and the hierarchical connection information between them. The failure graphic providing unit 515 then turns the hierarchical connection information retrieved from the graphic DB, together with the failure information, into topology-based graphic information and provides it to the operator terminal 530 as screen display information. The topology-based graphic information indicates the connections between the constituent objects that make up the infrastructure hierarchy of the distributed cloud system 100 shown in FIG. 1.
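As a rough illustration of how hierarchical connection information and failure information might be combined into display data, the following Python sketch builds a node list and an edge list. The schema of the graphic DB is not given in the description, so the parent-link dictionary, the function name `build_topology_view`, the output layout, and the object names are all assumptions.

```python
from typing import Dict, Optional, Set


def build_topology_view(parents: Dict[str, Optional[str]],
                        failed: Set[str]) -> dict:
    """Combine parent links and failed objects into nodes and edges for display."""
    nodes = [{"id": obj, "failed": obj in failed} for obj in sorted(parents)]
    edges = [{"from": parent, "to": obj}
             for obj, parent in sorted(parents.items()) if parent is not None]
    return {"nodes": nodes, "edges": edges}


# Hypothetical three-level hierarchy with one failed server.
view = build_topology_view(
    {"edge-cloud": None, "rack-01": "edge-cloud", "cnode-01": "rack-01"},
    failed={"cnode-01"},
)
```

A terminal-side renderer could draw `edges` as the hierarchy and highlight every node whose `failed` flag is set.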

Here, the failure graphic providing unit 515 provides detailed diagrams of the infrastructure, platform, and customer services as graphic information, so that the operator can pinpoint where among the infrastructure objects a failure has occurred and analyze its cause easily. The operator terminal 530 then generates the failure management screen from the GUI of the graphic information. The failure management screen displays summary information for each object and offers an integrated operating environment in which detailed logs and historical data related to failure events can be looked up easily.

After the operator terminal 530 displays the failure management screen 1000 generated from the received graphic information, the operator can select an object shown on the screen to request the screen of its upper or lower object from the failure graphic providing unit 515, and the operator terminal 530 generates and displays that screen from the corresponding screen display information returned by the failure graphic providing unit 515. In addition, whenever a failure occurs, the failure graphic providing unit 515 generates failure information and screen display information indicating the failure and transmits them to the operator terminal 530 in real time. That is, whenever a failure is detected in an object shown on the screen, it appears on the failure management screen of the operator terminal 530 in real time. The operator can then request detailed information and recovery information for the failure shown on the screen, and the operator terminal 530 displays the corresponding screen information returned by the failure graphic providing unit 515 in response to the request.

In FIG. 5, the failure recovery unit 517 generates recovery guide information for the detected failure and provides it to the operator terminal 530 (⑧). The operator issues a recovery command with reference to the received recovery guide information, and the failure recovery unit 517 transmits the recovery command to the edge manager agent 431 of the edge cloud 430 in which the failure occurred. The failure recovery unit 517 receives the recovery result from the edge manager agent 431 and transmits it to the operator terminal 530.

Here, the failure recovery unit 517 may provide the rule description of the structure 800, detailed log data, and similar information about the failure to a dedicated failure recovery window on the operator terminal 530. Failure recovery may proceed in various ways depending on the administrator's failure recovery policy. For example, automatic recovery may be attempted first, manual recovery by the operator second, and recovery by a third-party expert third. If the defined failure recovery policy allows automatic recovery, the failure recovery unit 517 follows that procedure, communicating bidirectionally with the operator terminal 530 while executing the automatic recovery command, and the operator terminal 530 displays the recovery result on the screen. If the failure cannot be recovered automatically, a failure recovery guide is displayed step by step on the screen so that the operator can recover from the failure directly. Finally, if both automatic recovery and operator recovery fail, or if the failure is preconfigured for expert recovery, the third-party expert is notified so that recovery can proceed.
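The staged recovery policy given as an example above can be sketched as follows. The function name `recover` and its parameters are hypothetical; the callables stand in for the automatic recovery program, the operator-driven procedure, and the notification to the third-party expert, none of which are specified at code level in the description.

```python
from typing import Callable, Optional


def recover(fault: str,
            auto: Optional[Callable[[str], bool]] = None,
            manual: Optional[Callable[[str], bool]] = None,
            notify_expert: Callable[[str], None] = print,
            expert_only: bool = False) -> str:
    """Try automatic recovery, then operator recovery, then escalate to an expert."""
    if not expert_only:
        if auto is not None and auto(fault):
            return "auto"
        if manual is not None and manual(fault):
            return "manual"
    # Both stages failed or were unavailable, or the fault is preset for expert recovery.
    notify_expert(fault)
    return "expert"
```

For example, `recover("quota_exceed", auto=lambda f: False, manual=lambda f: True)` returns `"manual"`, since the automatic stage fails and the operator stage succeeds.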

FIG. 11 is a schematic flowchart of a method for managing failures in the 5G distributed cloud system 400 according to an embodiment of the present invention. The description given above with reference to FIGS. 5 to 10 also applies to the failure management method described below.

The central cloud 410 collects the big-data logs generated by each object constituting the infrastructure of the system 400 (S1111). Once the logs have been collected in the storage server, the central cloud 410 analyzes the logs stored there and converts them into messages (S1113).

When a converted message has been produced by log analysis, the central cloud 410 processes the filtering rules of the rule engine to detect failures (S1121). With messages converted from the big-data logs and the filtering rules of the rule engine, not only physical errors of the system 400 but also logical errors can be detected as failures.

Thereafter, when the operator requests the failure management screen at the operator terminal 530, the central cloud 410 receives the operator's screen request (S1131). The central cloud 410 generates screen display information for the requested screen based on the topology of the cloud infrastructure, adds the information on failures that have occurred, and provides the result to the operator terminal 530 (S1133). In addition, when a failure occurs in the system 400, the central cloud 410 provides the failure information to the operator terminal 530 so that it can be displayed in real time (S1135).

Here, the operator terminal 530 generates and displays the failure management screen from the screen display information provided by the central cloud 410. The failure management screen shows the infrastructure objects constituting the system 400 and the connection information between them as a hierarchical association structure. The operator can thus move on the screen to the upper objects, lower objects, and connected objects of each object while requesting and receiving detailed failure information.

When the operator requests recovery on the failure management screen, the central cloud 410 receives the operator's recovery request from the operator terminal 530 (S1137). The operator may request automatic recovery, request recovery by sending manual commands, or request recovery by a third-party expert. When the recovery request is received, the central cloud 410 performs recovery by transmitting the corresponding recovery command to the edge cloud 430, and transmits the recovery result received from the edge cloud 430 to the operator terminal 530 (S1139). If recovery by a third-party expert is requested, the central cloud 410 notifies the expert's terminal of the failure information and requests recovery.

FIG. 12 is a detailed flowchart of the step (S1113) in FIG. 11 of analyzing the logs and converting them into messages.

First, the basic log files of the cloud system 400 can be classified into infrastructure logs, which provide hardware-related logs; system logs, which are provided by the operating system running on the hardware; and platform logs, which provide the most cloud-related information. The central cloud 410 places all log lines of each log file held in the storage server into a message queue for log analysis.

The central cloud 410 reads a log line from the message queue and determines whether a log line to be analyzed exists (S1211). If a log line exists, the central cloud 410 identifies where the log line originated using the server name (hostname) and log file name (logname) (S1212), parses the log line, for example by removing whitespace and unnecessary lines (S1213), and determines whether the line contains a valid line ID (line_id) (S1214).

If a valid line ID exists and the line is the first log line since the conversion program started (S1215), the central cloud 410 initializes the fields of the message structure 700, adds the line to the line buffer (msg_line_buf) (S1225), and reads the next line from the message queue (S1211).

If the log line does not have a valid line ID, it carries no log message header and only supplies additional content for the preceding log line. The line is therefore added to the line buffer (msg_line_buf) created with the message key (msg_key) of the preceding line (S1224), and the next line is read from the message queue (S1211).

If the log line has a valid line ID and a message key composed from this line ID already exists (S1216), the line is added to the line buffer keyed by that message key (S1224).

If the log line has a valid line ID and no message key composed from this line ID exists, the message made up of the log lines under the previous message key is judged to be complete. The lines stored in the line buffer are combined into a single message, and the resulting message is stored in the queue (log_message) (S1217). Once the message is stored, the message structure 700 is initialized and the line is added to the line buffer, as in step S1225. The next line is then read from the message queue (log_line) (S1211).

In addition, the log lines of one message group are usually generated within a short time. If a log line has been read and its values stored in the message structure 700 but no new log line is read for a long time, the contents of the message structure 700 are not processed promptly and may remain waiting for a long time. Accordingly, if the message queue of log lines is read three times in succession and no log line is found (S1221), the data stored in the message structure 700 is judged to be an already completed message, and the message completion and storage of step S1217 are performed.
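The conversion flow of FIG. 12 can be illustrated with a minimal Python sketch. It is not the patented implementation: the header pattern `req-<number>`, the `:` separator in the message key, and the names `MessageStruct`, `MessageAssembler`, and `feed` are assumptions made for illustration, and only the key of the message currently being assembled is tracked. An empty read of the queue is modeled by passing `None`.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: a line that starts a message begins with an ID such as "req-42".
LINE_ID_RE = re.compile(r"^(req-\d+)\s")


@dataclass
class MessageStruct:
    """Simplified stand-in for the message structure 700."""
    msg_key: str
    msg_line_buf: List[str] = field(default_factory=list)


class MessageAssembler:
    MAX_EMPTY_READS = 3  # S1221: three consecutive empty reads complete the message

    def __init__(self, hostname: str, logname: str) -> None:
        self.hostname = hostname          # S1212: where the line came from
        self.logname = logname
        self.current: Optional[MessageStruct] = None
        self.empty_reads = 0
        self.log_message: List[str] = []  # queue of completed messages

    def _complete(self) -> None:
        # S1217: join the buffered lines into one message and store it.
        if self.current is not None:
            self.log_message.append("\n".join(self.current.msg_line_buf))
        self.current = None

    def feed(self, raw: Optional[str]) -> None:
        if raw is None:                   # nothing was read from the queue
            self.empty_reads += 1
            if self.empty_reads >= self.MAX_EMPTY_READS:
                self._complete()
            return
        self.empty_reads = 0
        line = raw.strip()                # S1213: trim whitespace
        if not line:
            return                        # S1213: drop unnecessary (blank) lines
        match = LINE_ID_RE.match(line)    # S1214: is there a valid line ID?
        if match is None:
            # No header: the line continues the preceding message (S1224).
            if self.current is not None:
                self.current.msg_line_buf.append(line)
            return
        key = f"{self.hostname}:{self.logname}:{match.group(1)}"
        if self.current is not None and self.current.msg_key == key:
            self.current.msg_line_buf.append(line)  # S1216 -> S1224
            return
        self._complete()                  # new key: previous message is done
        self.current = MessageStruct(key, [line])   # S1225
```

Feeding the lines of two requests followed by three empty reads leaves two completed messages in `log_message`, the second one completed by the empty-read rule rather than by the arrival of a new key.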

FIG. 13 is a detailed flowchart of the step (S1133) of FIG. 11 in which graphic information for failure management is generated and provided.

The operator terminal 530 connects to the central cloud 410 over the web and, in order to monitor the cloud system 400 through that connection, requests the central cloud 410 to display the failure management screen (S1301).

Upon receiving the request, the central cloud 410 loads a class that automatically generates the graphic data structures suited to the topology (infrastructure, platform, or customer service) of the screen requested by the operator (S1311). The central cloud 410 generates the graphic objects needed to represent the graphic topology and the link information connecting those objects (S1312). Once the graphic object and link information has been generated, the central cloud 410 determines the screen size from it and creates a screen size data structure in which the determined screen size is stored (S1313). That is, because the number of objects and connection links to display differs with the failure management screen the user requests, the object and link information to display is queried from the DB, and the screen size is determined from the query results.

When the screen size information, which includes the screen ID (id) and the screen size (width/height), has been stored in the screen size data structure, the central cloud 410 creates a screen object data structure that holds the layout information of the graphic objects making up the screen, consisting of each object's ID, title, graphic object type, display constraints, and so on (S1314). It then creates a link data structure that stores, for each connection link joining these screen objects, the screen link identifier, link start identifier, link end identifier, arrow direction, and line type (S1315). The IDs used in the screen size, screen object, and screen link data structures are used under the same names in the failure information, so that failures can be represented on the topology. The central cloud 410 transmits the graphic information of the generated screen size, screen object, and screen link data structures to the operator terminal 530 (S1316).

To describe the screen data structures with reference to FIG. 10: in the failure guidance screen of FIG. 10, the screen is displayed at the size given by the screen size data structure, nine graphic object nodes are displayed according to the screen object data structure, and the connection lines between the nine objects are displayed according to the screen link data structure.
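As an illustration, the three screen data structures might be modeled as follows. This is a sketch under stated assumptions: the field names, the default values, and the grid rule used to derive the screen size from the number of objects (`columns` and `cell`) are hypothetical, since the description states only that the size is determined from the queried object and link information.

```python
import math
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class ScreenSize:
    id: str        # screen ID
    width: int
    height: int


@dataclass
class ScreenObject:
    id: str        # the same ID is reused in failure information
    title: str
    kind: str      # graphic object type
    constraint: str = ""  # display constraint


@dataclass
class ScreenLink:
    id: str
    start_id: str  # link start identifier
    end_id: str    # link end identifier
    arrow: str = "forward"     # arrow direction
    line_type: str = "solid"


def build_screen(screen_id: str, objects: List[ScreenObject],
                 links: List[ScreenLink], columns: int = 3,
                 cell: int = 120) -> Dict[str, object]:
    """Derive a screen size from the object count (assumed grid layout)
    and bundle the three structures for transmission (S1313 to S1316)."""
    rows = math.ceil(len(objects) / columns)
    size = ScreenSize(screen_id, columns * cell, rows * cell)
    return {
        "size": asdict(size),
        "objects": [asdict(o) for o in objects],
        "links": [asdict(link) for link in links],
    }
```

With nine objects and the default three columns, the sketch yields a three-by-three grid, which loosely mirrors the nine-node screen of FIG. 10.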

The operator terminal 530 generates and displays the failure management screen using the screen data structures received from the central cloud 410 (S1321). The operator terminal 530 requests failure information from the central cloud 410 at a predetermined interval (S1322).

When asked to provide failure information, the central cloud 410 checks whether failure information exists (S1331) and, if it does, transmits in real time the failure information that can be displayed based on the screen information (S1332). The operator terminal 530 displays the failure information received from the central cloud 410 on the failure management screen (S1341).
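Because failure information reuses the IDs of the screen data structures, selecting the failures that a given screen can display reduces to an ID match. The sketch below assumes each failure record is a dictionary with an `id` key; the function name and the record layout are hypothetical.

```python
from typing import Dict, Iterable, List


def displayable_failures(screen_ids: Iterable[str],
                         failures: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only failures whose ID matches an object or link on the screen."""
    visible = set(screen_ids)
    return [f for f in failures if f["id"] in visible]
```

An empty result corresponds to the case in which no failure information exists for the screen, so nothing is sent for that polling cycle.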

FIG. 14 is a detailed flowchart of the step (S1139) of FIG. 11 in which a failure is recovered and the processing result is provided.

After the failure information display step (S1341) of FIG. 13, when the operator presses the failure recovery button to recover a failure displayed on the screen, the operator terminal 530 displays a summary description that makes the failure easy to understand, and requests failure recovery guidance from the central cloud 410.

In response to the operator's failure recovery request, the central cloud 410 receives a request for failure recovery guidance from the operator terminal 530 (S1411). Having received the guidance request, the central cloud 410 looks up the recovery processing information field of the rule engine structure 800 for the requested failure (S1412). The recovery processing information may be any one of automatic recovery, operator manual recovery, and third-party expert recovery. If it indicates automatic recovery (S1413), the central cloud 410 transmits an automatic recovery screen to the operator terminal 530, asking the operator to choose to execute automatic recovery (S1414), and receives the request to execute automatic recovery from the operator terminal 530 (S1415).

When the failure recovery command is issued for the first time to an edge cloud 430 that is not yet connected (S1421), the central cloud 410 connects to the edge cloud 430 (S1422) and transmits the automatic recovery command to the edge manager agent 431 (S1423). When failure recovery is executed on an edge cloud 430 that is already connected, the connection step (S1422) is omitted.

When the automatic recovery command is transmitted from the central cloud 410, the edge manager agent 431 receives and executes it. The edge manager agent 431 then transmits the execution result of the failure recovery command to the central cloud 410.

The central cloud 410 receives the execution result from the edge manager agent 431 and transmits it to the operator terminal 530 (S1424). The operator terminal 530 then displays the execution result received from the central cloud 410 on the screen.

If the execution result is a success (S1431), it is checked whether failure recovery has a next step (S1432). If a next task exists, the process returns to the step in which the user chooses execution (S1415). If no next task exists, recovery processing for the failure ends.

If the automatic recovery command is judged to have failed three times in a row (S1441), then, according to the failure recovery policy set by the administrator, the recovery processing information of the failure is changed from the failed automatic recovery command to operator manual recovery or third-party expert recovery, and recovery processing for the failure returns to step S1413.

In addition, a failure determined in step S1413 not to be subject to automatic recovery is checked to see whether it is subject to operator manual recovery (S1451). If it is, the central cloud 410 handles failure recovery according to the manual commands issued by the operator (S1453). Here, the central cloud 410 provides the operator terminal 530 with an operator manual for failure recovery, executes the commands issued by the operator on the edge cloud 430, and transmits the execution results to the operator terminal 530. If the operator's manual recovery fails, the failure recovery information of that failure may be changed to third-party expert recovery.

In addition, if the failure recovery information of the failure is third-party expert recovery, the central cloud 410 refers to the third-party expert information and sends notification of the failure recovery (S1461).
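The escalation among the three recovery types in FIG. 14 can be summarized in a short Python sketch. It is a simplification under stated assumptions: each automatic recovery step is a callable returning `True` on success, operator confirmation of each step (S1415) is omitted, and the names `Recovery`, `recover`, `run_manual`, and `notify_expert` are hypothetical. The `fallback` argument stands in for the administrator's failure recovery policy.

```python
from enum import Enum
from typing import Callable, List


class Recovery(Enum):
    AUTO = "auto"
    MANUAL = "manual"
    EXPERT = "expert"


def _run_auto(steps: List[Callable[[], bool]], max_failures: int) -> bool:
    """Run each step in order; give up once one step fails max_failures
    times in a row (S1441)."""
    for step in steps:
        if not any(step() for _ in range(max_failures)):
            return False
    return True


def recover(mode: Recovery,
            auto_steps: List[Callable[[], bool]],
            run_manual: Callable[[], bool],
            notify_expert: Callable[[], None],
            fallback: Recovery = Recovery.MANUAL,
            max_failures: int = 3) -> str:
    if fallback is Recovery.AUTO:
        raise ValueError("fallback must be MANUAL or EXPERT")
    if mode is Recovery.AUTO:                 # S1413
        if _run_auto(auto_steps, max_failures):
            return "recovered:auto"
        mode = fallback                       # policy-driven change of mode
    if mode is Recovery.MANUAL:               # S1451, S1453
        if run_manual():
            return "recovered:manual"
        mode = Recovery.EXPERT                # manual recovery failed
    notify_expert()                           # S1461
    return "escalated:expert"
```

Passing `fallback=Recovery.EXPERT` models a policy that skips manual recovery and notifies the expert directly after three failed automatic attempts.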

Although the present invention has been described with reference to limited embodiments and drawings, it is not limited to them, and various modifications and variations may of course be made by those of ordinary skill in the art to which the invention pertains, within the technical idea of the invention and the scope of equivalents of the claims set out below.

100: cloud system

Claims

A device for managing failures using big data of a 5G distributed cloud system, comprising:
a message conversion unit that reads log lines one by one from a message queue storing logs of big data generated in the infrastructure of the system, adds log lines having the same message key to the same line buffer, and, when there are no more such log lines, combines the log lines stored in the line buffer and converts them into a single failure-related message;
a failure detection unit that detects failures by applying a rule engine to the converted message;
a failure graphic providing unit that, for on-screen display of each detected failure, generates topology-based graphic information for the infrastructure object in which the failure occurred and provides it to an operator terminal; and
a failure recovery unit that generates failure recovery information for the detected failure and provides it to the operator terminal.

The device of claim 1,
wherein the message conversion unit collects and stores logs for each object of a cloud infrastructure that includes hardware, operating system, cloud platform, and network platform objects.

The device of claim 1,
wherein the message conversion unit converts at least one log line in each log file into a message for detecting a failure.

The device of claim 1,
wherein the message conversion unit stores the log lines and the converted message using a structure comprising:
a server name of the server from which the log was collected;
a file name of the file in which the log is recorded;
a line ID identifying a message converted by analyzing at least one log line in the log;
a message key composed of the server name, the file name, and the line ID of the message; and
a line buffer in which each log line analyzed for the message is stored.

The device of claim 1,
wherein the failure detection unit detects a filtering key from a message using a structure of the rule engine, and detects the occurrence of a failure when the filter condition applied to the detected filtering key is satisfied in the message, the structure comprising:
a rule group name containing at least one filtering rule;
the filtering rule;
a filter condition processed in the filtering rule; and
a filtering key that is detected from the message and to which the filter condition is applied.

The device of claim 1,
wherein the failure detection unit uses a structure of the rule engine comprising:
a rule name including hierarchical structure information of the infrastructure in which the failure is detected;
a rule description describing the failure;
a log file name of the log file in which the failure is detected;
a log rank defining the rank of the log;
a case setting defining whether uppercase and lowercase letters are distinguished in keywords in the log;
at least one filtering rule including AND and OR filter conditions and at least one keyword to which the filter conditions are applied, for detecting the failure from the log lines corresponding to the message;
recovery process information of automatic recovery, operator manual recovery, and third-party expert recovery executed to recover the failure detected by satisfying the filtering rule; and
a failure class defining the class of the failure.

The device of claim 1,
further comprising a graphic DB in which objects of the infrastructure and hierarchical connection information between the objects are stored for screen display,
wherein the failure graphic providing unit queries the graphic DB for the objects and hierarchical connection information corresponding to the infrastructure requested by the operator terminal, generates graphic information from the hierarchical connection information between the retrieved objects, and provides the generated graphic information to the operator terminal.

The device of claim 1,
wherein the failure graphic providing unit, if a failure exists in an object included in the graphic information displayed on the operator terminal, transmits the failure information to the operator terminal for screen display.

The device of claim 1,
wherein the failure recovery unit provides the failure recovery information to the operator terminal and recovers the failure by executing an automatic recovery process, an operator manual recovery process, and a third-party expert recovery process according to the operator's request.

The device of claim 1,
wherein the failure recovery unit, either automatically or at an operator's request, transmits an automatic recovery command to the infrastructure in which the failure occurred and provides the result of processing the command to the operator terminal.

The device of claim 10,
wherein the failure recovery unit, when the automatic recovery fails, transmits an operator manual recovery command to the infrastructure in which the failure occurred at the operator's request and provides the result of processing the command to the operator terminal,
and, when the automatic recovery or the operator manual recovery fails, notifies an expert terminal of the failure information for recovery by a third-party expert.

A method executed by a device for managing failures using big data of a 5G distributed cloud system, the method comprising:
reading log lines one by one from a message queue storing logs of big data generated in the infrastructure of the system, adding log lines having the same message key to the same line buffer, and, when there are no more such log lines, combining the log lines stored in the line buffer and converting them into a single failure-related message;
detecting a failure by applying a rule engine to the converted message;
generating, for on-screen display of each detected failure, topology-based graphic information for the infrastructure object in which the failure occurred and providing it to an operator terminal; and
recovering the failure by generating failure recovery information for the detected failure and providing it to the operator terminal.

The method of claim 12,
wherein the converting into the message comprises collecting and storing logs for each object of a cloud infrastructure that includes hardware, operating system, cloud platform, and network platform objects.

The method of claim 12,
wherein the converting into the message comprises converting at least one log line in each log file into a message for detecting a failure.

The method of claim 12,
wherein the converting into the message comprises storing the log lines and the converted message using a structure comprising:
a server name of the server from which the log was collected;
a file name of the file in which the log is recorded;
a line ID identifying a message converted by analyzing at least one log line in the log;
a message key composed of the server name, the file name, and the line ID of the message; and
a line buffer in which each log line analyzed for the message is stored.

The method of claim 12,
wherein the detecting of the failure comprises detecting a filtering key from a message using a structure of the rule engine, and detecting that a failure has occurred when the filter condition applied to the detected filtering key is satisfied in the message, the structure comprising:
a rule group name containing at least one filtering rule;
the filtering rule;
a filter condition processed in the filtering rule; and
a filtering key that is detected from the message and to which the filter condition is applied.

The method of claim 12,
wherein the detecting of the failure uses a structure of the rule engine comprising:
a rule name including hierarchical structure information of the infrastructure in which the failure is detected;
a rule description describing the failure;
a log file name of the log file in which the failure is detected;
a log rank defining the rank of the log;
a case setting defining whether uppercase and lowercase letters are distinguished in keywords in the log;
at least one filtering rule including AND and OR filter conditions and at least one keyword to which the filter conditions are applied, for detecting the failure from the log lines corresponding to the message;
recovery process information of automatic recovery, operator manual recovery, and third-party expert recovery executed to recover the failure detected by satisfying the filtering rule; and
a failure class defining the class of the failure.

The method of claim 12,
wherein the device includes a graphic DB in which objects of the infrastructure and hierarchical connection information between the objects are stored for screen display,
and wherein the providing comprises querying the graphic DB for the objects and hierarchical connection information corresponding to the operator's request, generating graphic information from the hierarchical connection information between the retrieved objects, and providing the generated graphic information to the operator terminal.

The method of claim 12,
wherein the providing comprises, if a failure exists in an object included in the graphic information displayed on the operator terminal, transmitting the failure information to the operator terminal for screen display.

The method of claim 12,
wherein the recovering comprises providing the failure recovery information to the operator terminal and recovering the failure by executing an automatic recovery process, an operator manual recovery process, and a third-party expert recovery process according to the operator's request.

The method of claim 12,
wherein the recovering comprises transmitting, either automatically or at an operator's request, an automatic recovery command to the infrastructure in which the failure occurred, and providing the result of processing the command to the operator terminal.

The method of claim 21,
wherein the recovering comprises:
when the automatic recovery fails, transmitting an operator manual recovery command to the infrastructure in which the failure occurred at the operator's request, and providing the result of processing the command to the operator terminal; and
when the automatic recovery or the operator manual recovery fails, notifying an expert terminal of the failure information for recovery by a third-party expert.