KR101964412B1

KR101964412B1 - Method for diagnosing anomaly log of mobile communication data processing system and system thereof

Info

Publication number: KR101964412B1
Application number: KR1020180160346A
Authority: KR
Inventors: 이정표; 이규민; 박관영
Original assignee: 주식회사 모비젠
Priority date: 2018-12-12
Filing date: 2018-12-12
Publication date: 2019-04-01

Abstract

The present invention provides a method for diagnosing the occurrence of an anomaly log of a mobile communication data processing system which can diagnose an anomaly of a log outputted in real time by a mobile communication data processing system by using a machine learning algorithm. According to an embodiment of the present invention, the method for diagnosing the occurrence of an anomaly log of a mobile communication data processing system comprises: a second log generation step of accumulating a log collected at each first unit time to collect logs for a second unit time; a group classification step of classifying the logs for the second unit time into one or more groups based on the similarity between the logs; a sequence generation step of generating a log sequence for the second unit time based on a temporal order in which the logs are collected and the classified groups; and an anomaly diagnosis step of accumulating the log sequence to generate a log sequence matrix, and applying the generated log sequence matrix to a module which performs a machine learning algorithm to diagnose an anomaly occurrence of the mobile communication data processing system.

Description

Method and system for diagnosing the occurrence of anomaly logs in a mobile communication data processing system

The present invention relates to a method and system for diagnosing the occurrence of anomaly logs in a mobile communication data processing system. More specifically, it relates to a method and system that, when a mobile communication data processing system handling a vast amount of data outputs system logs in real time, automatically analyzes those system logs with deep learning techniques in order to diagnose the occurrence of an anomaly.

In the modern era, often called the age of big data, many industries build and operate large-scale data processing systems. Among them, mobile communication generates data on a petabyte scale, and data center stability is absolutely necessary in order to store data generated every second in real time while maintaining its integrity.

Securing data center stability requires operating a variety of equipment. When a variety of equipment is used, the data it produces is stored in a distributed manner, and only by managing this distributed data efficiently can the integrity of data generated every second be ensured.

Whether the many pieces of equipment (modules) are operating normally or malfunctioning can be determined from the system logs that the equipment generates. Given the volume and generation rate of logs in existing data processing systems, however, it is practically impossible for a human administrator to check every log directly.

To overcome this limitation, several methods are already known that analyze the logs of a mobile communication data processing system with a machine learning algorithm in order to diagnose whether the system has output an anomaly log. None of these methods is optimized, however, and their usefulness is limited.

Here, "not optimized" is best understood not as a problem with the machine learning module (device) that receives the logs of the mobile communication data processing system (hereinafter, "logs") as input and performs machine learning, but as insufficient pre-processing of the logs input to that module. Therefore, to diagnose anomaly logs of a mobile communication data processing system efficiently, a methodology for appropriately pre-processing the logs input to the machine learning module needs to be examined.

Korean Registered Patent Publication No. 10-1621959 (published June 3, 2016)

The technical problem to be solved by the present invention is to provide a method and system that can use a machine learning algorithm to diagnose anomalies in the logs that a mobile communication data processing system outputs in real time.

To solve the above technical problem, a method according to an embodiment of the present invention is a method for diagnosing the occurrence of anomaly logs in a mobile communication data processing system, and comprises: a second log generation step of accumulating logs collected every first unit time to collect logs for a second unit time; a cluster classification step of classifying the logs for the second unit time into at least one cluster based on the similarity between the logs; a sequence generation step of generating a log sequence for the second unit time based on the temporal order in which the logs were collected and on the classified clusters; and an anomaly diagnosis step of accumulating the log sequences to generate a log sequence matrix and applying the generated log sequence matrix to a module that performs a machine learning algorithm, thereby diagnosing the occurrence of an anomaly in the mobile communication data processing system.

To solve the above technical problem, a system according to another embodiment of the present invention is a system for diagnosing the occurrence of anomaly logs in a mobile communication data processing system, and comprises: a second log generation unit that accumulates logs collected every first unit time to generate logs for a second unit time; a cluster classification unit that classifies the logs for the second unit time into at least one cluster based on the similarity between the logs; a sequence generation unit that generates a log sequence for the second unit time based on the temporal order in which the logs were collected and on the classified clusters; and an anomaly diagnosis unit that accumulates the log sequences to generate a log sequence matrix and applies the generated log sequence matrix to a module that performs a machine learning algorithm, thereby diagnosing the occurrence of an anomaly in the mobile communication data processing system.

An embodiment of the present invention can provide a computer-readable recording medium storing a program for executing the method.

According to the present invention, anomalies in the logs output by a mobile communication data processing system that processes a large amount of data in real time can be diagnosed quickly and accurately.

FIG. 1 is a diagram schematically illustrating the process by which an anomaly log of a mobile communication data processing system is diagnosed according to the present invention.
FIG. 2 is a block diagram of an example of an anomaly log diagnosis system according to the present invention.
FIG. 3 is a diagram showing an example of the logs for the second unit time.
FIG. 4 is a diagram showing an example of an event template preset in the cluster classification unit.
FIG. 5 is a diagram for explaining the process by which the logs included in the logs for the second unit time are classified into clusters by the cluster classification unit, and the result of that classification.
FIG. 6 is a flowchart of an example of a method for diagnosing the occurrence of anomaly logs in a mobile communication data processing system according to the present invention.

The present invention can be modified in various ways and can have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. The effects and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the drawings. The present invention is not limited to the embodiments disclosed below, however, and may be implemented in various forms.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In the description, identical or corresponding components are given the same reference numerals, and duplicate descriptions of them are omitted.

In the following embodiments, terms such as "first" and "second" are used not in a limiting sense but to distinguish one component from another.

In the following embodiments, singular expressions include plural expressions unless the context clearly indicates otherwise.

In the following embodiments, terms such as "include" or "have" mean that the features or components described in the specification are present, and do not exclude in advance the possibility that one or more other features or components may be added.

Where an embodiment can be implemented differently, a particular process order may differ from the order described. For example, two processes described in succession may be performed substantially simultaneously, or may be performed in the reverse of the order described.

FIG. 1 is a diagram schematically illustrating the process by which an anomaly log of a mobile communication data processing system is diagnosed according to the present invention.

First, the mobile communication data processing system (100) is a system that communicates with mobile terminals such as smartphones and tablet personal computers. While communicating with at least one mobile terminal, it organizes the communication records in a preset format and outputs them as a system log. Because the mobile communication data processing system (100) can communicate with a very large number of mobile terminals, the cumulative volume of the system logs it outputs can range from as little as a few bytes to as much as several petabytes.

To guarantee the integrity of the data that passes through it or is stored in it, the mobile communication data processing system (100) needs an operation for quickly identifying any error that has occurred in the system or any problem with the data it has received. For this reason, the mobile communication data processing system (100) outputs a system log at its own period, and an administrator analyzes the system log (110) to monitor and maintain the operating characteristics of the mobile communication data processing system (100).

Considering, however, that it is practically impossible for a human administrator to analyze the system log (110) completely in real time, the present invention proposes a method that applies the system log (110) to a module performing a machine learning algorithm so that errors recorded in the system log can be detected.

The system log (110) is output from the mobile communication data processing system (100) and processed into a first log sequence (130) by a sequence template (120). More specifically, when the system log (110) is output from the mobile communication data processing system (100), it is collected by the sequence template (120), and the sequence template (120) converts the unstructured system log (110) into the first log sequence (130) according to an internally defined sequence conversion policy.

The sequence template (120) converts the unstructured system log (110), according to a fixed conversion policy, into a form that is easy for the machine learning algorithm module to learn. The sequence conversion policy preset in the sequence template (120) may be changed by the administrator as needed.

The first log sequence (130) is the result of converting the system log (110) with the sequence template (120), and refers to sequence information in which the system log (110) is divided by a specific unit time and linked together. Even the same system log (110) may be converted into a different first log sequence (130) each time, depending on changes to the sequence conversion policy preset in the sequence template (120) or to the output period time at which the mobile communication data processing system (100) outputs the system log (110). The first log sequence (130) functions as training data that can be learned by the machine learning algorithm module (140), and may correspond to the second log sequence (160) described later. Although not shown in FIG. 1, depending on the embodiment, the first log sequences (130) for several intervals may be accumulated and further processed into a sequence matrix before being input to the machine learning algorithm module (140). The output period time and the training data are described later together with FIG. 2.

The machine learning algorithm module (140) is a module that performs a machine learning algorithm with the first log sequence (130) as input data. The concept covers not only physical devices but also logical devices implemented by code written in machine language or a computer language. The machine learning algorithm module (140) can be trained with various machine learning algorithms. For example, the machine learning algorithm module (140) may be an autoencoder module. As another example, it may be a long short-term memory encoder-decoder (LSTM Encoder-Decoder) module. Autoencoders and LSTM encoder-decoders are widely known deep learning algorithms, so further description of them is omitted.

The trained machine learning algorithm module (150) is the machine learning algorithm module (140) after it has received the first log sequence (130) as input data and completed training on it through an iterative learning process. For example, if the machine learning algorithm module (140) is an autoencoder module that has learned the first log sequence (130), the trained autoencoder module receives the first log sequence (130) at its input layer and outputs the same value as the first log sequence (130) at its output layer. Moreover, even when the input layer receives a first log sequence (130) to which noise has been added, the output layer still outputs the same value as the first log sequence (130). In other words, an autoencoder module that has completed training on the first log sequence (130) has a denoising function with respect to the first log sequence (130).

The second log sequence (160) is data generated from a system log output by the mobile communication data processing system (100). It is the information produced when a system log output at a different time from the system log (110) described above, more precisely at a time after the system log (110) was being output from the mobile communication data processing system (100), is converted into log sequence form by the sequence template (120). With respect to the machine learning algorithm module (140), the first log sequence (130) is training data and the second log sequence (160) functions as validation data. Like the first log sequence (130), the second log sequence (160) may, depending on the embodiment, be further processed into log sequence matrix form before being applied to the trained machine learning algorithm module (150).

The anomaly log diagnosis system (170) monitors the process by which the trained machine learning algorithm module (150) outputs data after receiving the second log sequence (160), and diagnoses whether abnormal operation of the system is recorded in the raw log of the system (that is, the mobile communication data processing system) from which the second log sequence (160) was generated. Depending on the embodiment, the anomaly log diagnosis system (170) according to the present invention may be physically or logically connected to the mobile communication data processing system (100), or may be implemented as part of the mobile communication data processing system (100).

As an example of how the anomaly log diagnosis system (170) diagnoses that an anomaly log has been output from the mobile communication data processing system (100): when a second log sequence (160) built from a raw log produced during an abnormal system condition is input to the trained autoencoder module, the autoencoder module passes it through the input layer, hidden layer, and output layer to reconstruct the second log sequence (160), and the reconstruction error value calculated during this reconstruction exceeds a threshold. The anomaly log diagnosis system (170) can detect that an anomaly log has been generated by the mobile communication data processing system (100) based on whether this threshold is exceeded. Here, the threshold can be set to a reasonable value from data accumulated mathematically, experimentally, or empirically.
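The threshold check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the mean squared error metric, the function names, and the toy stand-in for a trained autoencoder (`toy_reconstruct`) are all assumptions introduced here.

```python
from typing import Callable, Sequence


def reconstruction_error(original: Sequence[float], rebuilt: Sequence[float]) -> float:
    """Mean squared error between an input sequence and its reconstruction."""
    if not original or len(original) != len(rebuilt):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(original, rebuilt)) / len(original)


def is_anomalous(
    sequence: Sequence[float],
    reconstruct: Callable[[Sequence[float]], Sequence[float]],
    threshold: float,
) -> bool:
    """Flag the sequence when its reconstruction error exceeds the threshold."""
    return reconstruction_error(sequence, reconstruct(sequence)) > threshold


# Hypothetical stand-in for a trained autoencoder: it has "learned" a single
# pattern and always reproduces it, so unfamiliar inputs reconstruct poorly.
LEARNED_PATTERN = [1.0, 2.0, 1.0, 3.0]


def toy_reconstruct(sequence: Sequence[float]) -> Sequence[float]:
    return LEARNED_PATTERN


normal = [1.0, 2.0, 1.0, 3.0]
abnormal = [1.0, 9.0, 9.0, 3.0]
print(is_anomalous(normal, toy_reconstruct, threshold=0.5))    # False
print(is_anomalous(abnormal, toy_reconstruct, threshold=0.5))  # True
```

In a real deployment `reconstruct` would wrap the forward pass of the trained module, and the threshold would be chosen from accumulated data as the description states.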

FIG. 2 is a block diagram of an example of an anomaly log diagnosis system according to the present invention.

Referring to FIG. 2, the anomaly log diagnosis system (200) according to the present invention includes a second log generation unit (210), a cluster classification unit (230), a sequence generation unit (250), and an anomaly diagnosis unit (270). For convenience, components identical to those described for FIG. 1 are described below with reference to FIG. 1, and unless otherwise specified, "log" refers to the system log (110) of FIG. 1.

The second log generation unit (210), cluster classification unit (230), sequence generation unit (250), and anomaly diagnosis unit (270) included in the anomaly log diagnosis system (200) according to the present invention may each correspond to at least one processor or may include at least one processor. Accordingly, these units may operate as part of another hardware device such as a microprocessor or a general-purpose computer system.

The second log generation unit (210) accumulates the logs collected every first unit time to generate logs for the second unit time. The first unit time is the period at which the mobile communication data processing system (100) outputs the system log. For example, if the mobile communication data processing system (100) outputs the system log (110) every second, the first unit time can be 1 second. Here, 1 second is only an example; depending on the embodiment, the first unit time may be shorter or longer than 1 second.

The second unit time is the sum of the first unit times of the system logs (110) included in the log generated by accumulating the logs collected every first unit time, and it is longer than the first unit time. For example, if the second log generation unit (210) accumulates 8 seconds of system logs (110) to generate the log for the second unit time, the first unit time and the second unit time are 1 second and 8 seconds, respectively. In this example, if 40 seconds of the system log (110) of the mobile communication data processing system (100) are available, a total of 5 units of logs for the second unit time can be generated. The second unit time is a positive integer multiple of the first unit time. Here, 8 seconds is only an example; depending on the embodiment, the second unit time may be shorter or longer than 8 seconds.
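The accumulation of first-unit-time logs into second-unit-time windows can be sketched as below. The function name `build_windows`, the use of non-overlapping windows, and the choice to drop a trailing partial window are assumptions of this sketch; the example reproduces the 40-second, 8-second case from the text.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_windows(logs: Sequence[T], ratio: int) -> List[List[T]]:
    """Group consecutive first-unit-time logs into second-unit-time windows.

    `ratio` is (second unit time) / (first unit time), a positive integer.
    A trailing partial window is dropped (an assumption of this sketch).
    """
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")
    full = len(logs) // ratio
    return [list(logs[i * ratio:(i + 1) * ratio]) for i in range(full)]


# 40 seconds of logs, one log per second (first unit time = 1 s).
logs = [f"log@{t}s" for t in range(40)]
windows = build_windows(logs, ratio=8)  # second unit time = 8 s

print(len(windows))                    # 5
print(windows[0][0], windows[0][-1])   # log@0s log@7s
```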

The cluster classification unit (230) classifies the logs for the second unit time into at least one cluster based on the similarity between the logs. The cluster classification unit (230) first determines the number of logs collected for the second unit time, then calculates the similarity between the logs according to a preset similarity formula, and forms at least one cluster according to that similarity. To carry out this process, the cluster classification unit (230) stores in advance the formula for calculating the similarity and the reference value for classifying the logs for the second unit time into at least one cluster based on the calculated similarity. The specific method by which the cluster classification unit (230) classifies the logs for the second unit time into at least one cluster is described below.

First, the cluster classification unit (230) determines the number of units of the system log (110) included in the logs for the second unit time. For example, if the second unit time is 8 seconds and the first unit time is 1 second, the number of units of the system log (110) is 8, and the maximum number of clusters the cluster classification unit (230) can generate is also 8.

The cluster classification unit (230) tokenizes the system logs (110) included in the logs for the second unit time. The system log (110) output from the mobile communication data processing system (100) contains constant values and variable values. A constant value is a part generated in a fixed way by the source code that makes up the system, and a variable value is a part generated dynamically, such as an Internet Protocol (IP) address or a port. As part of pre-processing, the cluster classification unit (230) performs log parsing on the logs for the second unit time, and the characteristic information contained in each log is converted into individual tokens through this parsing.
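A minimal sketch of the tokenization step follows. Splitting on whitespace, the two regular expressions, the helper names, and the sample log line are all illustrative assumptions; the text does not specify how tokens are delimited or how variable values are recognized.

```python
import re
from typing import List

# Heuristic patterns for dynamically generated (variable) values.
IP_PORT = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?$")  # e.g. 10.0.0.7:8080
NUMBER = re.compile(r"^[+-]?\d+(?:\.\d+)?$")                 # e.g. 35 or 3.14


def tokenize(line: str) -> List[str]:
    """Split a raw log line into tokens on whitespace."""
    return line.split()


def looks_variable(token: str) -> bool:
    """Guess whether a token is a variable value rather than a constant."""
    return bool(IP_PORT.match(token) or NUMBER.match(token))


# Made-up log line, used only for illustration.
line = "Connection from 10.0.0.7:8080 closed after 35 seconds"
tokens = tokenize(line)
print(tokens)
print([t for t in tokens if looks_variable(t)])  # ['10.0.0.7:8080', '35']
```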

The cluster classification unit (230) may calculate the similarity between the logs for the first unit time and place logs whose similarity exceeds a threshold in the same cluster.

Equations 1 and 2 are examples of the formulas that the cluster classification unit (230) uses to calculate the similarity between the logs for the first unit time that are included in the log for the second unit time. In Equation 1, S(log_i, log_j) is the similarity between two logs, log_i is an arbitrary i-th log for the first unit time, |log_i| is the number of tokens in the i-th log for the first unit time, and log_i(j) is the j-th token in the i-th log for the first unit time. In Equation 2, the function D compares tokens from the two logs and returns 1 if both tokens are real numbers, both are the same word, or both are the same symbol, and returns 0 otherwise. The present invention does not limit the calculation of similarity between logs to a specific method, so depending on the embodiment the cluster classification unit (230) may calculate the similarity between logs in a way other than Equations 1 and 2.
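Equations 1 and 2 are not reproduced in this text, so the sketch below is only one plausible reading of the description: D is applied position by position, and the matches are normalized by the longer token count. Both the normalization and the numeric pattern are assumptions.

```python
import re
from typing import Sequence

NUMBER = re.compile(r"^[+-]?\d+(?:\.\d+)?$")


def d(token_a: str, token_b: str) -> int:
    """Return 1 if both tokens are numbers or are the same word/symbol, else 0."""
    if NUMBER.match(token_a) and NUMBER.match(token_b):
        return 1
    return 1 if token_a == token_b else 0


def similarity(log_i: Sequence[str], log_j: Sequence[str]) -> float:
    """Fraction of positions whose tokens match under d().

    Normalizing by the longer log is an assumption of this sketch.
    """
    longest = max(len(log_i), len(log_j))
    if longest == 0:
        return 0.0
    return sum(d(a, b) for a, b in zip(log_i, log_j)) / longest


a = "at port 8080 link up".split()
b = "at port 9090 link down".split()
print(similarity(a, b))  # 0.8
```

Here four of the five positions match (the two port numbers count as a match because both are numeric), giving 0.8.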

FIG. 3 is a diagram showing an example of the logs for the second unit time.

Referring to FIG. 3, the second log generation unit (210) generates the log for the second unit time by bundling, four at a time, the system logs output by the system every first unit time. In FIG. 3, the log number indicates the temporal order in which the system log was output from the mobile communication data processing system (100). For example, assuming that the first unit time is 1 second, the mobile communication data processing system (100) outputs the log with log number #1 (310) and then, 1 second later, outputs the log with log number #2 (330). Because the second unit time is set to four times the first unit time, the second log generation unit (210) generates the log for the second unit time by including four logs, from the log with log number #1 (310) through the log with log number #4 (370).

As an optional embodiment, the cluster classification unit 230 may classify the logs into clusters in at least two passes, each using a different classification criterion. Because the time needed to complete a cluster analysis tends to grow exponentially as the volume of logs grows, the cluster classification unit 230 may, before applying the cluster analysis, first form primary clusters by classifying the logs according to a simple criterion that is fast to evaluate, and then form secondary clusters from those primary clusters.

As part of forming the primary clusters, the cluster classification unit 230 classifies the logs by their number of tokens. For example, logs with six tokens belong to a cluster that contains only six-token logs.

Next, once the primary clusters have been formed, the cluster classification unit 230 clusters the logs of each primary cluster a second time according to whether their first tokens are identical. For example, the primary cluster of six-token logs may be split into a secondary cluster whose first token is 'at' and a secondary cluster whose first token is 'system', so the total number of secondary clusters is always greater than or equal to the number of primary clusters.
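The two inexpensive pre-clustering passes can be sketched as dictionary grouping. The function name `precluster`, the whitespace tokenisation, and the sample messages are assumptions for illustration.

```python
from collections import defaultdict

def precluster(logs):
    """Group logs by token count (primary) and by (token count, first token) (secondary)."""
    primary = defaultdict(list)
    secondary = defaultdict(list)
    for log in logs:
        tokens = log.split()
        if not tokens:
            continue  # skip empty messages
        primary[len(tokens)].append(log)
        secondary[(len(tokens), tokens[0])].append(log)
    return dict(primary), dict(secondary)

# Hypothetical messages: three with six tokens, one with two.
sample_logs = [
    "at line 10 of file main",
    "system disk usage reached 90 percent",
    "at step 3 in job cleanup",
    "connection closed",
]
primary, secondary = precluster(sample_logs)
```

Each secondary key refines exactly one primary key, which is why the number of secondary clusters can never be smaller than the number of primary clusters.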

The cluster classification unit 230 may form the final (tertiary) clusters by applying a similarity formula to the secondary clusters obtained from the token count and the identity of the first token. Equations 1 and 2 may be used as the similarity formula, but calculating the similarity in some other manner does not depart from the scope of the present invention.

Once the final clusters have been formed, the cluster classification unit 230 may apply the Smith-Waterman algorithm, a string alignment method, to align the messages of the logs in each final cluster before extracting a log event (a representative log message) from each final cluster. The Smith-Waterman algorithm is well known, so descriptions of it beyond what is needed to implement the present invention are omitted below. After the cluster classification unit 230 has classified the logs into final clusters and the Smith-Waterman algorithm has aligned the strings of the logs in each final cluster, a log event is defined for each cluster through the process described below.

FIG. 4 is a diagram showing an example of an event template preset in the cluster classification unit.

The event template 400 of FIG. 4 is the standard that the cluster classification unit 230 refers to when it converts the information stored in the logs into tokens through log parsing and classifies each log as an event with a specific number; an administrator may change it as needed. The event template 400 of FIG. 4 includes a first template 410, a second template 430, and a third template 450.

The first template 410 defines 'PacketResponder', 'for block blk_', and 'terminating' as tokens, and defines a log for the first unit time in which all three tokens are present as the first event.

The second template 430 defines 'Received block blk_', 'of size', and 'from' as tokens, and defines a log for the first unit time in which all three tokens are present as the second event.

The third template 450 defines 'Verification succeeded for blk_' as a token, and defines a log for the first unit time in which that token is present as the third event.

The event template 400 of FIG. 4 defines three events in total, but the number of events defined in the event template 400 may of course be fewer or more than three depending on the embodiment.
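The three templates can be sketched as a lookup from event number to the tokens that must all be present. This is a simplified illustration: the substring test, the name `classify`, and the sample log messages are assumptions, and the replacement of variable values with the predefined field value is not modelled here.

```python
# Event number -> tokens that must all appear in the log message (from the templates above).
EVENT_TEMPLATES = {
    1: ["PacketResponder", "for block blk_", "terminating"],
    2: ["Received block blk_", "of size", "from"],
    3: ["Verification succeeded for blk_"],
}

def classify(log_message):
    """Return the number of the first event whose tokens all occur in the message, else None."""
    for event_number, tokens in EVENT_TEMPLATES.items():
        if all(token in log_message for token in tokens):
            return event_number
    return None

# Hypothetical messages in the order they were output (log numbers #1 to #4).
messages = [
    "PacketResponder 1 for block blk_38865049 terminating",
    "PacketResponder 0 for block blk_67178 terminating",
    "Received block blk_16 of size 91178 from /10.250.10.6",
    "Verification succeeded for blk_49",
]
events = [classify(message) for message in messages]
```

A message that matches no template yields `None`, which a fuller implementation might route to a new or unknown event.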

FIG. 5 is a diagram for explaining the process by which the cluster classification unit classifies the logs included in the log for the second unit time into clusters, and the result of that classification.

The following description refers to FIGS. 3 and 4.

The cluster classification unit 230 applies the four first-unit-time logs contained in the log for the second unit time of FIG. 3 to the event template 400 of FIG. 4. The cluster classification unit 230 splits the message stored in each log on whitespace and, according to the events defined in the event template 400, classifies the first-unit-time logs that the anomaly log diagnosis system 200 holds for the second unit time into one or more events.

For example, the logs with log numbers 1 and 2 in FIG. 3, which contain all of 'PacketResponder', 'for block blk_', and 'terminating', are classified as the first event. In this process, the cluster classification unit 230 treats everything in those two logs other than the tokens they share (such as 'PacketResponder', 'for block blk_', and 'terminating') as variable values of the log and replaces them with a predefined field value. For example, <*> in FIG. 4 may be such a predefined field value, and predefined field values may take various other forms as well.

After the variable values have been replaced with the predefined field value, the cluster classification unit 230 applies the first-unit-time logs to the event template 400 on the basis of the remaining tokens, producing as many events as there are first-unit-time logs. The cluster classification unit 230 classifies events with the same number into the same cluster.

According to FIG. 5, once the variable values have been replaced with the predefined field value, the cluster classification unit 230 detects that only the three tokens 'PacketResponder', 'for block blk_', and 'terminating' remain and classifies the log with log number 1 as event 1. Likewise, the logs with log numbers 2 through 4 are classified as event 1, event 2, and event 3, respectively, and the logs with log numbers 1 and 2, which had the same tokens, are classified as the same log event (event 1) and belong to the same cluster.

The cluster classification unit 230 pairs up all the logs contained in the log for the second unit time, determines the similarity of each pair, and classifies logs whose similarity exceeds a reference value (threshold) into the same cluster. For example, if the log for the second unit time contains four logs as in FIG. 3, the cluster classification unit 230 can classify them into one or more clusters through a total of six similarity comparisons, following the combination formula (4 choose 2 = 6) and Equations 1 and 2.
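The pairwise comparison can be sketched with `itertools.combinations`, which enumerates the six pairs for four logs. Several details are assumptions for illustration: merging clusters transitively through a union-find structure, the stand-in similarity function `positional_match`, the threshold of 0.5, and the sample messages.

```python
from itertools import combinations

def cluster_by_threshold(logs, similarity, threshold):
    """Compare every pair once and merge pairs whose similarity exceeds the threshold."""
    parent = list(range(len(logs)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    comparisons = 0
    for i, j in combinations(range(len(logs)), 2):
        comparisons += 1
        if similarity(logs[i], logs[j]) > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(logs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values()), comparisons

def positional_match(log_a, log_b):
    """Stand-in similarity: share of token positions holding identical tokens."""
    tokens_a, tokens_b = log_a.split(), log_b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    same = sum(a == b for a, b in zip(tokens_a, tokens_b))
    return same / max(len(tokens_a), len(tokens_b))

# Hypothetical messages for log numbers #1 to #4.
four_logs = [
    "PacketResponder 1 for block blk_1 terminating",
    "PacketResponder 2 for block blk_2 terminating",
    "Received block blk_1 of size 100 from 10.0.0.1",
    "Verification succeeded for blk_1",
]
clusters, comparisons = cluster_by_threshold(four_logs, positional_match, 0.5)
```

The returned clusters hold indices into the input list, so the first two messages end up together while the other two each form their own cluster.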

Once the cluster classification unit 230 has classified the logs for the second unit time into one or more clusters based on similarity, the sequence generation unit 250 generates a log sequence for the second unit time based on the temporal order in which the first-unit-time logs contained in the log for the second unit time were collected and on the clusters (groups) assigned by the cluster classification unit 230.

For example, referring to FIG. 5, the sequence generation unit 250 determines that event 1, event 1, event 2, and event 3 occurred, in the temporal order in which the data was output from the mobile communication data processing system 100, and can generate the log sequence [E1, E1, E2, E3]. Here, E denotes an event and the number following E is the event number. The components of a log sequence can thus be understood as event numbers or cluster numbers, and a log sequence can be converted into a log sequence vector through further computation.

Table 1

| Second unit time (4-second interval) | Log sequence | Log sequence vector |
| --- | --- | --- |
| T1 | [E1, E1, E2, E3] | [2, 1, 1, 0] |
| T2 | [E1, E3, E2, E3] | [1, 1, 2, 0] |
| T3 | [E1, E4, E4, E4] | [1, 0, 0, 3] |

Table 1 shows an example of the log sequences and log sequence vectors generated when the first unit time is assumed to be 1 second and the second unit time 4 seconds. In the log sequence for second unit time T1 in Table 1, event 1 occurs twice, event 2 once, and event 3 once, so the log sequence vector is [2, 1, 1, 0], which directly reflects the number of occurrences of each event.

As another example, in the log sequence for second unit time T3, event 1 occurs once and event 4 occurs three times, so the log sequence vector is [1, 0, 0, 3].
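The conversion from a log sequence to a log sequence vector is a per-event count. A minimal sketch, assuming events are represented by their integer numbers (1 for E1, and so on) and that the total number of events is known in advance; `sequence_to_vector` is a hypothetical name.

```python
def sequence_to_vector(sequence, num_events):
    """Count how many times each event number (1..num_events) occurs in the sequence."""
    vector = [0] * num_events
    for event_number in sequence:
        vector[event_number - 1] += 1
    return vector

# Log sequences from Table 1, written as event numbers.
table_1 = {
    "T1": [1, 1, 2, 3],
    "T2": [1, 3, 2, 3],
    "T3": [1, 4, 4, 4],
}
vectors = {name: sequence_to_vector(seq, num_events=4) for name, seq in table_1.items()}
```

The vector keeps the count of each event but discards the order in which the events occurred within the window.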

The anomaly diagnosis unit 270 accumulates log sequences to generate a log sequence matrix and applies that matrix to a module that performs a machine learning algorithm in order to diagnose the occurrence of an anomaly in the mobile communication data processing system 100.

Equation 3 expresses the log sequence matrix that the anomaly diagnosis unit 270 generates by combining all the log sequence vectors generated for T1 through T3 in Table 1. In Equation 3, E is the log sequence matrix generated by the anomaly diagnosis unit 270, and its element X_{i,j} indicates how many times event j occurred in the i-th log sequence.
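Equation 3 itself is not reproduced in this text. Assuming that the rows follow T1 to T3 and the columns follow events E1 to E4, stacking the three log sequence vectors of Table 1 would presumably give:

```latex
E =
\begin{pmatrix}
X_{1,1} & X_{1,2} & X_{1,3} & X_{1,4} \\
X_{2,1} & X_{2,2} & X_{2,3} & X_{2,4} \\
X_{3,1} & X_{3,2} & X_{3,3} & X_{3,4}
\end{pmatrix}
=
\begin{pmatrix}
2 & 1 & 1 & 0 \\
1 & 1 & 2 & 0 \\
1 & 0 & 0 & 3
\end{pmatrix}
```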

The anomaly diagnosis unit 270 generates a log sequence matrix such as that of Equation 3 and applies it to the machine learning algorithm module. The machine learning algorithm module may be the autoencoder described with reference to FIG. 1, but it is not limited to any particular type and may be a machine learning or deep learning algorithm module other than an autoencoder.

The machine learning algorithm module receives the log sequence matrix from the anomaly diagnosis unit 270 and trains on it as training data, thereby learning the normal pattern of the logs output from the mobile communication data processing system 100. Once this training is complete, when the module receives a log output from the mobile communication data processing system 100 (more precisely, the log sequence matrix generated from the log through the series of steps above), it can output information that lets an administrator recognize an anomaly in the received log.

For example, when a log sequence matrix generated from the logs output by the mobile communication data processing system 100 is input to a trained autoencoder module, the module outputs a matrix that reproduces the input log sequence matrix. The administrator can then determine that an anomalous log has occurred based on whether the reconstruction error value, calculated while the matrix is rebuilt through the hidden layer and output layer of the autoencoder, exceeds a threshold.
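The threshold test on the reconstruction error can be sketched independently of any particular autoencoder. The description does not state how the error is measured, so mean squared error is an assumption here, as are the function names, the threshold of 0.5, and the reconstructed values, which are made up purely to exercise the check.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between an input row and its reconstruction (assumed metric)."""
    if len(original) != len(reconstructed):
        raise ValueError("rows must have the same length")
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

def is_anomalous(original, reconstructed, threshold):
    """Flag the row when its reconstruction error exceeds the threshold."""
    return reconstruction_error(original, reconstructed) > threshold

# Illustrative values only: a row reproduced exactly, and a row reproduced poorly.
normal_row, normal_output = [2, 1, 1, 0], [2.0, 1.0, 1.0, 0.0]
odd_row, odd_output = [1, 0, 0, 3], [1.0, 1.0, 1.0, 1.0]
threshold = 0.5
flags = [
    is_anomalous(normal_row, normal_output, threshold),
    is_anomalous(odd_row, odd_output, threshold),
]
```

In practice the reconstructed rows would come from the trained model, and the threshold would be tuned by the administrator.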

Depending on the embodiment, the anomaly diagnosis unit 270 may not only include the machine learning algorithm module but may also implement the decision procedure by which the administrator judges a log to be anomalous. In such an embodiment, the anomaly diagnosis unit 270 additionally stores the threshold against which the reconstruction error value is compared, and the administrator may change the threshold at any time so that anomalous logs are diagnosed reasonably.

As an optional embodiment, the anomaly diagnosis unit 270 may extract a representative pattern from the logs belonging to the clusters classified by the cluster classification unit 230 and apply to the log sequence matrix a weight generated from the frequency with which that representative pattern appears in the log sequences generated by the sequence generation unit 250. In this optional embodiment, applying the weight yields a log sequence matrix that differs from the one in the embodiment described above.

More specifically, the anomaly diagnosis unit 270 calculates, through a series of steps, the weights by which the values of the log sequence matrix are multiplied and applies them to the matrix, so that the matrix better reflects the characteristics of the logs output from the mobile communication data processing system 100 and the machine learning algorithm module learns more efficiently.

The meaning of the representative pattern and the method of calculating the weight are described below.

First, a representative pattern is a log event that appears at least once in a log sequence. For example, referring to Table 1, event 1, event 2, and event 3 can each be a representative pattern for second unit time T1, but event 4 cannot. In other words, event 4 carries no meaningful information in the log sequence for second unit time T1.

Once the anomaly diagnosis unit 270 has determined which events are representative patterns in a log sequence, it generates a weight based on the frequency with which each representative pattern appears. A weight generated in this way is later applied only to the representative pattern it was computed for.

Equation 4 is an example of an equation that the anomaly diagnosis unit 270 uses to generate the weight. In Equation 4, TFIDF is the weight, p_i is a representative pattern, e_t is a log sequence, E is the log sequence matrix, TF is the term frequency, and IDF is the inverse document frequency. In Equation 4, if a representative pattern (a log event) is taken to correspond to one word, then the log sequence for a second unit time corresponds to one document.

Term frequency (TF) is a measure of how often a representative pattern appears in a log sequence. For example, the anomaly diagnosis unit 270 searches the entire log sequence for a particular second unit time, counts the log events identical to the representative pattern, and takes the counted value as the term frequency (TF) of that pattern. For example, at time T1 in Table 1 the term frequency of representative pattern E3 may be 1/4 (the count divided by the sequence length) or 1 (the raw count).

Depending on the embodiment, the anomaly diagnosis unit 270 may use its own equation to determine the term frequency.

Equation 5 is an example of an equation that the anomaly diagnosis unit 270 uses to determine the term frequency of a representative pattern, and it is stored in advance in the anomaly diagnosis unit 270. In Equation 5, f_{p_i, e_t} is the frequency of representative pattern p_i in log sequence e_t.

Next, document frequency (DF) is determined by how evenly the representative pattern p_i appears across the log sequence matrix E, rather than within a single log sequence e_t. For example, the anomaly diagnosis unit 270 may split the log sequence matrix into its individual log sequences, count the log sequences in which the representative pattern appears, and take that count as the document frequency of the pattern. For example, if the representative pattern is E4 in the log sequence matrix E of Equation 3, E4 appears only in the third log sequence, so the anomaly diagnosis unit 270 may set its document frequency to 1 (the raw count) or 1/3 (the count divided by the number of log sequences).

Inverse document frequency (IDF) is a measure of how evenly a particular word appears across documents; unlike document frequency, a larger value means the representative pattern appears more rarely in the log sequence matrix. For example, the inverse document frequency may be calculated by dividing the number of log sequences that make up the log sequence matrix E by the number of log sequences that contain the representative pattern and taking the common (base-10) logarithm of the result.

As with term frequency, the anomaly diagnosis unit 270 may, depending on the embodiment, use its own equation to determine the inverse document frequency.

Equation 6 is an example of an equation that the anomaly diagnosis unit 270 uses to determine the inverse document frequency of a representative pattern, and it is stored in advance in the anomaly diagnosis unit 270. In Equation 6, |E| is the number of log sequences contained in the log sequence matrix; for example, it is 3 for the matrix of Equation 3. The other term in Equation 6 is the number of log sequences in which the log event p_i appears.

In this optional embodiment, once the term frequency TF and the inverse document frequency IDF have been determined through, for example, Equations 5 and 6, the anomaly diagnosis unit 270 multiplies the two values to obtain the weight for the representative pattern. Because a weight TFIDF corresponds to a representative pattern, for the log sequence matrix of Equation 3 eight weights are calculated, equal to the number of representative patterns (the number of non-zero log event entries in the log sequence matrix). Each calculated weight TFIDF may be applied by multiplying it into the corresponding element of the log sequence matrix, and the new, weighted log sequence matrix becomes the input data of the machine learning algorithm module and is used to detect anomalies in the logs of the mobile communication data processing system 100.
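The weighting can be sketched on the matrix built from Table 1. Equations 4 to 6 are not reproduced in this text, so this sketch assumes one of the variants the description allows: TF as the count divided by the sequence length, and IDF as the base-10 logarithm of the number of log sequences divided by the number of sequences containing the pattern. The function names are hypothetical.

```python
import math

def tfidf_weights(matrix):
    """Return one TF*IDF weight per matrix element; zero-count elements get weight 0."""
    num_sequences = len(matrix)
    num_events = len(matrix[0]) if matrix else 0
    # Number of log sequences (rows) in which each event (column) appears at least once.
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(num_events)]
    weights = []
    for row in matrix:
        total = sum(row)
        weights.append([
            (count / total) * math.log10(num_sequences / doc_freq[j]) if count > 0 else 0.0
            for j, count in enumerate(row)
        ])
    return weights

def apply_weights(matrix, weights):
    """Multiply each matrix element by its weight."""
    return [[c * w for c, w in zip(row, w_row)] for row, w_row in zip(matrix, weights)]

# Log sequence matrix assembled from the vectors of Table 1 (rows T1..T3, columns E1..E4).
E = [[2, 1, 1, 0], [1, 1, 2, 0], [1, 0, 0, 3]]
weights = tfidf_weights(E)
weighted_E = apply_weights(E, weights)
representative_count = sum(1 for row in E for count in row if count > 0)
```

One consequence of this IDF variant is that E1, which appears in every log sequence, receives a weight of zero, while E4, which appears only in the third sequence, receives the largest weight.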

As another optional embodiment, if information on the temporal order of the logs generated during the second unit time is available, the present invention can also detect anomalous logs of the mobile communication data processing system 100 through a long short-term memory (LSTM) encoder-decoder module. When the machine learning algorithm module is an autoencoder module, the log sequence matrix described above can be used as input data as it is to detect log anomalies, but when the machine learning algorithm module is an LSTM encoder-decoder module, an additional preprocessing step is required.

First, based on the temporal order in which the logs were collected and the clusters classified by the cluster classification unit 230, the sequence generation unit 250 generates a log sequence for a third unit time, which is a positive integer multiple of the second unit time, from the logs for that third unit time.

Table 2

| Third unit time | Log sequence for the third unit time | Log sequence vector for the third unit time |
| --- | --- | --- |
| NEW_T1 | [[E1, E1, E2, E3], [E1, E3, E2, E3]] | [[2, 1, 1, 0], [1, 1, 2, 0]] |
| NEW_T2 | [[E1, E3, E2, E3], [E1, E4, E4, E4]] | [[1, 1, 2, 0], [1, 0, 0, 3]] |

Table 2 illustrates how the sequence generation unit 250 generates log sequences for the third unit time based on the temporal order in which the first-unit-time logs were collected and the clusters classified by the cluster classification unit 230. For convenience of explanation, the log sequences and log sequence vectors in Table 2 are derived from Table 1, and 8 seconds, twice the second unit time, is taken as the third unit time.

After first obtaining the information in Table 1, the sequence generation unit 250 determines a third unit time that is a positive integer multiple of the 4-second second unit time in order to generate the input sequence for the LSTM encoder-decoder module. In Table 2 the third unit time is 8 seconds, twice the second unit time, but depending on the embodiment the sequence generation unit 250 may choose any positive integer as the multiplier.

After determining the third unit time, the sequence generation unit 250 generates the log sequence for the third unit time. Because the third unit time is 8 seconds, each log sequence for the third unit time is twice as long as a log sequence for the second unit time in Table 1, and, owing to the nature of the input sequences fed to the LSTM encoder-decoder module, consecutive log sequences for the third unit time share overlapping log events. For example, the log events in the second half of the log sequence for the first third unit time (NEW_T1) and those in the first half of the log sequence for the second third unit time (NEW_T2) are both E1, E3, E2, E3. Table 2 uses twice the second unit time for convenience of explanation, but as noted above the multiplier may be any positive integer other than 2 depending on the embodiment. The sequence generation unit 250 may further process the log sequence for the third unit time into a log sequence vector.
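The overlapping third-unit-time sequences of Table 2 can be sketched as a sliding window over the second-unit-time vectors. The stride of one second unit time is inferred from the overlap shown in Table 2, and the function name `sliding_windows` and its parameters are assumptions for illustration.

```python
def sliding_windows(sequences, multiple=2, stride=1):
    """Group consecutive second-unit-time entries into overlapping third-unit-time entries."""
    last_start = len(sequences) - multiple + 1
    return [sequences[i:i + multiple] for i in range(0, last_start, stride)]

# Log sequence vectors for T1, T2 and T3 from Table 1.
second_unit_vectors = [[2, 1, 1, 0], [1, 1, 2, 0], [1, 0, 0, 3]]
third_unit_vectors = sliding_windows(second_unit_vectors, multiple=2)
```

With a multiple of 2 and a stride of 1, the vector for T2 appears at the end of the first window and at the start of the second, matching NEW_T1 and NEW_T2.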

Next, the anomaly diagnosis unit 270 accumulates the log sequences (vectors) for the third unit time to generate a log sequence matrix and applies it to a module that performs a machine learning algorithm of the LSTM encoder-decoder type in order to diagnose the occurrence of an anomaly in the mobile communication data processing system 100.

Equation 7 shows an example of the log sequence matrix that the anomaly diagnosis unit 270 generates by accumulating the log sequence vectors for the third unit time. In Equation 7, E₃ is the log sequence matrix for the third unit time; more specifically, E₃ is the log sequence matrix for the 16 seconds obtained by accumulating the 8-second third unit time of Table 2 twice. After generating a log sequence matrix such as that of Equation 7, the anomaly diagnosis unit 270 may additionally calculate the TF and IDF weights described above, apply them to E₃, and then feed the result to the LSTM encoder-decoder module as input data.

As an optional embodiment, the anomaly diagnosis unit 270 may diagnose anomalous logs of the mobile communication data processing system 100 based on the results of training both the autoencoder and the LSTM encoder-decoder module on the log sequence matrix for the third unit time with the TF and IDF weights applied. This optional embodiment exploits the fact that the autoencoder and the LSTM encoder-decoder are both unsupervised learning models that use an encoder network and a decoder network to learn a latent representation of the data.

The encoder included in each of the two models compresses the existing data to extract features. The extracted features are fed to the decoder network, which resides in the hidden and output layers and is trained to reproduce the input data originally fed to the model. As described above, when abnormal data that differs from the training data is input, the decoder network yields a relatively high reconstruction error value while reconstructing the output data. The abnormality diagnosis unit 270 can therefore diagnose that an anomaly log has been output from the mobile communication data processing system 100 according to whether the reconstruction error value exceeds a preset threshold.
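The threshold test on the reconstruction error can be sketched as follows. This is not an autoencoder or an LSTM encoder-decoder: the trained model is replaced by a deliberately trivial stand-in that always returns the mean of the training rows, so that only the error computation and the threshold comparison are shown. The function names, the use of mean squared error, and the threshold value are assumptions made for this illustration.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between an input vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

def is_anomalous(x, reconstruct, threshold):
    """Flag x when its reconstruction error exceeds the preset threshold."""
    return reconstruction_error(x, reconstruct(x)) > threshold

def make_mean_reconstructor(training_rows):
    """Stand-in for a trained model: always 'reconstructs' the training mean."""
    n = len(training_rows)
    mean = [sum(col) / n for col in zip(*training_rows)]
    return lambda x: mean

# Hypothetical rows representing normal behaviour.
normal_rows = [[1.0, 0.0, 1.0], [1.0, 0.2, 0.8], [0.8, 0.0, 1.2]]
reconstruct = make_mean_reconstructor(normal_rows)
threshold = 0.1  # assumed value; in practice chosen from errors on normal data

normal_flag = is_anomalous([1.0, 0.1, 1.0], reconstruct, threshold)
anomaly_flag = is_anomalous([0.0, 3.0, 0.0], reconstruct, threshold)
```

An input close to the training data reconstructs with a small error and is not flagged, whereas an input far from anything seen during training produces a large error and is flagged.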

In this optional embodiment, the log sequence matrix for the third unit time is applied as input data to both the autoencoder and the LSTM encoder-decoder module. In another embodiment, the log sequence matrix for the second unit time and the log sequence matrix for the third unit time may each be input as input data of the autoencoder, and the abnormality diagnosis unit 270 may diagnose the occurrence of an abnormal log in the mobile communication data processing system 100 from the results. The specific procedure is the same as the method described above and is therefore omitted.

As described above, the present invention provides a method of optimally preprocessing the raw logs periodically output from the mobile communication data processing system 100, which processes massive big-stream data, so that the raw logs can be analyzed with deep learning techniques. Consequently, when an abnormality occurs in the mobile communication data processing system 100, the abnormality can be diagnosed without manual log analysis by an administrator.

FIG. 6 is a flowchart illustrating an example of a method for diagnosing abnormal log generation in a mobile communication data processing system according to the present invention.

The method of FIG. 6 can be implemented by the system 200 of FIG. 2 for diagnosing abnormal log generation in the mobile communication data processing system 100. It is therefore described below with reference to FIG. 2, and descriptions that overlap those given for FIGS. 2 to 5 are omitted.

The second log generator 210 accumulates the logs collected every first unit time to generate logs for the second unit time (S610).

The cluster classifier 230 classifies the logs for the second unit time into at least one cluster based on their similarity (S620).

The sequence generator 250 generates a log sequence for the second unit time based on the temporal order in which the logs were collected and on the clusters (S630).

The abnormality diagnosis unit 270 accumulates the log sequences according to a preset scheme to generate a log sequence matrix (S640).

The abnormality diagnosis unit 270 applies, to the log sequence matrix, a weight calculated based on log events (S650). Step S650 may be omitted depending on the embodiment.

The abnormality diagnosis unit 270 applies the log sequence matrix to the machine learning algorithm module (S660). If step S650 is not omitted, the weighted log sequence matrix is applied to the machine learning algorithm module in step S660. As already described, the machine learning algorithm module may be an autoencoder module or an LSTM encoder-decoder module.

The abnormality diagnosis unit 270 determines whether the machine learning algorithm module has finished learning the input data (S670). If learning is complete, the abnormality diagnosis unit 270 processes the new logs collected every first unit time in a new interval into a log sequence matrix, inputs the matrix to the trained machine learning algorithm module, and determines whether the reconstruction error value output by the module exceeds a threshold value, thereby diagnosing whether an abnormal log has been output from the mobile communication data processing system (S680). In step S680, the new logs collected every first unit time are processed into a log sequence matrix in accordance with steps S610 to S650 described above.
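The preprocessing in steps S610 to S630 can be sketched as follows. This is a simplified illustration, not the claimed method: logs are assumed to be `(timestamp_seconds, message)` pairs, and clustering by similarity is replaced by exact matching of templates in which numeric variable values are masked with a placeholder. The function names, the `<*>` placeholder, and the sample log lines are hypothetical.

```python
import re
from collections import defaultdict

def bucket_logs(logs, window_seconds):
    """S610 (sketch): group time-ordered logs into consecutive windows.

    Windows that contain no logs are skipped in this simplified version.
    """
    buckets = defaultdict(list)
    for ts, msg in sorted(logs, key=lambda item: item[0]):
        buckets[ts // window_seconds].append(msg)
    return [buckets[k] for k in sorted(buckets)]

def mask_variables(message):
    """Replace numeric variable values with a fixed placeholder."""
    return re.sub(r"\d+", "<*>", message)

def assign_clusters(windows):
    """S620/S630 (sketch): map each log to a cluster ID, keeping temporal order."""
    cluster_ids = {}
    sequences = []
    for window in windows:
        seq = []
        for msg in window:
            template = mask_variables(msg)
            if template not in cluster_ids:
                cluster_ids[template] = len(cluster_ids) + 1
            seq.append(cluster_ids[template])
        sequences.append(seq)
    return sequences, cluster_ids

logs = [
    (0, "conn 17 opened"),
    (1, "read 512 bytes"),
    (3, "conn 18 opened"),
    (4, "read 64 bytes"),
    (5, "disk error 5"),
]
windows = bucket_logs(logs, window_seconds=4)
sequences, cluster_ids = assign_clusters(windows)
```

Each inner list of `sequences` is the time-ordered list of cluster IDs for one window. Turning these sequences into fixed-length rows of a log sequence matrix (S640) and weighting them (S650) is left out of this sketch.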

The embodiments of the present invention described above can be implemented in the form of a computer program that can be executed on a computer through various components, and such a computer program can be recorded on a computer-readable medium. The medium may include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

The computer program may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of computer programs include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The specific implementations described in the present invention are examples and do not limit the scope of the invention in any way. For brevity, descriptions of conventional electronic components, control systems, software, and other functional aspects of such systems may be omitted. In addition, the connecting lines or connecting members between the components shown in the drawings illustrate functional connections and/or physical or circuit connections by way of example; in an actual device they may be replaced or supplemented by various functional, physical, or circuit connections. Furthermore, unless a component is specifically described with terms such as "essential" or "important", it may not be a necessary component for applying the present invention.

In the specification of the present invention (particularly in the claims), the term "the" and similar referring terms may cover both the singular and the plural. When a range is described in the present invention, the invention includes embodiments applying each individual value within that range (unless stated otherwise), as if each individual value constituting the range were set out in the detailed description. Finally, unless an order is explicitly stated for the steps constituting the method according to the present invention, or there is a statement to the contrary, the steps may be performed in any suitable order; the present invention is not necessarily limited by the order in which the steps are described. The use of any examples or exemplary terms (for example, "etc.") in the present invention is merely intended to describe the invention in detail, and the scope of the invention is not limited by such examples or exemplary terms unless limited by the claims. Those skilled in the art will also appreciate that various modifications, combinations, and changes may be made according to design conditions and factors within the scope of the appended claims or their equivalents.

Claims

A method for diagnosing abnormal log generation in a mobile communication data processing system, the method comprising:
a second log generation step of accumulating logs collected every first unit time to collect logs for a second unit time;
a cluster classification step of classifying the logs for the second unit time into at least one cluster based on the similarity between the logs;
a sequence generation step of generating a log sequence for the second unit time based on the temporal order in which the logs are collected and the classified clusters; and
an abnormality diagnosis step of accumulating the log sequence to generate a log sequence matrix and diagnosing the occurrence of an abnormality in the mobile communication data processing system by applying the generated log sequence matrix to a module that performs a machine learning algorithm.

The method according to claim 1,
wherein the log collected every first unit time includes a fixed value and a variable value, and
the cluster classification step comprises
classifying the logs for the second unit time into at least one cluster based on the similarity between the logs in which the variable value has been replaced with a specific value.

The method according to claim 1,
wherein the cluster classification step comprises
aligning the character strings of the logs classified into the cluster based on the Smith-Waterman algorithm.

The method according to claim 1,
wherein the abnormality diagnosis step comprises
extracting a representative pattern from the logs belonging to the classified cluster, and applying, to the generated log sequence matrix, a weight generated based on the frequency of appearance of the extracted representative pattern in the generated log sequence.

5. The method of claim 4,
wherein the frequency of appearance
is a value calculated based on a term frequency (TF) and an inverse document frequency (IDF).

The method according to claim 1,
Wherein the second unit time is a positive integral multiple of the first unit time.

The method according to claim 1,
wherein the sequence generation step comprises
generating, from logs for a third unit time that is a positive integer multiple of the second unit time, a log sequence for the third unit time based on the temporal order in which the logs are collected and the classified clusters, and
the abnormality diagnosis step comprises
accumulating the log sequence for the third unit time to generate a log sequence matrix, and diagnosing the occurrence of an abnormality in the mobile communication data processing system by applying the generated log sequence matrix to a module that performs a machine learning algorithm in a long short-term memory (LSTM) encoder-decoder scheme.

A computer-readable recording medium storing a program for executing the method according to any one of claims 1 to 7.

A system for diagnosing abnormal log generation in a mobile communication data processing system, the system comprising:
a second log generation unit for accumulating logs collected every first unit time to generate logs for a second unit time;
a cluster classifier for classifying the logs for the second unit time into at least one cluster based on the similarity between the logs;
a sequence generator for generating a log sequence for the second unit time based on the temporal order in which the logs are collected and the classified clusters; and
an abnormality diagnosis unit for accumulating the log sequence to generate a log sequence matrix and diagnosing the occurrence of an abnormality in the mobile communication data processing system by applying the generated log sequence matrix to a module that performs a machine learning algorithm.

10. The system of claim 9,
wherein the log collected every first unit time includes a fixed value and a variable value, and
the cluster classifier
classifies the logs for the second unit time into at least one cluster based on the similarity between the logs in which the variable value has been replaced with a specific value.

10. The system of claim 9,
wherein the cluster classifier
aligns the character strings of the logs classified into the cluster based on the Smith-Waterman algorithm.

10. The system of claim 9,
wherein the abnormality diagnosis unit
extracts a representative pattern from the logs belonging to the classified cluster, and applies, to the generated log sequence matrix, a weight generated based on the frequency of appearance of the extracted representative pattern in the generated log sequence.

13. The system of claim 12,
wherein the frequency of appearance
is a value calculated based on a term frequency (TF) and an inverse document frequency (IDF).

10. The system of claim 9,
Wherein the second unit time is a positive integral multiple of the first unit time.

10. The system of claim 9,
wherein the sequence generator
generates, from logs for a third unit time that is a positive integer multiple of the second unit time, a log sequence for the third unit time based on the temporal order in which the logs are collected and the classified clusters, and
the abnormality diagnosis unit
accumulates the log sequence for the third unit time to generate a log sequence matrix, and diagnoses the occurrence of an abnormality in the mobile communication data processing system by applying the generated log sequence matrix to a module that performs a machine learning algorithm in a long short-term memory (LSTM) encoder-decoder scheme.