KR20230166726A

KR20230166726A - Apparatus for processing log data and method thereof

Info

Publication number: KR20230166726A
Application number: KR1020220067076A
Authority: KR
Inventors: 김수정; 조종윤; 김성일
Original assignee: 삼성에스디에스 주식회사
Priority date: 2022-05-31
Filing date: 2022-05-31
Publication date: 2023-12-07

Abstract

This disclosure provides a log data management method and system. The log data management method according to the present disclosure is performed by a computing system and may include: masking some of the tokens of log data that includes a plurality of log sequences; sorting the masked log sequences; and clustering the log sequences using the result of sorting the log sequences.

Description

Log data management method and system {APPARATUS FOR PROCESSING LOG DATA AND METHOD THEREOF}

The present disclosure relates to a log data management method and system. More specifically, it relates to a method and system that group similar patterns in given log data into clusters and determine which cluster newly input log data belongs to.

Computer systems and various automated manufacturing facilities record their operating state and processing status as log data while they run. This log data can be used for many purposes, such as checking for system anomalies, extracting statistics, and optimization work.

To that end, the log data must first go through preprocessing that extracts the desired data. Most preprocessing requires writing separate extraction software for each type of log data, which takes considerable time and effort.

However, log data with no standardized format can take various forms depending on the system, manufacturer, software version, and so on, and the log format can also vary with user settings. As a result, to process diverse log data, users must understand the format of each target log and individually implement matching data processing logic.

Accordingly, there is a growing need for a method that clusters log data without manual user intervention, thereby reducing the cost inefficiency of log data processing and speeding up log data preprocessing.

A technical problem to be solved by some embodiments of the present disclosure is to provide a method for generating a log data clustering model that is efficient in terms of computing resource usage, and a method for classifying log data using that log data classification model.

Another technical problem to be solved by some embodiments of the present disclosure is to provide a log data management method and system that, when determining which type newly input log data belongs to, performs the cluster search quickly with a linear algorithm, without having to compute the distance to every cluster.

Yet another technical problem to be solved by some embodiments of the present disclosure is to reduce time and storage costs when the log data management system clusters a new log data sequence, by saving as many lookups as there are deleted child nodes and by freeing memory.

Yet another technical problem to be solved by some embodiments of the present disclosure is to provide a log data management method and system that can quickly reflect the user's intent and process stream data arriving in real time without delay.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

To solve the above technical problems, a log data management method according to an embodiment of the present disclosure may include: masking some of the tokens of log data that includes a plurality of log sequences; sorting the masked log sequences; and clustering the log sequences using the result of sorting the log sequences.

The masking may include, for each of the plurality of log sequences, identifying a masking target token and replacing the masking target token with a masking expression corresponding to that token.

Replacing the masking target token with the corresponding masking expression may include determining the type of the masking target token and replacing the token with a masking expression corresponding to that type.
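As a rough illustration of type-based masking, the following Python sketch replaces each masking target token with an expression chosen by the token's type. The token types, the regular expressions, and the mask strings (`<IP>`, `<HEX>`, `<NUM>`) are assumptions made for this example; the disclosure does not enumerate them.

```python
import re

# Hypothetical token types and mask expressions (not specified in the source).
# Order matters: more specific patterns are tried first.
MASK_RULES = [
    ("<IP>", re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")),
    ("<HEX>", re.compile(r"0[xX][0-9a-fA-F]+")),
    ("<NUM>", re.compile(r"\d+")),
]

def mask_token(token: str) -> str:
    """Return the mask expression for the token's type, or the token unchanged."""
    for mask, pattern in MASK_RULES:
        if pattern.fullmatch(token):
            return mask
    return token

def mask_sequence(tokens: list[str]) -> list[str]:
    """Mask every masking target token in one tokenized log sequence."""
    return [mask_token(t) for t in tokens]

masked = mask_sequence(["Connected", "to", "10.0.0.7", "port", "2181", "id", "0x1f"])
print(masked)
```

Masking variable values this way means that two log sequences differing only in, say, a port number become identical token lists before sorting.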

The log data management method may further include generating a clustering model using the result of the clustering.

Generating the clustering model may include generating a trie data structure in which each token is a node, and each node may have a per-cluster reference count as an attribute value.

Generating the trie data structure may include determining a representative log sequence for each cluster and generating the trie data structure using the representative log sequences.

Generating the trie data structure may include, when two or more consecutive nodes in the direction from the root node to a leaf node each have a single child node, merging those consecutive nodes.

Generating the trie data structure may include deleting all child nodes of any node whose reference count is '1'.
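The trie construction and pruning described above can be sketched in Python as follows. This is one possible reading, not the disclosed implementation: the `Node` class, the `count` and `clusters` fields, and the sample representative sequences are hypothetical, and the reference count is interpreted here as the number of representative sequences that pass through a node. Merging of single-child chains is not shown.

```python
class Node:
    def __init__(self):
        self.children = {}     # token -> Node
        self.count = 0         # how many representative sequences pass through
        self.clusters = set()  # ids of the clusters that reference this node

def build_trie(representatives):
    """representatives: dict of cluster id -> representative token list."""
    root = Node()
    for cluster_id, tokens in representatives.items():
        node = root
        for token in tokens:
            node = node.children.setdefault(token, Node())
            node.count += 1
            node.clusters.add(cluster_id)
    return root

def prune(node):
    """Delete all children of any node whose reference count is 1."""
    for child in node.children.values():
        if child.count == 1:
            child.children.clear()  # the cluster is already unambiguous here
        else:
            prune(child)

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())

reps = {
    "C1": ["INFO", "session", "<NUM>", "closed"],
    "C2": ["INFO", "session", "<NUM>", "opened"],
    "C3": ["WARN", "disk", "<NUM>", "full"],
}
root = build_trie(reps)
before = count_nodes(root)
prune(root)
after = count_nodes(root)
print(before, after)
```

In this example only cluster C3 starts with `WARN`, so everything below the `WARN` node can be dropped without losing the ability to tell the clusters apart.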

Generating the clustering model may include automatically performing an action predefined for each cluster in response to the cluster of a log sequence being identified using the generated clustering model.

The clustering may include computing a difference value between adjacent log sequences in the sorted result, and clustering the log sequences according to whether the difference value exceeds a reference value.

Computing the difference value may include, for an adjacent first log sequence and second log sequence, identifying the first offset at which the two sequences have different tokens, and determining that offset as the difference value between the first log sequence and the second log sequence.
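A minimal Python sketch of this sort-then-compare clustering follows. The function names and sample data are hypothetical. The source only says that clustering depends on whether the difference value exceeds the reference value; the rule used here, that adjacent sequences stay in the same cluster when the first differing offset is greater than the threshold, is an assumption.

```python
def first_diff_offset(a, b):
    """Index of the first position at which two token sequences differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))  # one sequence is a prefix of the other

def cluster_sorted(sequences, threshold):
    """Sort the sequences, then compare each one only with its predecessor."""
    clusters = []
    for seq in sorted(sequences):  # lexicographic order over token lists
        if clusters:
            prev = clusters[-1][-1]
            # Assumed rule: a late first difference means a long shared prefix.
            if seq == prev or first_diff_offset(prev, seq) > threshold:
                clusters[-1].append(seq)
                continue
        clusters.append([seq])
    return clusters

seqs = [
    ["INFO", "session", "<NUM>", "closed"],
    ["WARN", "disk", "<NUM>", "full"],
    ["INFO", "session", "<NUM>", "opened"],
    ["WARN", "disk", "<NUM>", "slow"],
    ["INFO", "server", "started"],
]
clusters = cluster_sorted(seqs, threshold=2)
print([len(c) for c in clusters])
```

Because each sequence is compared only with its neighbor in sorted order, the comparison pass is linear in the number of sequences, with no pairwise distance matrix.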

The reference value may be a value determined by user input to a user interface that includes the sorted result of the log sequences and a reference value change control.

When a user input that changes the reference value is applied through the reference value change control, the user interface may display the changed clustering result.

When a user input that changes the reference value is applied through the reference value change control, the user interface may display the changed clustering result by overlaying the region of each cluster on the sorted result of the log sequences.

A log data management method according to another embodiment of the present disclosure may include masking some of the tokens of a log sequence, and identifying the cluster of the log sequence by inputting the masked log sequence into a clustering model in the form of a trie data structure in which each token is a node.

The masking may include, for each of the plurality of log sequences, identifying a masking target token, and replacing the masking target token with a masking expression corresponding to the type of the masking target token.

Each node of the trie data structure has a cluster reference count. Identifying the cluster of the log sequence may include traversing nodes while matching the tokens of the masked log sequence against the nodes of the clustering model, ending the traversal upon reaching a node whose reference count is 1, and identifying the cluster indicated by that node as the cluster of the masked log sequence.
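The traversal with early termination might look like the following Python sketch. The dictionary-based node layout, the hand-built trie, and the cluster ids are hypothetical, and returning `None` for an unmatched token is an assumption, since the source does not say how a miss is handled.

```python
def make_node(count, cluster=None, children=None):
    """Node layout assumed for this sketch: reference count, cluster, children."""
    return {"count": count, "cluster": cluster, "children": children or {}}

# Hand-built, already pruned example trie (hypothetical data).
TRIE = make_node(0, children={
    "WARN": make_node(1, "C3"),
    "INFO": make_node(2, children={
        "session": make_node(2, children={
            "<NUM>": make_node(2, children={
                "closed": make_node(1, "C1"),
                "opened": make_node(1, "C2"),
            }),
        }),
    }),
})

def identify_cluster(trie, tokens):
    """Match tokens node by node; stop at the first node whose count is 1."""
    node = trie
    for token in tokens:
        node = node["children"].get(token)
        if node is None:
            return None  # no matching cluster (handling is an assumption)
        if node["count"] == 1:
            return node["cluster"]
    return None

print(identify_cluster(TRIE, ["WARN", "disk", "<NUM>", "full"]))
print(identify_cluster(TRIE, ["INFO", "session", "<NUM>", "opened"]))
```

The first lookup stops after a single token because only one cluster passes through `WARN`; the cost of a lookup is bounded by the sequence length rather than by the number of clusters.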

The log data management method may further include automatically performing an action predefined for each cluster in response to the cluster of the log sequence being identified.

A log data management system according to yet another embodiment of the present disclosure for solving the above technical problems may include a network interface that receives log data including a plurality of log sequences, a memory into which a clustering modeling program is loaded, and one or more processors that execute the clustering modeling program loaded into the memory. The clustering modeling program may include an instruction for masking some of the tokens of each log sequence included in the log data, an instruction for sorting the masked log sequences, and an instruction for clustering the log sequences using the result of sorting the log sequences.

A log data management system according to yet another embodiment of the present disclosure for solving the above technical problems may include a network interface that receives a log sequence, a memory into which a clustering program is loaded, and one or more processors that execute the clustering program loaded into the memory. The clustering program may include an instruction for masking some of the tokens of the log sequence; an instruction for identifying the cluster of the log sequence by inputting the masked log sequence into a clustering model in the form of a trie data structure in which each token is a node; and an instruction for performing at least one of outputting the identified cluster of the log sequence and attaching information about the identified cluster to the log sequence.

FIG. 1 illustrates an example environment to which a log data processing and classification system according to some embodiments of the present disclosure may be applied.
FIG. 2 is a block diagram of a log data processing device according to an embodiment of the present disclosure.
FIG. 3 is a flowchart of a cluster model generation method according to another embodiment of the present disclosure.
FIG. 4 is a diagram illustrating log data sequence boundary designation that may be performed in some embodiments of the present disclosure.
FIG. 5 is a diagram illustrating a method of tokenizing and masking log data that may be performed in some embodiments of the present disclosure.
FIG. 6 is a flowchart explaining in detail some of the operations described with reference to FIG. 1.
FIG. 7 is a diagram illustrating a user interface, including a screen that displays the sorted result of log sequences and a reference value change control, that may be displayed in some embodiments of the present disclosure.
FIG. 8 is a diagram illustrating a method of clustering a plurality of log data based on a designated reference value that may be performed in some embodiments of the present disclosure.
FIG. 9 is a diagram illustrating log data sequences clustered through linear comparison, which may be displayed in some embodiments of the present disclosure.
FIG. 10 is a diagram illustrating separated log data clusters that may be displayed in some embodiments of the present disclosure.
FIG. 11 is a flowchart explaining in detail some of the operations described with reference to FIG. 1.
FIG. 12 is a flowchart explaining in detail some of the operations described with reference to FIG. 11.
FIG. 13 is an example trie data structure that may be generated as a cluster model as a result of performing some embodiments of the present disclosure.
FIG. 14 is a flowchart of a new log data classification method according to yet another embodiment of the present disclosure.
FIG. 15 is a diagram illustrating a trie data structure to explain a method of searching for a cluster similar to newly inserted log data.
FIG. 16 is a hardware configuration diagram of a computing system that may be used as a component in some embodiments of the present disclosure.

Hereinafter, preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the technical idea of the present invention is not limited to the following embodiments and may be implemented in various different forms. The following embodiments are provided only to make the technical idea of the present invention complete and to fully inform those of ordinary skill in the art to which the present invention pertains of the scope of the present invention, and the technical idea of the present invention is defined only by the scope of the claims.

In describing the present disclosure, a detailed description of a related known configuration or function is omitted when it is judged that it could obscure the gist of the present invention.

Hereinafter, several embodiments of the present disclosure are described with reference to the drawings.

FIG. 1 illustrates an example environment to which a log data management system according to some embodiments of the present disclosure may be applied. FIG. 1 shows one log data output system 200 connected to the network, but this is only for ease of understanding, and the number of log data output systems 200 may vary freely.

FIG. 1 only shows a preferred embodiment for achieving the purpose of the present disclosure, and some components may be added or removed as needed. Note also that the components of the example environment shown in FIG. 1 represent functionally distinct elements, and that a plurality of components may be implemented in an integrated form in an actual physical environment.

For example, the log data management system 100 and the log data output system 200 may be implemented as different logic within the same computing device, so that log data generated by the log data output system 200 is processed by the log data management system 100 within that same computing device. Each component shown in FIG. 1 is described in more detail below.

The log data management system 100 may obtain a plurality of log data generated by the log data output system 200 and cluster the obtained log data. The log data management system 100 may also generate a log data clustering model using the clustering result of the plurality of log data. As shown in FIG. 2, these operations may be performed by the log data clustering model generator 110 included in the log data management system 100.

In addition, the log data management system 100 may use the log data clustering model to extract attribute values of the log data included in target log data. As shown in FIG. 2, this operation may be performed by the log data clustering unit 120 included in the log data management system 100.

The log data management system 100 may be implemented with one or more computing devices. For example, all functions of the log data management system 100 may be implemented on a single computing device. As another example, a first function of the log data management system 100 may be implemented on a first computing device and a second function on a second computing device. Here, at least one of the first computing device and the second computing device may be a cloud server.

The computing device may be a notebook computer, a desktop, a laptop, or the like, but is not limited to these and may include any kind of device equipped with a computing function. However, in an environment where the log data processing device 100 must cluster log data in conjunction with various log data output systems 200, it may be preferable to implement the log data management system 100 as a high-performance, server-class computing device. An example of a computing device is described later with reference to FIG. 16.

Meanwhile, each of the components 110, 120, and 130 shown in FIG. 2 may be software, or hardware such as an FPGA (Field Programmable Gate Array) or an ASIC (Application-Specific Integrated Circuit). However, the components are not limited to software or hardware; they may be configured to reside in an addressable storage medium and may be configured to run on one or more processors. The functions provided within the components may be implemented by more finely divided components, or a plurality of components may be combined into a single component that performs a specific function.

The specific method by which the log data management system 100 clusters log data is described in detail later with reference to FIG. 3 and the subsequent drawings.

Next, the log data output system 200 may perform various operations and generate log data. For example, the log data output system 200 may be an IoT (Internet of Things) device or a device included in a manufacturing facility. However, the scope of the present disclosure is not limited to these examples; the log data output system 200 may be anything implemented as a computing device that generates log data by recording data about the actions it performs and the events that occur.

Next, the user terminal 300 may load the log data sequence clustering model generated by the log data management system 100. The user terminal 300 may also load the attribute values of the log data extracted by the log data management system 100. The user terminal 300 may have a web browser or a dedicated application installed in order to display the data loaded from the log data management system 100 to the user. For example, the user terminal 300 may be any one of a desktop, a workstation, a laptop, a tablet, and a smartphone, but is not limited to these examples and may include any kind of device equipped with a computing function.

In some embodiments, the log data management system 100, the log data output system 200, and the user terminal 300 may communicate over a network. The network may be implemented as any kind of wired or wireless network, such as a local area network (LAN), a wide area network (WAN), a mobile radio communication network, or WiBro (Wireless Broadband Internet).

So far, the configuration and operation of the log data processing device 100 according to some embodiments of the present disclosure have been described with reference to FIGS. 1 and 2. Below, methods according to various embodiments of the present disclosure are described in detail with reference to FIGS. 3 to 10.

Each step of the methods according to some embodiments of the present disclosure may be performed by a computing device. In other words, each step of the methods may be implemented as one or more instructions executed by a processor of a computing device. All steps included in the methods may be executed by a single physical computing device, or first steps of a method may be performed by a first computing device and second steps of the method by a second computing device. The description below assumes that each step of the methods is performed by the log data processing device 100 illustrated in FIG. 1. For convenience of explanation, the subject performing each step may be omitted.

In step S100 of FIG. 3, the log data management system may designate the boundaries of log data sequences based on a common pattern. This step is described in more detail with reference to FIG. 4. Here, a log data sequence may be the log data that a log data output system outputs while performing a specific operation. For example, if the log data output system outputs one piece of log data consisting of two lines while performing a first operation, that log data is one sequence.

Log data generated by a single log data output system may share one common pattern across all log data sequences. For example, all of the log data in FIG. 4 follow the same pattern: 'log data timestamp - level (INFO/WARN) [message 1] - message 2'.

The boundaries of the log data need to be designated in order to make explicit that these are distinct log data sequences output by different operations. For example, the log data in FIG. 4 includes two-line entries as well as single-line entries, so log data that spans different lines may still be a single log data sequence.

As with the first log data (5) in FIG. 4, a line break symbol '\n' is inserted at the point where each log data ends. Accordingly, the boundaries of the log data sequences in FIG. 4 may be designated based on '\n'.

The sequence boundary designation is not limited to '\n'. Another sequence boundary delimiter may be specified through configuration, and logic that automatically separates adjacent log data sequences based on the configured delimiter may be executed.
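Delimiter-based boundary designation can be sketched in a few lines of Python. The function name, the sample log text, and the alternative delimiter `"\x1e"` are assumptions made for this example; the source only states that '\n' or another configured delimiter marks the end of a log data sequence.

```python
def split_sequences(raw: str, delimiter: str = "\n") -> list[str]:
    """Split raw log text into log data sequences at each delimiter.

    Empty fragments (for example after a trailing delimiter) are dropped.
    """
    return [part for part in raw.split(delimiter) if part.strip()]

# Default: every '\n' ends one log data sequence.
raw = "t1 - INFO [main] - started\nt2 - WARN [main] - slow response\n"
by_newline = split_sequences(raw)

# Configured delimiter: a two-line entry stays together as one sequence.
raw_multi = "t3 - WARN [main] - failure\n  at step 4\x1et4 - INFO [main] - ok"
by_custom = split_sequences(raw_multi, delimiter="\x1e")

print(len(by_newline), len(by_custom))
```

With a configured delimiter, an entry whose text spans two physical lines is kept as a single sequence, which matches the two-line case described for FIG. 4.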

경우에 따라서, 로그 데이터 시퀀스의 경계를 사용자가 선택함으로써, 로그 데이터 시퀀스가 나누어질 수 있다. 이 경우, 사용자가 서로 다른 로그 데이터를 구분하는 기준을 제공함으로써, 클러스터링에 이용될 로그 데이터 시퀀스들이 정확하게 구분될 수 있다.In some cases, the log data sequence may be divided by the user selecting the boundary of the log data sequence. In this case, by providing a standard for distinguishing different log data, the log data sequences to be used for clustering can be accurately distinguished.

Hereinafter, a case in which the log data management system performs log data clustering without designating log data sequence boundaries will be described. In this case, a plurality of different log data may be recognized as the same log data sequence, which can lower clustering accuracy and may also delay clustering.

According to this embodiment, the log data management system identifies the exact boundaries between log data, indicates that the log data before a boundary and the log data after it are different log data sequences, and clusters the plurality of log data sequences on that basis. Clustering accuracy can thereby be improved, and clustering time may also be reduced.

Next, in step S200, the log data management system may separate each log sequence into a plurality of tokens. A token may be a space, a special character (!, @, #, etc.), a number, a string of alphabetic characters, or the like. Tokenization of individual log sequences is described in more detail with reference to FIG. 5.

For example, referring to FIG. 5, the string '2015-07-20 - INFO [main:Quorum]' may be separated into 2015, -, 07, -, 20, (space), -, (space), INFO, (space), [, main, :, Quorum, ], and so on. As shown, the log data management system does not split a meaningful run of characters, such as '2015', from which the year of occurrence can be inferred, into individual digits. In addition, special characters such as '-' and spaces may each be separated into their own tokens.
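A minimal sketch of this kind of tokenization is shown below. The regular expression and the name `tokenize` are assumptions chosen for illustration; the disclosure does not specify how the tokenizer is implemented.

```python
import re

# One token per run of digits, per run of ASCII letters, per single
# whitespace character, or per single remaining (special) character.
# This pattern is an illustrative assumption, not the disclosed method.
TOKEN_RE = re.compile(r"\d+|[A-Za-z]+|\s|.", re.DOTALL)


def tokenize(sequence: str) -> list[str]:
    """Split one log data sequence into tokens without losing characters."""
    return TOKEN_RE.findall(sequence)


tokens = tokenize("2015-07-20 - INFO [main:Quorum]")
```

Because every character ends up in exactly one token, joining the tokens reproduces the original sequence, which keeps the original recoverable after later processing.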

If the log data management system clusters log data sequences without tokenization, meaningful strings must be recognized within the full log data string every time clustering is performed, which may delay clustering.

According to this embodiment, when separating individual log sequences, the log data management system distinguishes strings that are meaningless as individual tokens, such as spaces and special characters, from strings that carry meaningful information, such as the year and date on which the log data occurred and the log data classification. The execution time of log data sequence clustering may thereby be reduced.

Next, in step S300, the log data management system may mask the individual tokens of the log sequence. The masking step is described in more detail with reference to FIG. 4 or FIG. 5. Here, masking means replacing tokens with one consistent expression when the strings they contain are of the same type, even if the strings themselves differ.

For example, referring to FIG. 5, when the string '2015-07-20 - INFO [main:Quorum]' has been separated into tokens such as 2015, -, 07, -, 20, (space), -, (space), INFO, (space), [, main, :, Quorum, ], the '2015', '07', and '20' tokens contain different strings. However, because all of these tokens consist of digits, each may be replaced with the same '\d+'.

In some embodiments related to step S300, the log data management system may mask those token values that do not play a major role in the clustering process and could lead to incorrect results.

For example, referring to FIG. 4, the log data sequence '2015-07-29 17:41:41,728 - INFO [WorkerReceiver[myid=1]:FastLeaderElection@542] - Notification: 1 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)' belongs to the INFO classification, while the log data sequence '2015-07-29 17:41:41,932 - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@368] - Cannot open channel to 2 at election address /10.10.34.12:3888' belongs to the WARN classification. Nevertheless, if such values are left unmasked, the two sequences could be clustered into the same log data cluster merely because both occurred in 2015.

If incorrect clustering is performed as described above, the user must later correct each incorrect result by hand, which makes the log data sequence clustering method inefficient. According to this embodiment, the log data management system may avoid the error of clustering logs of different types or classifications into the same cluster because they share the same date, time, or other similar numerical values.

In some other embodiments related to step S300, the log data management system may replace every token consisting only of digits with '\d+'. Here, the original and masked data may be stored separately and processed so that they reference each other. For example, referring to FIG. 5, the string 2015 (10) in the original may be replaced with '\d+'.

According to this embodiment, the log data management system may avoid the error in which, because of individual tokens consisting of digits, one or more log data sequences of a different type are clustered together with the log data belonging to a specific cluster.

In still other embodiments related to step S300, a token consisting of non-numeric characters may be replaced with '[a-zA-Z]+'. Here, the log data management system may store the original and masked data separately and process them so that they reference each other.

For example, referring to FIG. 5, the 'main' and 'Quorum' tokens in the '[main:Quorum]' string (11) may both be replaced with '[a-zA-Z]+', as in component 11a. According to this embodiment, the log data management system can mask individual tokens consisting of non-numeric characters in the log data. As a result, the error in which one or more log data sequences of a different type are clustered together with the log data belonging to a specific cluster may be avoided.

In still other embodiments related to step S300, a token consisting of a special character may be replaced with that same special character, and a token consisting of a space may be replaced with '\s'. Here, the log data management system may store the original and masked data separately and process them so that they reference each other.

For example, referring to FIG. 5, each of the individual tokens '[', ':', and ']' separated from the '[main:Quorum]' string (11) of the original log data is not replaced with a separate symbol even after the masking step and is kept as the original special character. The space (12) in the original may be replaced with '\s', a symbol denoting whitespace, as in component 12a. However, the scope of the present disclosure is not limited to the example shown in FIG. 5; any symbol that allows string formats to be distinguished based on the format of the strings in the log data may fall within the scope of the present disclosure.

According to this embodiment, the log data management system can mask individual tokens consisting of special characters or spaces in the log data. As a result, the error in which one or more log data sequences of a different type are clustered together with the log data belonging to a specific cluster may be avoided.
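The three masking rules described above (digits, alphabetic strings, and spaces or special characters) can be combined in one short sketch. The function names and the dictionary layout used to keep the original and masked tokens together are hypothetical; the disclosure only states that the two are stored separately and reference each other.

```python
def mask_token(token: str) -> str:
    """Replace a token by a symbol for its type, per the rules above."""
    if token.isdigit():
        return r"\d+"          # any run of digits
    if token.isascii() and token.isalpha():
        return "[a-zA-Z]+"     # any run of ASCII letters
    if token.isspace():
        return r"\s"           # whitespace
    return token               # special characters are kept as they are


def mask_sequence(tokens: list[str]) -> dict:
    """Keep original and masked tokens side by side (hypothetical layout).

    Both lists share indices, so either one can be looked up from the other.
    """
    return {"original": tokens, "masked": [mask_token(t) for t in tokens]}


record = mask_sequence(["2015", "-", "07", " ", "[", "main", ":", "Quorum", "]"])
```

Keeping both lists in one record is one simple way to let the system sort on the masked form while still displaying the original text.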

Next, in step S400, the log data management system may sort the log data sequences based on the separately stored original and masked data. Here, the system may sort the log data sequences using a method commonly used for sorting string data.

If clustering is performed without sorting, the computational complexity of clustering the log data increases and clustering takes excessive time. According to this embodiment, the plurality of log data sequences are sorted as strings, so when the log data management system clusters similar log data sequences, the complexity of the clustering computation is reduced, which shortens clustering time and may improve clustering accuracy.

In some embodiments related to step S400, the log data management system may sort the log data sequences using the masked log data as the key. Because log data is in most cases written left to right, sequences may be sorted starting from the leftmost character. For right-to-left scripts such as Arabic, however, sequences may instead be sorted starting from the rightmost character.
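As a rough sketch, sorting on the masked form while carrying the original along could look like the following. The record layout is a hypothetical one, and plain lexicographic ordering via Python's built-in `sorted` is used as one example of a general string sorting method.

```python
# Each record keeps the original text next to its masked form
# (hypothetical layout; the sample values are illustrative only).
records = [
    {"original": "WARN 42", "masked": r"[a-zA-Z]+\s\d+"},
    {"original": "2015 up", "masked": r"\d+\s[a-zA-Z]+"},
    {"original": "INFO 7", "masked": r"[a-zA-Z]+\s\d+"},
]

# Sort left to right on the masked string. sorted() is stable, so
# records with identical masks keep their input order.
ordered = sorted(records, key=lambda r: r["masked"])
ordered_originals = [r["original"] for r in ordered]
```

After sorting, records with the same masked pattern sit next to each other, which is what allows the later steps to compare only adjacent sequences.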

Next, in step S500, the user may determine a reference value for log data sequence clustering. This step is described in more detail with reference to FIGS. 7 and 8.

Here, the reference value (threshold) may mean the maximum token position up to which the log data patterns match. For example, referring to FIG. 8, the log data sequences corresponding to data[2] and data[5] are clustered into the same cluster when the user sets the reference value to 58. From the left up to token position 58, the log data sequence corresponding to data[2] has the masked pattern '\d+-\d+-\d+\s\d+:\d+:\d+\S+\d+\s-\s[a-zA-Z]+\S+\[[a-zA-Z]+'. From the left up to token position 58, data[5] has the same masked pattern, '\d+-\d+-\d+\s\d+:\d+:\d+\S+\d+\s-\s[a-zA-Z]+\S+\[[a-zA-Z]+', so data[2] and data[5] can be clustered into the same cluster.

In some embodiments related to step S500, the log data management system may receive the reference value from the user through a user interface that includes the sorted log data sequences and a reference value change control. For example, referring to FIG. 7, the user may enter the reference value (threshold) (43) by sliding the reference value change control (40) with mouse input (42). According to this embodiment, the user may easily perform log data sequence clustering by entering a reference value through an intuitive interface, without writing any code.

In some other embodiments related to step S500, the log data management system outputs in real time, in the user interface, the clustering result of the log data sequences based on the original log data rather than the masked log data, so that the user can check the result immediately and intuitively. According to this embodiment, the user checks the clustering result based on the original log data in real time and may adjust the reference value right away if the clustering result is not appropriate.

Next, in step S600, the log data management system may perform log data sequence clustering based on the reference value determined by the user. This step is described in more detail with reference to FIG. 6 and FIGS. 8 and 9.

In step S610 of FIG. 6, the log data management system may compute a difference value between a specific log data sequence and an adjacent log data sequence. The similarity measure used in the present disclosure may be the position at which the masked data of a specific log data sequence first differs from the adjacent masked data. For example, the strings 'abcd' and 'abxy' differ at the third character (c versus x), so the similarity between them is 2. The initial value of the similarity is 0.

With existing string similarity measures, all of data[N] had to be searched to analyze the similarity of the regions other than those replaced through data categorization, which cost at least O(N^2). According to this embodiment, because the data has been sorted by regular expression pattern as described above, it suffices to compute the similarity between data[i], the masked data of a specific log data sequence, and data[i+1], the adjacent masked log sequence data. This costs only O(N), so a method of generating a log data cluster classification model with significantly improved time complexity can be provided.

In some other embodiments related to step S610, the log data management system may advance p for as long as token[p] of the masked log sequence data of data[i] and of data[i+1] match, and record in the similarity matrix the value of p at which they first differ. Here, token[p] means the p-th token of the log data, and the variable p may correspond to the reference value (threshold) determined by the user. The maximum value of p may be the smaller of the lengths of data[i] and data[i+1].
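The similarity measure described for step S610 amounts to the length of the common prefix of two masked sequences. A minimal sketch, assuming the hypothetical name `similarity` and sequences that can be indexed (strings or lists of tokens):

```python
from typing import Sequence


def similarity(a: Sequence, b: Sequence) -> int:
    """Return the position at which two sequences first differ.

    p starts at 0 and is capped at the shorter of the two lengths,
    matching the description of p above.
    """
    p = 0
    limit = min(len(a), len(b))
    while p < limit and a[p] == b[p]:
        p += 1
    return p


example = similarity("abcd", "abxy")                         # first difference at 'c' vs 'x'
token_example = similarity([r"\d+", "-", r"\d+"], [r"\d+", ":", r"\d+"])
```

The same function works on lists of masked tokens, in which case the result counts matching tokens rather than characters.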

For example, referring to FIG. 8, when the reference value (31) determined by the user is 58, token[58] of data[1] is '+' and token[58] of data[2] is also '+', so the two can be clustered into one cluster. When the reference value is 60, however, they are separated into different clusters, so the similarity between data[1] and data[2] is recorded in the similarity matrix as 58.

Next, in step S620, when log data sequences data[i-1] and data[i] are clustered into the same cluster and data[i] and data[i+1] are clustered into the same cluster, the log data management system can determine that data[i-1] and data[i+1] may also be clustered into the same cluster. Clusters can therefore be classified through linear comparison.

In some embodiments related to step S620, referring to FIGS. 8 and 9, assume that the user has set the reference value to 60. After data[1] and data[2] are normalized, the longest run of matching tokens between them is 58, so the two may be clustered into different clusters. For data[3], data[4], and data[5], however, the longest runs of tokens matching data[2] are 89, 110, and 110, respectively. These similarities exceed the reference value of 60, so the sequences may be clustered into one cluster (30) associated with data[2]. FIG. 9 shows that data[3] also has the same tokens as data[2] up to token[110]. Likewise, referring to FIGS. 8 and 9, data[6] to data[9] are clustered into one cluster by the same logic.

If the log data management system clustered by measuring the similarity between data[i] and each of data[i+1] to data[i+n] (where n is the point at which the log data sequences end), the time complexity of the algorithm would exceed O(N) and log data clustering would be very inefficient. According to this embodiment, when log data sequences data[i-1] and data[i] are clustered into the same cluster and data[i] and data[i+1] are clustered into the same cluster, the log data management system can determine that data[i-1] and data[i+1] may also be clustered into the same cluster, which can substantially reduce the execution time of log data sequence clustering.
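The linear comparison described above can be sketched as a single pass over the sorted masked sequences. The names below are hypothetical, and the sketch assumes that two adjacent sequences belong to the same cluster when their similarity is greater than or equal to the reference value, which is how the examples with reference values 58 and 60 read.

```python
from typing import Sequence


def common_prefix_len(a: Sequence, b: Sequence) -> int:
    """Number of leading positions at which a and b agree."""
    p = 0
    while p < min(len(a), len(b)) and a[p] == b[p]:
        p += 1
    return p


def cluster_sorted(masked: list, threshold: int) -> list[list[int]]:
    """Group indices of already sorted masked sequences in one pass.

    Only neighbours are compared, so the number of comparisons is
    linear in the number of sequences.
    """
    clusters: list[list[int]] = []
    for i, seq in enumerate(masked):
        if i > 0 and common_prefix_len(masked[i - 1], seq) >= threshold:
            clusters[-1].append(i)   # same cluster as the previous sequence
        else:
            clusters.append([i])     # similarity too low: start a new cluster
    return clusters


data = ["aaab", "aaac", "aabz", "bxyz"]   # toy stand-ins, already sorted
strict = cluster_sorted(data, threshold=3)
loose = cluster_sorted(data, threshold=2)
```

Because the input is sorted, the common prefix of any two members of a run is at least the smallest neighbour similarity inside that run, which is why comparing neighbours is enough.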

The algorithm corresponding to the clustering method described so far in relation to step S600 may be predetermined and, depending on the case, may be changed from one algorithm to another. Note that when the suitability of the log data sequence clustering is judged, as described later in relation to S700, clustering may be performed again under different conditions if the result is not suitable.

Next, in step S700, the log data management system may evaluate the suitability of the log data sequence clustering result, because the purpose of separating logs for log data analysis may differ from one embodiment to another.

In some embodiments related to step S700, the user evaluates the log data sequence clustering result, so that the point at which clusters should be separated is determined by the user's judgment. According to this embodiment, because the user provides the criterion for distinguishing different log data, the log data sequences can be separated accurately in a way that suits the purpose for which the clustering result will be used.

Next, in step S800, the log data management system may generate a log data sequence clustering model in response to a determination, by the user or by the log data management system, that the log data sequence clustering result is suitable. Here, the clustering model is generated based on the fact that log data output by a specific log data output system shares a common prefix pattern. The step of generating the clustering model (S800) is described in detail below with reference to FIG. 11.

The operations related to generating the log data sequence clustering model described so far may each be used independently, but the scope of the present disclosure is not limited thereto, and a plurality of the operations may be used together.

The description continues with reference to FIGS. 11 to 13.

FIG. 11 is a flowchart illustrating in detail how the log data management system described with reference to FIG. 1 generates a log data clustering model.

In step S810, the log data management system may generate a trie (TRIE) data structure in which each token of the log data is a node. The step of generating the trie data structure (S810) is described in detail below with reference to FIGS. 12 and 13. Here, the data structure is generated based on the fact that log data output by a specific log data output system shares a common prefix pattern. In addition, each node may have, as an attribute value, the number of references per log data sequence cluster.

Next, in step S820, the log data management system may determine whether a cluster of log sequences is identified. For example, a cluster of log sequences may be identified by confirming that one or more clusters have been created because two or more log data sequences have a similarity greater than or equal to the reference value determined by the user.

Next, in step S830, in response to a cluster of log sequences being identified, the log data management system may automatically perform a predefined action for each cluster. Here, the predefined action may be an action that the user has defined in advance through a separate operation, or a default setting of the log data management system.

In some embodiments related to the predefined action of step S830, the predefined action may be to send an alarm message to a user terminal when a log belonging to a specific log data cluster is found. For example, the log data management system may send an alarm message to the user terminal in response to identifying that a log classified as 'WARN' has been clustered. According to this embodiment, the user may quickly recognize that a critical log has been output and take action before a critical error occurs in the system.

In some other embodiments related to the predefined action of step S830, the predefined action may be for the log data management system to forcibly initialize the log data output system that output the log when a log belonging to a specific log data cluster is found. For example, the log data management system may initialize the log data output system in response to identifying that a log reporting a connection unwanted by the user has been clustered. According to this embodiment, the log data management system can respond to security-related problems, which may occur at any time, without manual action by the user.

If the log data management system did not automatically perform predefined actions using the log data, the user would have to check each clustered log data sequence and carry out the desired work by hand. According to this embodiment, the log data management system automatically performs the action that the user has predefined for each log data sequence cluster, without any additional operation by the user, so that work that makes use of log data can be carried out more economically.
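One way to picture the predefined actions is a lookup table from cluster identifiers to callables. Everything in this sketch is hypothetical: the cluster identifier `"warn_cluster"`, the `send_alarm` stand-in, and the in-memory `sent` list used in place of a real message to a user terminal.

```python
from typing import Callable

sent: list[str] = []   # stand-in for messages delivered to a user terminal


def send_alarm(log_line: str) -> None:
    """Hypothetical action: record an alarm instead of really sending one."""
    sent.append(f"ALARM: {log_line}")


# Predefined actions keyed by cluster identifier (illustrative mapping).
actions: dict[str, Callable[[str], None]] = {"warn_cluster": send_alarm}


def on_clustered(cluster_id: str, log_line: str) -> bool:
    """Run the predefined action for a cluster, if one was registered."""
    action = actions.get(cluster_id)
    if action is None:
        return False
    action(log_line)
    return True


handled = on_clustered("warn_cluster", "Cannot open channel")
ignored = on_clustered("info_cluster", "Notification")
```

A table like this lets users register or replace actions per cluster without changing the clustering code itself.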

Hereinafter, the step of generating the trie data structure is described in detail with reference to FIGS. 12 and 13.

Referring to FIG. 12, in step S810-1, the log data management system may determine a representative log data sequence for each log data sequence cluster. In some embodiments, the log data management system may select the representative data in one of several ways: the data located at the top of the cluster, the data located at the bottom, the shortest data, or the longest data. Note, however, that the representative data must be extracted in the same way for every cluster.
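The four selection options can be sketched with a single function whose strategy is fixed once and then applied to every cluster. The function name and the strategy labels are assumptions made for this illustration.

```python
def pick_representative(cluster: list[str], strategy: str = "first") -> str:
    """Pick one representative sequence from a non-empty cluster."""
    if not cluster:
        raise ValueError("cluster must not be empty")
    if strategy == "first":      # data located at the top of the cluster
        return cluster[0]
    if strategy == "last":       # data located at the bottom of the cluster
        return cluster[-1]
    if strategy == "shortest":
        return min(cluster, key=len)
    if strategy == "longest":
        return max(cluster, key=len)
    raise ValueError(f"unknown strategy: {strategy}")


clusters = [["ab", "abcd", "abc"], ["x", "xy"]]
# The same strategy is applied to every cluster, as the text requires.
longest = [pick_representative(c, "longest") for c in clusters]
shortest = [pick_representative(c, "shortest") for c in clusters]
```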

Next, in step S810-2, the log data management system may tokenize the extracted representative data and store it in the trie data structure. The representative data may be tokenized in the same way that the log data was tokenized to generate the log data sequence clustering model.

In some embodiments related to step S810-2, referring to FIG. 13, node 50a is a node that stores representative log data whose first token has the form '\d+'. Here, the number in parentheses at node 50a is the number of clusters that node 50a references.
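A token trie with a per-node count of referencing clusters can be sketched as follows. The class and attribute names are hypothetical, and the sketch assumes one representative token sequence is inserted per cluster, so the counter on a node equals the number of clusters passing through it.

```python
class TrieNode:
    """One masked token in the trie (hypothetical layout)."""

    def __init__(self, token):
        self.token = token
        self.children: dict = {}
        self.ref_count = 0   # number of clusters that pass through this node


def insert(root: TrieNode, tokens: list[str]) -> None:
    """Insert one cluster's representative token sequence."""
    node = root
    for token in tokens:
        if token not in node.children:
            node.children[token] = TrieNode(token)
        node = node.children[token]
        node.ref_count += 1


root = TrieNode(None)
insert(root, [r"\d+", "-", r"\d+"])          # representative of one cluster
insert(root, [r"\d+", ":", "[a-zA-Z]+"])     # representative of another cluster
first = root.children[r"\d+"]
```

In this toy trie the shared first node plays the role that node 50a plays in FIG. 13: both clusters reference it, and the paths diverge below it.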

According to this embodiment, in connection with the step of clustering newly inserted log data described later, when new log data is input, the log data management system may convert it to regular expression form, tokenize it, and then assign it to the cluster with the highest similarity through a depth-first search (DFS) of the clustering model.
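How a new sequence might be matched against the trie can be sketched as below. This is one possible reading, not the disclosed procedure: the sketch follows the new sequence's masked tokens as deep as they match and then uses a depth-first search to find a cluster stored at or below that point. The nested dictionary layout and all names are hypothetical.

```python
def build(representatives: dict[str, list[str]]) -> dict:
    """Build a nested-dict trie; a node's 'cluster' marks where a representative ends."""
    root = {"children": {}, "cluster": None}
    for cluster_id, tokens in representatives.items():
        node = root
        for token in tokens:
            node = node["children"].setdefault(token, {"children": {}, "cluster": None})
        node["cluster"] = cluster_id
    return root


def first_cluster(node: dict):
    """Depth-first search for any cluster id at or below this node."""
    if node["cluster"] is not None:
        return node["cluster"]
    for child in node["children"].values():
        found = first_cluster(child)
        if found is not None:
            return found
    return None


def classify(root: dict, tokens: list[str]):
    """Return (cluster id, number of matched tokens) for a masked sequence."""
    node, depth = root, 0
    for token in tokens:
        child = node["children"].get(token)
        if child is None:
            break
        node, depth = child, depth + 1
    if depth == 0:
        return None, 0               # nothing matched: no known cluster
    return first_cluster(node), depth


model = build({"c1": [r"\d+", "-", r"\d+"], "c2": ["[a-zA-Z]+", ":"]})
match = classify(model, [r"\d+", "-", "[a-zA-Z]+"])
miss = classify(model, ["?"])
```

Every representative below the deepest matched node shares that many leading tokens with the new sequence, so any cluster found there has the highest prefix similarity available in the trie.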

Next, in step S810-3, the log data management system may merge two or more consecutive nodes that each have a single child node among the nodes of the trie data structure generated by tokenizing the representative data.

In some embodiments related to step S810-3, the log data management system may merge all nodes from the root node down to the point where nodes have a single child. Referring to FIG. 13, nodes 50a, 50b, and 50c each have only one child node in the data structure: 50d is the child of 50c, 50c is the child of 50b, and 50b is the child of 50a. When the log data management system searches for a cluster similar to a newly inserted log data sequence, visiting a run of single-child nodes one after another may be wasted work. According to this embodiment, the system may maximize memory efficiency by removing such unnecessary nodes from the generated trie data structure.
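The merge of step S810-3 resembles the path compression used in radix trees. The following is a minimal sketch under that reading, not the disclosed implementation. The `Node` class, the `merge_chains` function, and the token values are hypothetical; the chain loosely mirrors nodes 50a to 50d. As an added safeguard that the disclosure does not state, a node is only merged with its child when both have the same reference count, so that a representative ending at the parent is not lost.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    tokens: List[str]   # a merged node carries a run of tokens
    ref_count: int
    children: Dict[str, "Node"] = field(default_factory=dict)


def merge_chains(node: Node) -> Node:
    """Collapse runs of single-child nodes into one node, in place."""
    while len(node.children) == 1:
        (child,) = node.children.values()
        if child.ref_count != node.ref_count:
            break  # a sequence ends at `node`, so keep the two nodes apart
        node.tokens.extend(child.tokens)
        node.children = child.children
    for child in node.children.values():
        merge_chains(child)
    return node


# Hypothetical chain: a -> b -> c -> d, where only d branches.
d = Node(["INFO"], 2, {"start": Node(["start"], 1), "stop": Node(["stop"], 1)})
c = Node([r"\d+"], 2, {"INFO": d})
b = Node(["-"], 2, {r"\d+": c})
a = Node([r"\d+"], 2, {"-": b})

merged = merge_chains(a)
```

After the call, the four chained nodes are represented by one node holding four tokens, and a lookup reaches the branching point in a single step.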

Next, in step S810-4, the log data management system may delete all child nodes whose referenced cluster count is 1. However, a node such as node 51, whose own reference count is 1 but whose parent node's reference count is not, is not deleted.

For example, referring to FIG. 13, node 52 is a child node with a referenced cluster count of 1 under the '-(1)' node, whose referenced cluster count is also 1. Continuing to create new nodes even though the parent node already refers to only one cluster may cause inefficiency in clustering time and memory usage.

According to this embodiment, the log data management system deletes meaningless nodes that occupy memory. When it later clusters a new log data sequence, the number of node lookups may be reduced by the number of deleted child nodes and memory may be freed, which may save both time and storage.
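The pruning of step S810-4 can be sketched as follows, assuming the rule is that everything below a node with a reference count of 1 is removed while that node itself is kept. The `Node` class and the `sole_cluster` and `prune` functions are hypothetical names. Copying the cluster identifier up to the surviving node is an assumption added here so that the pruned trie can still report a cluster.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    ref_count: int
    cluster_id: Optional[int] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


def sole_cluster(node: Node) -> Optional[int]:
    """Follow the single path below a ref_count == 1 node to its cluster id."""
    while node.cluster_id is None:
        (node,) = node.children.values()
    return node.cluster_id


def prune(node: Node) -> None:
    """Drop everything below nodes that are referenced by exactly one cluster."""
    if node.ref_count == 1:
        node.cluster_id = sole_cluster(node)  # assumption: keep the id on the survivor
        node.children.clear()
        return
    for child in node.children.values():
        prune(child)


# Hypothetical trie: the root is shared by two clusters, each branch by one.
root = Node(2, None, {
    "-": Node(1, None, {"INFO": Node(1, None, {"start": Node(1, 0)})}),
    ":": Node(1, None, {"retry": Node(1, 1)}),
})
prune(root)
```

Here the `-` and `:` nodes behave like node 51 (reference count 1 under a parent with a higher count) and are kept, while the nodes below them behave like node 52 and are removed.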

According to the log data management method of some embodiments of the present disclosure described so far with reference to FIGS. 3 to 13, a log data clustering model may be generated for finding the cluster that matches a newly inserted log data sequence. More specifically, based on the fact that the log data output by a log data output system follows certain patterns, the log data is tokenized and masked, log data sequences whose similarity is equal to or greater than a user-determined reference value are clustered together, and, in response to the clustering result being suitable, a log data sequence clustering model having a trie data structure may be generated.
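The clustering summarized above is specified in the claims as sorting the masked sequences, taking the first offset at which two adjacent sequences differ as their difference value, and comparing that value with a reference value. The sketch below is one possible reading. It assumes that a larger offset means greater similarity, so that neighbours stay in the same cluster when the offset is at least the reference value; the function names and the sample sequences are hypothetical.

```python
from typing import List, Sequence


def first_diff_offset(a: Sequence[str], b: Sequence[str]) -> int:
    """Index of the first position at which two token sequences differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))  # one sequence is a prefix of the other


def cluster_sorted(sequences: List[List[str]], threshold: int) -> List[List[List[str]]]:
    """Sort masked sequences, then start a new cluster where neighbours diverge too early."""
    clusters: List[List[List[str]]] = []
    for seq in sorted(sequences):
        prev = clusters[-1][-1] if clusters else None
        # Assumption: identical sequences always share a cluster.
        if prev is not None and (seq == prev or first_diff_offset(prev, seq) >= threshold):
            clusters[-1].append(seq)
        else:
            clusters.append([seq])
    return clusters


# Hypothetical masked sequences.
logs = [
    ["boot", "ok"],
    [r"\d+", "-", "INFO", "stop"],
    [r"\d+", ":", "retry"],
    [r"\d+", "-", "INFO", "start"],
]
strict = cluster_sorted(logs, threshold=2)
loose = cluster_sorted(logs, threshold=1)
```

Lowering the reference value merges more neighbours into one cluster, which matches the role the claims give to a user-adjustable reference value.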

Hereinafter, a method by which the log data management system automatically clusters newly inserted log data using the log data sequence clustering model will be described in detail with reference to FIGS. 14 and 15.

In step S900, new log data may be inserted into the log data management system. Here, the newly inserted log data may be log data newly generated by the log data output system. The log data sequence clustering model can be clearly understood by referring to FIGS. 3 to 13 described above.

Next, in step S1000, the new log data may be tokenized. Here, the log data may consist of spaces, special characters (!, @, #, and so on), digits, letters, and the like. The method of tokenizing log data can be clearly understood by referring to FIG. 5 described above.

Next, in step S1100, the log data management system may mask the individual tokens of the tokenized log sequence. Here, masking means replacing tokens with a single consistent expression when the strings they contain are of the same type, even if the strings themselves differ. The method of masking the tokens can be clearly understood by referring to FIG. 5 described above.
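Steps S1000 and S1100 can be sketched together. The tokenization rule (runs of letters, runs of digits, and single special characters) and the single masking rule that replaces digit runs with `\d+` are assumptions made for illustration, since the actual rules are described with reference to FIG. 5. `TOKEN_RE`, `MASKS`, and `mask_line` are hypothetical names.

```python
import re
from typing import List

# Assumed tokenizer: runs of letters, runs of digits, or one special character.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")

# Assumed masking table: (pattern a whole token must match, replacement expression).
MASKS = [
    (re.compile(r"\d+"), r"\d+"),
]


def mask_line(line: str) -> List[str]:
    """Tokenize a log line and replace each token by the mask for its type."""
    masked = []
    for token in TOKEN_RE.findall(line):
        for pattern, mask in MASKS:
            if pattern.fullmatch(token):
                token = mask
                break
        masked.append(token)
    return masked
```

With these rules, two lines that differ only in their numeric values, such as `job 7 done` and `job 1234 done`, produce the same masked sequence and can therefore follow the same path through the trie.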

Next, in step S1200, the log data management system may search for a cluster similar in form to the new log data by inputting the masked new log data sequence into the log data sequence clustering model, which takes the form of a trie data structure with each token as a node.

In some embodiments related to step S1200, the log data management system may end the search when it reaches the last token of the new log data while searching for a cluster similar in form to the new log data.

For example, referring to FIG. 15, assume that the new log data has been input to the clustering model. The log data management system confirms that token[1] of the new log data is the same symbol as T1 (60) and moves to the next node, confirms that token[2] is the same symbol as T2 (61) and moves to the next node, and confirms that token[3] is the same symbol as T9 (62b). Having confirmed that token[3] is the last token of the new log data, it ends the search and may evaluate that the new log data and T9 (62b) belong to a similar cluster.

In some other embodiments related to step S1200, the log data management system may end the search when, during the search, it reaches a node of the data structure whose reference count is 1.

For example, referring to FIG. 15, assume that the new log data has been input to the clustering model. The log data management system confirms that token[1] of the new log data is the same symbol as T1 (60) and moves to the next node, confirms that token[2] is the same symbol as T2 (61) and moves to the next node, and confirms that token[3] is the same symbol as T9 (62b). Then, in response to the reference count of node T9 (62b) being 1, it ends the search and may evaluate that the new log data is similar to T9 (62b).

In still other embodiments related to step S1200, the log data management system may end the search when no matching token remains before the last token of the new log has been reached.

For example, referring to FIG. 15, assume that the new log data has been input to the clustering model. The log data management system confirms that token[1] of the new log data is the same symbol as T1 (60) and moves to the next node, confirms that token[2] is the same symbol as T2 (61) and moves to the next node, and confirms that token[3] matches neither T3 (62a) nor T9 (62b). It then ends the search and may evaluate that no cluster similar to the new log data exists.

Next, in step S1300, the log data management system may evaluate, based on the search result, whether a cluster similar to the new log data exists in the clustering model. The evaluation method can be clearly understood by referring to the embodiments related to step S1200.

When a tree-type data structure is searched without a termination rule specified for the log data management system, the system repeatedly performs meaningless lookups. This may make the log data management method inefficient in terms of time complexity.

According to this embodiment, the log data management system can complete the cluster search early without inspecting every token of the new log data, which may greatly improve the time efficiency of clustering new log data.
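The three termination rules of step S1200 can be combined in one lookup, sketched below under stated assumptions. The `Node` class and `find_cluster` function are hypothetical, and the small model only imitates the T1, T2, T3, and T9 nodes mentioned for FIG. 15. When the tokens run out on a node that no single cluster owns, this sketch returns `None`, which is one way of reading the evaluation in step S1300.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    ref_count: int = 0
    cluster_id: Optional[int] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


def find_cluster(root: Node, tokens: List[str]) -> Optional[int]:
    """Walk the trie token by token and stop as soon as the outcome is decided."""
    node = root
    for token in tokens:
        child = node.children.get(token)
        if child is None:
            return None             # no matching node: no similar cluster
        node = child
        if node.ref_count == 1:
            return node.cluster_id  # only one cluster passes here: stop early
    return node.cluster_id          # last token reached


# Hypothetical model: T1 -> T2 -> {T3, T9}.
model = Node(children={
    "T1": Node(2, None, {
        "T2": Node(2, None, {
            "T3": Node(1, 0),
            "T9": Node(1, 1),
        }),
    }),
})
```

The early stop means that a long sequence such as `["T1", "T2", "T9", "T4", "T5"]` is resolved after three lookups, without visiting its remaining tokens.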

Next, in step S1400, a separate indication may be added to the new data in response to either of two outcomes: the log data management system can no longer find a node matching token[p] of the new log data in the log data clustering model and therefore cannot advance through the tree, or nodes corresponding to every token[p] existed but no node with a reference count of 1 was encountered.

In some embodiments related to step S1400, the unprocessed data may be data whose cluster was not included when the model was generated, or data that was intentionally left out of the clustering model because it did not need to be processed. The scope of the present disclosure is not limited to these examples, and every case in which data is not processed, intentionally or unintentionally, may fall within the scope of the present disclosure.

According to this embodiment, even when the log data management system fails to classify the new log data into a cluster of the clustering model, the system can be prevented from stopping due to a program error.

Next, in step S1500, the log data management system may return the corresponding cluster in response to a cluster similar to the new log data existing. Here, returning the cluster may consist of providing a separate indication in the user interface so that a user can see that the existing cluster and the new log data belong to the same class.

It should be noted that the scope of the present disclosure is not limited to this embodiment, and any further operation premised on a response that a cluster similar to the new log data exists may fall within the scope of the present disclosure.
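Steps S1400 and S1500 can be sketched as a labeling pass that never raises on unmatched input. This is an illustrative assumption about how the "separate indication" might be represented; `LabeledLog`, `label_logs`, and the `lookup` callable are hypothetical names, and the dictionary used as a lookup stands in for a real cluster search.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class LabeledLog:
    raw: str
    cluster_id: Optional[int]  # None when no cluster matched
    unclassified: bool         # stands in for the separate indication of step S1400


def label_logs(lines: Iterable[str],
               lookup: Callable[[str], Optional[int]]) -> List[LabeledLog]:
    """Attach a cluster id to each line, or flag the line instead of failing."""
    labeled = []
    for line in lines:
        cluster_id = lookup(line)
        labeled.append(LabeledLog(line, cluster_id, cluster_id is None))
    return labeled


# Hypothetical lookup table standing in for the trie search.
known = {"disk full": 0, "login ok": 1}
result = label_logs(["login ok", "kernel panic"], known.get)
```

Because an unmatched line is flagged rather than treated as an error, processing continues with the next line, which is the behaviour the embodiment describes.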

FIG. 16 is a hardware configuration diagram of a computing device according to some embodiments of the present disclosure. The computing device 1000 shown in FIG. 16 may be, for example, the log data management system 100 described with reference to FIG. 1. The computing device 1000 may include one or more processors 1100, a system bus 1600, a communication interface 1200, a memory 1400 that loads a computer program 1500 executed by the processor 1100, and a storage 1300 that stores the computer program 1500.

The processor 1100 controls the overall operation of each component of the computing device 1000. The processor 1100 may perform operations for at least one application or program for executing methods/operations according to various embodiments of the present disclosure. The memory 1400 stores various data, commands, and/or information. The memory 1400 may load one or more computer programs 1500 from the storage 1300 in order to execute methods/operations according to various embodiments of the present disclosure. The bus 1600 provides a communication function between the components of the computing device 1000. The communication interface 1200 supports Internet communication of the computing device 1000. The storage 1300 may non-temporarily store one or more computer programs 1500. The computer program 1500 may include one or more instructions in which methods/operations according to various embodiments of the present disclosure are implemented. When the computer program 1500 is loaded into the memory 1400, the processor 1100 may perform the methods/operations according to various embodiments of the present disclosure by executing the one or more instructions.

The computer program 1500 may include instructions for performing: an operation of receiving log data including a plurality of log sequences; an operation of loading a clustering modeling program; an operation of executing the clustering modeling program loaded in the memory; an operation in which the clustering program masks some of the tokens of the log sequence; an operation in which the clustering program identifies a cluster of the log sequence by inputting the masked log sequence into a clustering model in the form of a trie data structure with each token as a node; an operation of outputting the identified cluster of the log sequence; and an operation of attaching information about the identified cluster to the log sequence.

So far, various embodiments of the present disclosure and their effects have been described with reference to FIGS. 1 to 16. The effects of the technical idea of the present disclosure are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

The technical idea of the present disclosure described so far may be implemented as computer-readable code on a computer-readable medium. The computer program recorded on the computer-readable recording medium may be transmitted to another computing device over a network such as the Internet and installed on that computing device, and may thereby be used on that computing device.

Although operations are shown in a specific order in the drawings, it should not be understood that the operations must be performed in the specific order shown or in sequential order, or that all of the illustrated operations must be performed, to obtain the desired results. In certain situations, multitasking and parallel processing may be advantageous. Although embodiments of the present disclosure have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present disclosure pertains will understand that the present invention may be implemented in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within an equivalent scope should be construed as falling within the scope of rights of the technical idea defined by the present disclosure.

Claims

A log data management method performed by a computing system, the method comprising:
masking some of the tokens of log data including a plurality of log sequences;
sorting the masked log sequences; and
clustering the log sequences using a result of sorting the log sequences.

The log data management method of claim 1,
wherein the masking comprises:
identifying a masking target token for each of the plurality of log sequences; and
replacing the masking target token with a masking expression corresponding to the masking target token.

The log data management method of claim 2,
wherein the replacing comprises:
determining a type of the masking target token; and
replacing the masking target token with a masking expression corresponding to the type of the masking target token.

The log data management method of claim 1,
further comprising generating a clustering model using a result of the clustering.

The log data management method of claim 4,
wherein generating the clustering model comprises
generating a trie data structure with each token as a node, and
wherein each node has a reference count for each cluster as an attribute value.

The log data management method of claim 5,
wherein generating the trie data structure comprises:
determining a representative log sequence for each cluster; and
generating the trie data structure using the representative log sequences.

The log data management method of claim 5,
wherein generating the trie data structure comprises,
when two or more consecutive nodes each having a single child node exist in the direction from the root node to a leaf node, merging the two or more consecutive nodes each having a single child node.

The log data management method of claim 5,
wherein generating the trie data structure comprises
deleting all child nodes of a node having '1' as the reference count.

The log data management method of claim 4,
wherein generating the clustering model comprises,
in response to identifying a cluster of a log sequence using the generated clustering model, automatically performing an action predefined for each cluster.

The log data management method of claim 1,
wherein the clustering comprises:
calculating a difference value between adjacent log sequences according to the result of sorting the log sequences; and
clustering the log sequences based on whether the difference value exceeds a reference value.

The log data management method of claim 10,
wherein calculating the difference value comprises:
with respect to a first log sequence and a second log sequence that are adjacent, identifying the first offset at which a token of the first log sequence differs from the token of the second log sequence; and
determining the offset as the difference value between the first log sequence and the second log sequence.

The log data management method of claim 10,
wherein the reference value is
a value determined according to a user input to a user interface that includes the result of sorting the log sequences and a reference value change control.

The log data management method of claim 12,
wherein the user interface,
when a user input for changing the reference value is applied through the reference value change control, displays the changed clustering result.

The log data management method of claim 12,
wherein the user interface,
when a user input for changing the reference value is applied through the reference value change control, displays the changed clustering result by overlaying the area of each cluster on the result of sorting the log sequences.

A log data management method performed by a computing system, the method comprising:
masking some of the tokens of a log sequence; and
identifying a cluster of the log sequence by inputting the masked log sequence into a clustering model in the form of a trie data structure with each token as a node.

The log data management method of claim 15,
wherein the masking comprises:
identifying a masking target token for each of the plurality of log sequences; and
replacing the masking target token with a masking expression corresponding to the type of the masking target token.

The log data management method of claim 15,
wherein each node of the trie data structure has a cluster reference count, and
wherein identifying the cluster of the log sequence comprises:
traversing nodes by matching the tokens of the masked log sequence against the nodes of the clustering model, terminating the node traversal when a node whose reference count is 1 is reached, and identifying the cluster indicated by the node at the time of termination as the cluster of the masked log sequence.

The log data management method of claim 15,
further comprising, in response to identifying the cluster of the log sequence, automatically performing an action predefined for each cluster.

A log data management system, as a computing system, comprising:
a network interface that receives log data including a plurality of log sequences;
a memory into which a clustering modeling program is loaded; and
one or more processors that execute the clustering modeling program loaded in the memory,
wherein the clustering modeling program comprises:
an instruction for masking some of the tokens of each log sequence included in the log data;
an instruction for sorting the masked log sequences; and
an instruction for clustering the log sequences using a result of sorting the log sequences.

A log data management system, as a computing system, comprising:
a network interface that receives a log sequence;
a memory into which a clustering program is loaded; and
one or more processors that execute the clustering program loaded in the memory,
wherein the clustering program comprises:
an instruction for masking some of the tokens of the log sequence;
an instruction for identifying a cluster of the log sequence by inputting the masked log sequence into a clustering model in the form of a trie data structure with each token as a node; and
an instruction for performing at least one of outputting the identified cluster of the log sequence and attaching information about the identified cluster to the log sequence.