KR20210124811A

KR20210124811A - Apparatus and method for generating training data for network failure diagnosis

Info

Publication number: KR20210124811A
Application number: KR1020200042336A
Authority: KR
Inventors: 김태영
Original assignee: 주식회사 케이티
Priority date: 2020-04-07
Filing date: 2020-04-07
Publication date: 2021-10-15
Also published as: KR102511985B1

Abstract

Disclosed are a training data generation apparatus and a training data generation method capable of generating training data that increases accuracy and stably increases recall while reducing the bias of an artificial intelligence engine for diagnosing network failures. According to the present invention, the training data generation apparatus for generating training data for diagnosing a network failure includes: a preprocessing unit that classifies and labels, by failure type, alert data collected from network devices and vectorizes the alert data into a predetermined dimension; a first data increasing unit that selects insufficient data meeting a predetermined criterion from among the vectorized data and interpolates the selected insufficient data to increase the insufficient data; a feature extraction unit that receives, as input data, the vectorized data in which the insufficient data has been increased, extracts features of the input data and clusters them by failure type to generate compressed data, and performs training to minimize the difference between the input data and output data obtained by restoring the compressed data; and a second data increasing unit that adds predetermined data to a cluster region corresponding to a failure type selected by a user from among the regions of the compressed data.

Description

Apparatus and method for generating training data for network failure diagnosis

The present invention relates to network failure diagnosis using artificial intelligence, and more particularly to a training data generation apparatus and method for generating the training data used to train an artificial intelligence that diagnoses network failures.

A conventional network failure diagnosis system determines whether a root failure is present on the basis of simple rules, so its accuracy is low. Artificial intelligence is therefore being introduced into network failure diagnosis. AI-based network failure diagnosis determines whether a root failure is present from empirically learned data, so when training data is insufficient, accuracy drops and an incomplete failure diagnosis model (Incomplete Model) is produced.

When training data is insufficient, the AI engine trains on imbalanced data and becomes biased toward the network failure types that make up the majority of the data. When an AI engine that has learned this bias is deployed in the field, it is more likely to misclassify data belonging to minority failure types. Consequently, even if the accuracy of the AI engine for network failure diagnosis is high, recall, which measures classification performance for a network failure type (that is, a class) with few training samples, can drop sharply. For example, even if the overall accuracy of network failure diagnosis is 90% or higher, accuracy can fall sharply for network failure types that occur infrequently.

Adjusting the training data can be considered, but it is difficult to balance it against the large volume of data already collected. In particular, a conventional network failure diagnosis system balances the training data not by increasing the data of under-represented failure types but by reducing the amount of data of relatively abundant failure types. This approach can balance the training data across failure types and reduce the training bias of the AI engine, but useful information is lost when data is removed.

Instead of reducing the amount of data, increasing the data of under-represented failure types can be considered. In the AI field, scarce data is commonly increased by operations such as shifting data up, down, left, or right, mirroring it, or rotating it, or by transforming a small number of samples to generate new synthetic data. However, unlike an image made of pixels, the alert data collected when a network failure occurs does not explicitly reveal the correlations among its variables, and it cannot readily be expressed through operations such as shifting, mirroring, or rotation. It is also difficult for an operator to selectively generate data of a desired type in order to increase the insufficient training data. Therefore, when existing techniques for generating and transforming training data are applied, it is difficult to guarantee generalized performance of an AI engine for network failure diagnosis.

The present invention has been proposed to solve the above problems. Its object is to provide a training data generation apparatus and method that generate training data so as to increase accuracy and stably increase recall while reducing the bias of an AI engine for network failure diagnosis.

A training data generation apparatus for generating training data for network failure diagnosis according to an embodiment includes: a preprocessing unit that classifies and labels, by failure type, alert data collected from network devices and vectorizes the alert data into a predetermined dimension; a first data increasing unit that selects insufficient data meeting a predetermined criterion from among the vectorized data and interpolates the selected insufficient data to increase the insufficient data; a feature extraction unit that receives, as input data, the vectorized data in which the insufficient data has been increased, extracts features of the input data and clusters them by failure type to generate compressed data, and performs training to minimize the difference between the input data and output data obtained by restoring the compressed data; and a second data increasing unit that adds predetermined data to a cluster region corresponding to a failure type selected by a user from among the regions of the compressed data.

The first data increasing unit may interpolate the selected insufficient data to increase the insufficient data in each of an inner region and an outer region of the insufficient data.

To increase the insufficient data in the inner region, the first data increasing unit may add data at an internal division point of a line segment connecting two pieces of insufficient data. To increase the insufficient data in the outer region, it may add data at an external division point of a line segment connecting two pieces of insufficient data.

The first data increasing unit may adjust the inner region and the outer region by applying a weight.

The feature extraction unit may express the occurrence probability of each element of the vectorized data, in which the insufficient data has been increased, as a sum of per-failure-type probability density functions, estimate the mean and variance of the entire data from it, and generate the compressed data using the estimated mean and variance.

The second data increasing unit may add the predetermined data at an internal division point of a line segment connecting any two pieces of data selected within the inner region of the cluster region corresponding to the selected failure type.

A method of generating training data for network failure diagnosis in a training data generation apparatus according to an embodiment includes: classifying and labeling, by failure type, alert data collected from network devices and vectorizing the alert data into a predetermined dimension; selecting insufficient data meeting a predetermined criterion from among the vectorized data and interpolating the selected insufficient data to increase the insufficient data; receiving, as input data, the vectorized data in which the insufficient data has been increased, extracting features of the input data and clustering them by failure type to generate compressed data, and performing training to minimize the difference between the input data and output data obtained by restoring the compressed data; and adding predetermined data to a cluster region corresponding to a failure type selected by a user from among the regions of the compressed data.

The step of increasing the insufficient data may interpolate the selected insufficient data to increase the insufficient data in each of an inner region and an outer region of the insufficient data.

In the step of increasing the insufficient data, data may be added at an internal division point of a line segment connecting two pieces of insufficient data to increase the insufficient data in the inner region, and data may be added at an external division point of a line segment connecting two pieces of insufficient data to increase the insufficient data in the outer region.

The step of increasing the insufficient data may include adjusting the inner region and the outer region by applying a weight.

The step of performing the training may express the occurrence probability of each element of the vectorized data, in which the insufficient data has been increased, as a sum of per-failure-type probability density functions, estimate the mean and variance of the entire data from it, and generate the compressed data using the estimated mean and variance.

The step of adding the data may add the predetermined data at an internal division point of a line segment connecting any two pieces of data selected within the inner region of the cluster region corresponding to the selected failure type.

According to the present invention, increasing the insufficient data among the training data of an AI engine for network failure diagnosis reduces the bias of the AI engine and stably increases recall as well as accuracy. In particular, the insufficient data is not produced by simply transforming it or by synthesizing data of an arbitrary type from data compressed through dimensionality reduction. Instead, it is generated in consideration of the difference in distribution between real data and synthetic data, which prevents the overfitting of the AI engine caused by imbalanced training data.

In addition, when the encoder of an autoencoder extracts and compresses features from the vectorized data, the present invention forms boundaries that separate the failure types in terms of Euclidean distance. When the user then selects a region lacking data, compressed information of the corresponding failure type is transferred into the selected region, so that training data balanced across all failure types can be generated.

FIG. 1 is a schematic configuration diagram of a training data generation apparatus according to an embodiment.
FIG. 2 is a diagram illustrating an example of a table of alert data according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of increasing data in an inner region and an outer region of insufficient data according to an embodiment of the present invention.
FIG. 4 is a diagram comparing conventional compressed data with compressed data according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a training data generation method according to an embodiment of the present invention.

The above objects, features, and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, so that those of ordinary skill in the art to which the present invention pertains can readily practice its technical idea. In describing the present invention, a detailed description of a known technology related to the present invention is omitted where it would unnecessarily obscure the gist of the invention. Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic configuration diagram of a training data generation apparatus according to an embodiment. The training data generation apparatus 100 according to this embodiment may include a memory, a memory controller, one or more processors (CPUs), a peripheral interface, an input/output (I/O) subsystem, a display device, an input device, and a communication circuit. The memory may include high-speed random access memory and may also include one or more magnetic disk storage devices, non-volatile memory such as flash memory devices, or other non-volatile semiconductor memory devices. Access to the memory by other components, such as the processor and the peripheral interface, may be controlled by the memory controller. The memory may store various information and program instructions, and the programs are executed by the processor. The peripheral interface connects the input/output peripherals to the processor and the memory. The one or more processors execute various software programs and/or instruction sets stored in the memory to perform the functions of the system and to process data. The I/O subsystem provides an interface between input/output peripherals, such as the display device and the input device, and the peripheral interface. The communication circuit communicates through an external port or by RF signals. The communication circuit converts electrical signals to RF signals and vice versa, and through these RF signals it can communicate with a communication network, other mobile gateway devices, and communication devices. As shown in FIG. 1, the training data generation apparatus 100 includes a preprocessing unit 110, a first data increasing unit 120, a feature extraction unit 130, and a second data increasing unit 140. These components may be implemented as programs stored in the memory and executed by the processor, or may be implemented by a combination of hardware and software.

The preprocessing unit 110 receives and preprocesses alert data collected from various network devices when a failure occurs. The preprocessing unit 110 may receive the alert data from a network management system (NMS) that collects alert data generated by various network devices. The preprocessing unit 110 may classify the alert data by failure, classify the types of alert data within each failure, perform labeling, convert the result into table form, and store it in a storage unit (for example, a database). The alert data may include identification information and type information of the network device in which the failure occurred, alert type information, and hierarchy information of the network device in which the failure occurred. Alert types include, for example, LOS (Loss of Signal), LOP (Loss of Pointer), and RDI (Remote Defect Indication).

The preprocessing unit 110 vectorizes the alert data of each failure into a predetermined dimension on the basis of the alert data table generated for that failure. The preprocessing unit 110 receives dimension information (for example, one-dimensional or two-dimensional) from the user and vectorizes the alert data of each failure accordingly. This embodiment uses one-dimensional vectorization as an example. FIG. 2 is a diagram illustrating an example of a table of alert data according to an embodiment of the present invention. The table shown in FIG. 2 summarizes the alert data generated in a single failure. As shown in FIG. 2, the table includes the network devices that generated alert data in that failure, the types of alert data generated by each network device, and the hierarchy information of each network device. The preprocessing unit 110 generates one-dimensional vector data by preprocessing the table, for example by summing the number of alert data of each type. The one-dimensional vector data therefore includes the type of network equipment, the alert types, the number of alert data per type, and hierarchy information.
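As a rough illustration of the preprocessing described above, the following Python sketch turns the alert table of one failure into a one-dimensional vector of per-type alert counts. The row fields (`device`, `alert_type`), the fixed type order, and the function name `vectorize_alerts` are hypothetical. The source states only that the number of alerts of each type is summed, and an actual vector would also encode equipment type and hierarchy information.

```python
from collections import Counter

# Hypothetical fixed ordering of alert types; the source names LOS, LOP and RDI as examples.
ALERT_TYPES = ["LOS", "LOP", "RDI"]

def vectorize_alerts(rows, alert_types=ALERT_TYPES):
    """Return a 1-D vector holding the number of alerts of each type."""
    counts = Counter(row["alert_type"] for row in rows)
    return [counts.get(t, 0) for t in alert_types]

# Toy table for a single failure (field names are assumptions).
table = [
    {"device": "node-A", "alert_type": "LOS"},
    {"device": "node-B", "alert_type": "LOS"},
    {"device": "node-B", "alert_type": "RDI"},
]
vector = vectorize_alerts(table)
```

Keeping the type order fixed means every failure maps to a vector of the same length, which is what the later interpolation steps require.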

The first data increasing unit 120 selects insufficient data meeting a predetermined criterion from among the data vectorized for each failure by the preprocessing unit 110, and interpolates the selected insufficient data to increase it. Preferably, the first data increasing unit 120 selects as insufficient data the vectorized data generated from failures of a failure type for which alert data is relatively scarce among the plurality of failure types. For example, it selects the vectorized data of the failure type with the smallest count, or the vectorized data of failure types whose count is below a predetermined ratio of the largest count. Suppose there are six failure types, with 50 vectorized data for type 1, 40 for type 2, 35 for type 3, 20 for type 4, 15 for type 5, and 13 for type 6. Under the first criterion, the 13 vectorized data of type 6, which has the smallest count, are selected as insufficient data. Under the second criterion, the vectorized data of types 5 and 6, whose counts are 30% or less of that of type 1, the largest, are selected as insufficient data.
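The two selection criteria can be sketched as follows, using the counts from the example. The function name `select_insufficient` and the `percent` parameter are illustrative assumptions. The comparison uses "at most" to match the 30% example, in which type 5 with exactly 15 data is selected.

```python
def select_insufficient(counts, percent=None):
    """Pick the failure types whose data is treated as insufficient.

    counts:  mapping of failure type -> number of vectorized data.
    percent: if None, pick only the type(s) with the smallest count;
             otherwise pick every type whose count is at most
             `percent` percent of the largest count.
    """
    if percent is None:
        smallest = min(counts.values())
        return sorted(t for t, n in counts.items() if n == smallest)
    largest = max(counts.values())
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return sorted(t for t, n in counts.items() if n * 100 <= percent * largest)

counts = {1: 50, 2: 40, 3: 35, 4: 20, 5: 15, 6: 13}
by_smallest = select_insufficient(counts)
by_ratio = select_insufficient(counts, percent=30)
```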

When interpolating the selected insufficient data, the first data increasing unit 120 increases the insufficient data in each of the inner region and the outer region of the insufficient data. Specifically, to increase the insufficient data in the inner region, the first data increasing unit 120 may add data at an internal division point of a line segment connecting any two pieces of insufficient data. To increase the insufficient data in the outer region, it may add data at an external division point of a line segment connecting any two pieces of insufficient data.

For example, when the selected insufficient data are X₁ = (α₁, β₁, γ₁, δ₁, ...) and X₂ = (α₂, β₂, γ₂, δ₂, ...), the first data increasing unit 120 increases the insufficient data according to Equation 1 and Equation 2 below, using preset parameters p, q, s, and r. That is, data is added to the inner region and the outer region of the selected insufficient data. The data added to the inner region forms one vector, and the data added to the outer region forms another vector.

(Equation 1)

(Equation 2)

Equation 1 adds data at an internal division point of a line segment connecting any two pieces of the selected insufficient data, and Equation 2 adds data at an external division point of a line segment connecting any two pieces of the selected insufficient data. An internal division point lies in the inner region of the selected insufficient data, and an external division point lies in its outer region.
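Equations 1 and 2 are not reproduced in this text, so the following Python sketch falls back on the textbook formulas for internal and external division of a line segment. Pairing p and q with the internal point and s and r with the external point is an assumption, as are the function names. The patent's exact expressions may differ.

```python
def internal_point(x1, x2, p, q):
    """Point dividing the segment x1-x2 internally in the ratio p:q (p, q > 0)."""
    return [(q * a + p * b) / (p + q) for a, b in zip(x1, x2)]

def external_point(x1, x2, s, r):
    """Point dividing the segment x1-x2 externally in the ratio s:r (s != r)."""
    if s == r:
        raise ValueError("external division is undefined when s == r")
    return [(s * b - r * a) / (s - r) for a, b in zip(x1, x2)]

x1, x2 = [0.0, 0.0], [4.0, 8.0]
inner = internal_point(x1, x2, 1, 3)   # lies between x1 and x2
outer = external_point(x1, x2, 3, 1)   # lies on the same line, beyond x2
```

With these inputs the internal point sits a quarter of the way from `x1` to `x2`, while the external point falls outside the segment, which corresponds to adding data in the outer region.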

The first data increasing unit 120 may adjust the region in which data is increased on the basis of a weight ε received from the user. That is, the first data increasing unit 120 may adjust each of the parameters p, q, s, and r by adding the weight ε to it or subtracting ε from it, and thereby adjust the region in which data is increased. This makes it possible to increase the asymmetry of the data so as to prevent overfitting of the AI engine for network failure diagnosis. The first data increasing unit 120 may determine the number of data to add on the basis of a generation count received from the user. That is, on the basis of the generation count received from the user, it adjusts the number of pairs formed from the insufficient data and thereby adjusts the number of data added.
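The sketch below illustrates one way the weight ε and the generation count could interact. It is written under explicit assumptions: ε is added to p and subtracted from q (the source does not say which parameters receive which sign), pairs are taken in the order produced by `itertools.combinations`, and only internal points are generated. The function name `augment_interior` is hypothetical.

```python
from itertools import combinations, islice

def augment_interior(samples, n_new, p, q, eps=0.0):
    """Create up to n_new synthetic points at internal division points of sample pairs.

    eps shifts the ratio p:q to (p + eps):(q - eps); this way of applying
    the weight is an assumption, not taken from the source.
    """
    p_adj, q_adj = p + eps, q - eps
    if p_adj <= 0 or q_adj <= 0:
        raise ValueError("adjusted ratio must stay positive for an internal point")
    # Limit the number of pairs used so that at most n_new points are produced.
    pairs = islice(combinations(samples, 2), n_new)
    return [
        [(q_adj * a + p_adj * b) / (p_adj + q_adj) for a, b in zip(x1, x2)]
        for x1, x2 in pairs
    ]

samples = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
midpoints = augment_interior(samples, n_new=2, p=1, q=1)
shifted = augment_interior(samples, n_new=1, p=1, q=1, eps=0.5)
```

If `n_new` exceeds the number of available pairs, the function simply returns one point per pair, so the pair count is the upper bound on generated data in this sketch.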

FIG. 3 is a diagram illustrating an example of increasing data in the inner region and the outer region of insufficient data according to an embodiment of the present invention. The example of FIG. 3 shows only the data of failure type 1 and failure type 2, and the insufficient data is the data of failure type 2. In the data of failure type 2, data is added at an internal division point 310 of a line segment connecting two arbitrary pieces of data, and data is also added at an external division point 320 of a line segment connecting another two arbitrary pieces of data. The positions of the internal and external division points are determined by the values of the parameters p, q, s, and r and by the weight ε.

Referring again to FIG. 1, the feature extraction unit 130 receives, as input data, the vectorized data including the data added by the first data increasing unit 120, extracts features of the input data and clusters them by failure type to generate compressed data, and performs training to minimize the difference between the input data and output data obtained by restoring the compressed data. Preferably, the feature extraction unit 130 uses an autoencoder. An autoencoder includes an encoder that extracts features from the input data and compresses them, and a decoder that restores the original data from the compressed data and outputs it. Training proceeds in the direction that minimizes the difference between the output data of the decoder and the input data.
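The training objective of an autoencoder, minimizing the gap between the input and its reconstruction, can be shown with a deliberately tiny example. The sketch below is a linear 2-D to 1-D to 2-D autoencoder trained by plain gradient descent on toy data. It does not implement the patent's per-failure-type clustering, and the initial weights, learning rate, and epoch count are arbitrary choices made for illustration.

```python
def train_linear_autoencoder(data, epochs=2000, lr=0.02):
    """Train a 2-D -> 1-D -> 2-D linear autoencoder by gradient descent.

    Returns (encoder weights, decoder weights, loss history).
    """
    enc = [0.1, 0.1]   # encoder: z = enc . x
    dec = [0.1, 0.1]   # decoder: x_hat = dec * z
    history = []
    n = len(data)
    for _ in range(epochs):
        g_enc, g_dec, loss = [0.0, 0.0], [0.0, 0.0], 0.0
        for x in data:
            z = enc[0] * x[0] + enc[1] * x[1]             # compressed value
            err = [dec[i] * z - x[i] for i in range(2)]    # reconstruction error
            loss += sum(e * e for e in err) / n            # mean squared error
            dz = sum(2 * err[i] * dec[i] for i in range(2))
            for i in range(2):
                g_dec[i] += 2 * err[i] * z / n
                g_enc[i] += dz * x[i] / n
        history.append(loss)
        enc = [enc[i] - lr * g_enc[i] for i in range(2)]
        dec = [dec[i] - lr * g_dec[i] for i in range(2)]
    return enc, dec, history

# Toy data lying on the line y = 2x, so one latent dimension is sufficient.
data = [[0.5, 1.0], [1.0, 2.0], [-0.5, -1.0], [-1.0, -2.0]]
enc, dec, history = train_linear_autoencoder(data)
```

Because the toy data lies on a single line, the reconstruction loss recorded in `history` falls toward zero as training proceeds. Real alert vectors would need a nonlinear encoder and a larger latent space.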

The feature extraction unit 130 expresses the occurrence probability of each element of each vectorized datum as a sum of per-failure-type probability density functions, estimates the mean and variance of the entire data from it, and generates the compressed data using the estimated mean and variance. For example, given a vector X₁ = (x₁, x₂, x₃, x₄, ..., x_n), the sum of the probability density functions with which the element x₁ of that vector occurs across the K failure types is given by Equation 3 below. Adding up these sums over all elements yields the probability density function of the vector X₁, from which the mean and variance of X₁ are estimated.

(Equation 3)

Here, the symbols in Equation 3 denote a mean and a variance, N is a continuous probability distribution with that mean and that variance, and π_k is the probability that the k-th failure type is selected.
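Equation 3 is not reproduced in this text. The description matches a mixture density of the form p(x) = Σ_k π_k N(x | μ_k, σ_k²), summed over the K failure types, and the Python sketch below evaluates that form. The symbols μ_k and σ_k², and the choice of a Gaussian for N, are assumptions: the source says only that N is a continuous probability distribution with a given mean and variance. The overall mean and variance use the standard mixture identities.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """Weighted sum of per-failure-type densities evaluated at x."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def mixture_mean_var(weights, means, variances):
    """Overall mean and variance of the mixture (law of total variance)."""
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances))
    return mean, second_moment - mean * mean

# Two failure types with equal selection probability (toy values).
weights, means, variances = [0.5, 0.5], [0.0, 2.0], [1.0, 1.0]
mean, var = mixture_mean_var(weights, means, variances)
density_at_one = mixture_pdf(1.0, weights, means, variances)
```

In this toy case the mixture variance is larger than either component variance because the two component means are separated, which is the kind of multi-modal spread a single normal distribution cannot represent.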

In general, the encoder of an autoencoder extracts features from the input data and compresses them to generate compressed data. A conventional encoder compresses the input data on the assumption that the data follows a normal distribution parameterized by a given mean and variance. However, the alert data generated during a network failure does not follow a normal distribution but exhibits a complex distribution, so compressing it the way a conventional encoder does is not appropriate. When alert data that does not follow a normal distribution is compressed under a normal-distribution assumption in the conventional manner, the compressed data of different failure types appears intermixed in terms of Euclidean distance.

In contrast, when, as in this embodiment, the occurrence probability of each element of each vectorized data item is expressed as a sum of probability density functions over the failure types, the mean and variance of the entire data are estimated, and the compressed data is generated using the estimated mean and variance, the compressed data is clustered by failure type. That is, because the compressed data is clustered within Euclidean boundaries, the operator can control the failure type for which data is generated. FIG. 4 compares conventional compressed data with compressed data according to an embodiment of the present invention. FIG. 4(a) shows the result of a conventional encoder compressing data based on a normal distribution: the data is not clustered by failure type but is intermixed regardless of failure type. FIG. 4(b), by contrast, shows compressed data according to an embodiment of the present invention, in which the data is clustered by failure type.

The second data increasing unit 140 adds predetermined data to the cluster region corresponding to a failure type selected by the user, among the regions of the compressed data generated by the encoder of the feature extraction unit 130. Preferably, the second data increasing unit 140 may add the data at an internal division point of a line segment connecting any two data points selected within the interior of the cluster region corresponding to the selected failure type. The second data increasing unit 140 may receive the number of data items to add from the user, or may determine the number automatically based on a predetermined rule. Adding data to the cluster region of the failure type selected by the user allows the final data to be balanced across failure types.
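The internal-division step of the second data increasing unit 140 can be sketched as follows. This is an illustration under stated assumptions: compressed data points are represented as plain tuples, the two points are drawn uniformly at random, and the division ratio is also drawn at random. The specification does not fix these choices, and `augment_cluster` is a hypothetical name.

```python
import random


def augment_cluster(latent_points, count, seed=0):
    """Add `count` points to one failure type's cluster.

    Each new point is an internal division point of the line segment
    between two distinct points drawn from that cluster.
    """
    rng = random.Random(seed)
    added = []
    for _ in range(count):
        a, b = rng.sample(latent_points, 2)
        t = rng.uniform(0.0, 1.0)  # divides the segment in the ratio t : (1 - t)
        added.append(tuple((1.0 - t) * ai + t * bi for ai, bi in zip(a, b)))
    return added


# Hypothetical 2-D compressed data of the failure type selected by the user.
cluster = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
new_points = augment_cluster(cluster, count=5)
```

Because every new point lies on a segment between two existing points of the cluster, it stays inside the region those points span, which is why the added data does not spill into the region of another failure type as long as the cluster itself is well separated.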

When the second data increasing unit 140 adds data to the compressed data generated by the encoder of the feature extraction unit 130, the decoder of the feature extraction unit 130 restores the compressed data, including the added data, and outputs it. The output data of the decoder is used as learning data for the artificial intelligence engine for network failure diagnosis. The data added by the first data increasing unit 120 and the second data increasing unit 140 is suitable as training data for the engine but may be unsuitable as validation data. Therefore, among the learning data output from the decoder, the added data is not used as validation data, and part of the remaining data is used as validation data instead.
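One way to honor this constraint is to carry a flag with each sample that records whether it was added by augmentation. The sketch below assumes such a flag exists and assumes a 20% validation share; both are illustrative choices rather than details from the specification, and `split_learning_data` is a hypothetical helper.

```python
import random


def split_learning_data(samples, validation_ratio=0.2, seed=0):
    """Split (vector, label, is_added) samples into training and validation data.

    Validation data is drawn only from samples that were not added by
    either data increasing unit; added samples always stay in training.
    """
    rng = random.Random(seed)
    original_idx = [i for i, s in enumerate(samples) if not s[2]]
    rng.shuffle(original_idx)
    n_val = int(len(original_idx) * validation_ratio)
    val_idx = set(original_idx[:n_val])
    validation = [samples[i] for i in sorted(val_idx)]
    training = [s for i, s in enumerate(samples) if i not in val_idx]
    return training, validation


# Hypothetical learning data: 10 original samples and 5 added samples.
samples = [((float(i),), "type_a", False) for i in range(10)]
samples += [((float(i) + 0.5,), "type_b", True) for i in range(5)]
training, validation = split_learning_data(samples)
```

Here `validation` holds 2 original samples and `training` holds the other 13, including all 5 added samples.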

According to the embodiment described above, the learning data generating apparatus 100 can reduce the bias of the artificial intelligence engine for network failure diagnosis by increasing the data that is insufficient in its learning data, and can raise not only accuracy but also, to a limited extent, recall. In particular, rather than simply transforming the insufficient data or synthesizing data of an arbitrary type from data compressed through dimensionality reduction, the learning data generating apparatus 100 generates the insufficient data in consideration of the difference in distribution between real data and synthetic data, and can thereby prevent the overfitting of the artificial intelligence engine caused by conventional imbalanced learning data. In extracting and compressing features from the vectorized data through the encoder of the autoencoder, the learning data generating apparatus 100 of the present invention forms boundaries that separate the failure types in terms of Euclidean distance. When the user then selects a region that lacks data, the apparatus writes compressed information of the corresponding failure type into the selected region, thereby generating learning data that is balanced across all failure types.

FIG. 5 is a flowchart illustrating a method of generating learning data according to an embodiment of the present invention.

Referring to FIG. 5, in step S501, the learning data generating apparatus 100 receives and preprocesses alert data collected from various network devices when failures occur. Specifically, the learning data generating apparatus 100 may receive the alert data from a network management system (NMS) that collects alert data generated by various network devices. The learning data generating apparatus 100 may classify the alert data by failure, classify the types of alert data within each failure, perform labeling, convert the result into table form, and store it in a storage unit (e.g., a database). The alert data may include identification information and type information of the network device in which the failure occurred, alert type information, hierarchy level information of that network device, and the like. Alert types include, for example, LOS (Loss of Signal), LOP (Loss of Pointer), and RDI (Remote Defect Indication). Based on the table of alert data generated for each failure, the learning data generating apparatus 100 vectorizes the alert data of each failure into a predetermined dimension. The learning data generating apparatus 100 receives dimension information (e.g., one-dimensional, two-dimensional) from the user and vectorizes the alert data of each failure according to the received dimension information.
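The passage does not specify how the table of alert data is turned into a vector, so the following is only one possible sketch: it assumes a one-dimensional vector that counts alerts per alert type. The record fields (`device_id`, `alarm_type`), the failure-type labels, and the function name are hypothetical.

```python
from collections import Counter

# Alert types named in the text; a real deployment would have many more.
ALARM_TYPES = ["LOS", "LOP", "RDI"]


def vectorize_failure(alerts, alarm_types=ALARM_TYPES):
    """Turn the alerts of one failure into a 1-D count vector (assumed encoding)."""
    counts = Counter(alert["alarm_type"] for alert in alerts)
    return [counts.get(name, 0) for name in alarm_types]


# Hypothetical alert records, already grouped by failure and labeled by failure type.
failures = {
    "failure-001": {
        "label": "fiber_cut",
        "alerts": [
            {"device_id": "R1", "alarm_type": "LOS"},
            {"device_id": "R2", "alarm_type": "LOS"},
            {"device_id": "R3", "alarm_type": "RDI"},
        ],
    },
    "failure-002": {
        "label": "card_fault",
        "alerts": [
            {"device_id": "R7", "alarm_type": "LOP"},
        ],
    },
}

# One table row per failure: (failure id, failure-type label, vector).
table = [(fid, f["label"], vectorize_failure(f["alerts"])) for fid, f in failures.items()]
```

Each row keeps the label next to the vector, so later steps can count how many vectors each failure type has.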

In step S502, the learning data generating apparatus 100 selects, from the data vectorized for each failure, insufficient data that meets a predetermined criterion, and increases the insufficient data by interpolating the selected insufficient data. Preferably, the learning data generating apparatus 100 selects as insufficient data the vectorized data generated from failures of a failure type that has relatively little alert data among the plurality of failure types. For example, it selects the vectorized data of the failure type with the fewest items, or the vectorized data of any failure type whose item count is below a predetermined ratio of the largest count. In interpolating the selected insufficient data, the learning data generating apparatus 100 adds data in both the inner region and the outer region of the insufficient data. Specifically, to add data in the inner region, it adds data at an internal division point of a line segment connecting any two insufficient data points, and to add data in the outer region, it adds data at an external division point of a line segment connecting any two insufficient data points. The learning data generating apparatus 100 may adjust the region in which data is added based on a weight ε received from the user. That is, it may adjust that region by adding the weight ε to, or subtracting it from, each of the p, q, s, and r parameters described above. This makes it possible to increase the asymmetry that helps prevent overfitting of the artificial intelligence engine for network failure diagnosis. The number of data items to add may be determined based on a generation count received from the user. That is, the number of added data items can be adjusted by adjusting, based on that generation count, how many of the possible pairs of insufficient data points are used.
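The selection rule and the two kinds of division points can be sketched with the standard section formulas. The sketch assumes a 50% ratio threshold and treats the division ratio as a pair m:n; how these relate to the p, q, s, and r parameters defined earlier in the specification is not restated here, and applying ε to one side of the ratio is an assumption made for illustration. All function names are hypothetical.

```python
def select_insufficient(counts, ratio=0.5):
    """Failure types whose item count is below `ratio` times the largest count."""
    largest = max(counts.values())
    return [label for label, n in counts.items() if n < ratio * largest]


def internal_point(a, b, m, n):
    """Point dividing segment a-b internally in the ratio m:n (m, n > 0)."""
    return tuple((n * ai + m * bi) / (m + n) for ai, bi in zip(a, b))


def external_point(a, b, m, n):
    """Point dividing segment a-b externally in the ratio m:n (m != n)."""
    return tuple((m * bi - n * ai) / (m - n) for ai, bi in zip(a, b))


# Hypothetical item counts per failure type.
counts = {"fiber_cut": 1000, "power_failure": 120, "card_fault": 700}
minority = select_insufficient(counts, ratio=0.5)

# Two hypothetical insufficient data points.
a, b = (0.0, 0.0), (4.0, 2.0)
eps = 0.5  # weight received from the user (assumed to shift one side of the ratio)

inside = internal_point(a, b, 1.0, 1.0)                # midpoint of the segment
inside_shifted = internal_point(a, b, 1.0 + eps, 1.0)  # pulled toward b by eps
outside = external_point(a, b, 2.0, 1.0)               # lies beyond b on the same line
```

The internal division point falls between the two data points, while the external division point falls on the same line but outside the segment, which is how the added data extends beyond the region the existing insufficient data already covers.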

In step S503, the learning data generating apparatus 100 receives, as input data, the vectorized data including the data added in step S502, extracts features of the input data, clusters them by failure type to generate compressed data, and performs learning that minimizes the difference between the input data and the output data restored from the compressed data. Preferably, the learning data generating apparatus 100 uses an autoencoder. An autoencoder includes an encoder that extracts features from the input data and compresses them, and a decoder that restores the original data from the compressed data and outputs it. It is trained so as to minimize the difference between the output data of the decoder and the input data. The learning data generating apparatus 100 expresses the occurrence probability of each element of each vectorized data item as a sum of probability density functions over the failure types, estimates the mean and variance of the entire data from this, and generates the compressed data using the estimated mean and variance. For example, given a vector X₁ = (x₁, x₂, x₃, x₄, ..., x_n), the sum of the probability density functions with which the element x₁ of that vector occurs in the K failure types is given by (Equation 3) above. Summing these per-element sums over all elements yields the probability density function of the vector X₁, from which the mean and variance of X₁ are estimated.
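The specification does not state how the per-failure-type parameters are obtained. As one simple possibility, assuming the failure-type labels from preprocessing are used directly, the selection probability, mean, and variance for one vector element could be estimated per label as sketched below. The data values and the name `estimate_per_type` are hypothetical.

```python
from collections import defaultdict


def estimate_per_type(values, labels):
    """For one vector element, return {failure type: (weight, mean, variance)}.

    The weight is the share of items with that label, used here as an
    estimate of the probability that the failure type is selected.
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    total = len(values)
    params = {}
    for label, vs in groups.items():
        mean = sum(vs) / len(vs)
        var = sum((v - mean) ** 2 for v in vs) / len(vs)
        params[label] = (len(vs) / total, mean, var)
    return params


# Hypothetical values of element x1 across six vectors of two failure types.
values = [1.0, 3.0, 10.0, 12.0, 14.0, 2.0]
labels = ["A", "A", "B", "B", "B", "A"]
params = estimate_per_type(values, labels)
```

For this data, type A has weight 0.5, mean 2, and variance 2/3, and type B has weight 0.5, mean 12, and variance 8/3. Well-separated means of this kind are what let the compressed data form one cluster per failure type.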

In step S504, the learning data generating apparatus 100 adds predetermined data to the cluster region corresponding to a failure type selected by the user, among the regions of the compressed data generated by the encoder in step S503. Preferably, the learning data generating apparatus 100 may add the data at an internal division point of a line segment connecting any two data points selected within the interior of the cluster region corresponding to the selected failure type. The learning data generating apparatus 100 may receive the number of data items to add from the user, or may determine the number automatically based on a predetermined rule. Adding data to the cluster region of the failure type selected by the user allows the final data to be balanced across failure types.

According to the embodiment described with reference to FIG. 5, the output data finally output from the decoder is used as learning data for the artificial intelligence engine for network failure diagnosis. The data added in the two augmentation steps is suitable as training data for the engine but may be unsuitable as validation data. Therefore, among the learning data output from the decoder, the added data is not used as validation data, and part of the remaining data is used as validation data instead.

Although this specification contains many features, those features should not be construed as limiting the scope of the invention or of the claims. Features described herein in separate embodiments may be implemented in combination in a single embodiment. Conversely, various features described herein in a single embodiment may be implemented separately in multiple embodiments or in any suitable combination.

Although the operations are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in that particular order or in sequential order, or that all of the described operations be performed, in order to obtain the desired result. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments. The program components and systems described above can generally be integrated into a single software product or packaged into multiple software products.

The method of the present invention described above may be implemented as a program and stored in computer-readable form on a recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, etc.). Since a person of ordinary skill in the art to which the present invention pertains can readily carry this out, it is not described in further detail.

The present invention described above is not limited by the foregoing embodiments and the accompanying drawings, since a person of ordinary skill in the art to which the present invention pertains can make various substitutions, modifications, and changes without departing from the technical spirit of the present invention.

100: learning data generating apparatus
110: preprocessing unit
120: first data increasing unit
130: feature extraction unit
140: second data increasing unit

Claims

1. A learning data generating apparatus for generating learning data for network failure diagnosis, comprising:
a preprocessing unit that performs classification and labeling by failure type on alert data collected from network devices and vectorizes the alert data into a predetermined dimension;
a first data increasing unit that selects, from the vectorized data, insufficient data meeting a predetermined criterion and increases the insufficient data by interpolating the selected insufficient data;
a feature extraction unit that receives, as input data, the vectorized data in which the insufficient data has been increased, extracts features of the input data and clusters them by failure type to generate compressed data, and performs learning that minimizes the difference between the input data and output data restored from the compressed data; and
a second data increasing unit that adds predetermined data to a cluster region corresponding to a failure type selected by a user among regions of the compressed data.

2. The apparatus of claim 1,
wherein the first data increasing unit
increases the insufficient data in each of an inner region and an outer region of the insufficient data by interpolating the selected insufficient data.

3. The apparatus of claim 2,
wherein the first data increasing unit,
in increasing the insufficient data in the inner region, adds data at an internal division point of a line segment connecting two insufficient data points, and,
in increasing the insufficient data in the outer region, adds data at an external division point of a line segment connecting two insufficient data points.

4. The apparatus of claim 2,
wherein the first data increasing unit
adjusts the inner region and the outer region by applying a weight.

5. The apparatus of claim 1,
wherein the feature extraction unit
expresses the occurrence probability of each element of the vectorized data in which the insufficient data has been increased as a sum of probability density functions over the failure types, estimates the mean and variance of the entire data, and generates the compressed data using the estimated mean and variance.

6. The apparatus of claim 1,
wherein the second data increasing unit
adds the predetermined data at an internal division point of a line segment connecting any two data points selected within an inner region of the cluster region corresponding to the selected failure type.

7. A method by which a learning data generating apparatus generates learning data for network failure diagnosis, the method comprising:
performing classification and labeling by failure type on alert data collected from network devices and vectorizing the alert data into a predetermined dimension;
selecting, from the vectorized data, insufficient data meeting a predetermined criterion and increasing the insufficient data by interpolating the selected insufficient data;
receiving, as input data, the vectorized data in which the insufficient data has been increased, extracting features of the input data and clustering them by failure type to generate compressed data, and performing learning that minimizes the difference between the input data and output data restored from the compressed data; and
adding predetermined data to a cluster region corresponding to a failure type selected by a user among regions of the compressed data.

8. The method of claim 7,
wherein the step of increasing the insufficient data comprises
increasing the insufficient data in each of an inner region and an outer region of the insufficient data by interpolating the selected insufficient data.

9. The method of claim 8,
wherein the step of increasing the insufficient data comprises,
in increasing the insufficient data in the inner region, adding data at an internal division point of a line segment connecting two insufficient data points, and,
in increasing the insufficient data in the outer region, adding data at an external division point of a line segment connecting two insufficient data points.

10. The method of claim 8,
wherein the step of increasing the insufficient data comprises
adjusting the inner region and the outer region by applying a weight.

11. The method of claim 7,
wherein the step of performing the learning comprises
expressing the occurrence probability of each element of the vectorized data in which the insufficient data has been increased as a sum of probability density functions over the failure types, estimating the mean and variance of the entire data, and generating the compressed data using the estimated mean and variance.

12. The method of claim 7,
wherein the step of adding the data comprises
adding the predetermined data at an internal division point of a line segment connecting any two data points selected within an inner region of the cluster region corresponding to the selected failure type.

13. A computer program stored on a computer-readable recording medium for executing, through a computer system, the method according to any one of claims 7 to 12.