KR20230061925A

KR20230061925A - Apparatus and Method for Training Network Intrusion Detection Model Based on Extended Training Data

Info

Publication number: KR20230061925A
Application number: KR1020210146676A
Authority: KR
Inventors: 박철희; 김영수; 김현진; 박종근; 이종훈
Original assignee: 한국전자통신연구원
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2023-05-09

Abstract

An apparatus and method for training a network intrusion threat detection model based on extended training data are disclosed. The apparatus according to an embodiment of the present invention includes a memory in which at least one program is recorded and a processor that executes the program. The program may include the steps of training an autoencoder on extended training data to which synthetic data generated by a generative adversarial network has been added, extracting the encoder of the trained autoencoder, and training, with the extended training data, a network intrusion threat detection model that has the encoder placed at its front end.

Description

Apparatus and Method for Training Network Intrusion Detection Model Based on Extended Training Data

The disclosed embodiments relate to technology for detecting cyber intrusion threats by analyzing system logs and security events generated by network security equipment.

As network communication technology advances, distributed network architectures are taking shape, and the access environment is diversifying to include not only dedicated communication devices but also many kinds of sensors and embedded devices. As a result, the attack surface of communication networks is widening and cyber security threats are increasing exponentially.

Intrusion detection systems (IDS) are deployed as an essential security element of network configurations in order to detect such cyber attacks. However, traditional intrusion detection systems based on signatures and rules fail to detect new, increasingly intelligent and sophisticated cyber attacks.

Accordingly, research on applying artificial intelligence to intrusion detection systems in order to detect unknown cyber attacks is being actively conducted. An AI-based network intrusion detection system uses machine learning and deep learning to learn from previously collected security event data, and the trained model is then used to detect new security threats that may arise later. Examples of AI models widely used in such network intrusion detection systems include decision trees, support vector machines (SVM), and deep neural networks (DNN).

These AI models show high detection performance in intrusion detection systems, but because training depends heavily on the distribution of the data, performance may degrade significantly when the training data is biased toward specific labels. In network data in particular, most real-world network flows are normal, and security threat events occur only rarely. Moreover, certain types of attack appear extremely rarely even within the category of security threats, so the model cannot learn them sufficiently, which can substantially reduce the performance of the intrusion detection system. In other words, not only the size of the training data but also its balance across labels (types) is an important factor in the performance of an AI model.

The disclosed embodiments aim to resolve the data imbalance problem and improve detection performance in an AI-based network intrusion detection system.

A training data extension method according to an embodiment may include dividing training data obtained in advance from network flows into a predetermined number of classes according to data type, training a generative adversarial network model for each class with the training data of that class, generating synthetic data using the generators included in the trained generative adversarial network models, and merging the generated synthetic data into the training data.

Here, the generative adversarial network model consists of a generator that receives a latent code and generates forged synthetic data, and a discriminator that compares the synthetic data with the training data to determine whether it is genuine. The generator may consist of a decoder that generates synthetic data from the latent code, and the discriminator may consist of an autoencoder that includes an encoder for extracting features from the training data and a decoder for reconstructing the training data from the extracted features.

Here, the training step may be repeated until the reconstruction error and the discrimination error are less than or equal to predetermined thresholds.

Here, the step of generating synthetic data may adjust the proportion generated for each class according to the per-class distribution in the training data.

Here, the training data extension method according to the embodiment may further include labeling the generated synthetic data by data type.

Here, the label may be either normal or one of at least one attack type.

Here, the labeling step may label the data in one-hot format.

Here, the merging step may remove outliers from the synthetic data before merging.

A method for training a network intrusion threat detection model based on extended training data according to an embodiment may include training an autoencoder on extended training data to which synthetic data generated by a generative adversarial network has been added, extracting the encoder of the trained autoencoder, and training, with the extended training data, a deep learning-based detection model that has the encoder placed at its front end.

Here, the training step may update only the parameters of the deep learning-based detection model.

An apparatus for training a network intrusion threat detection model based on extended training data according to an embodiment includes a memory in which at least one program is recorded and a processor that executes the program. The program may include training an autoencoder on extended training data to which synthetic data generated by a generative adversarial network has been added, extracting the encoder of the trained autoencoder, and training, with the extended training data, a deep learning-based detection model that has the encoder placed at its front end.

Here, in generating the synthetic data, the program may perform the steps of dividing training data obtained in advance from network flows into a predetermined number of classes according to data type, training a generative adversarial network model for each class with the training data of that class, generating synthetic data using the generators included in the trained generative adversarial network models, and merging the generated synthetic data into the training data.

Here, the program may repeat the step of training the generative adversarial network model until the reconstruction error and the discrimination error are less than or equal to predetermined thresholds.

Here, in the step of generating the synthetic data, the program may adjust the proportion generated for each class according to the per-class distribution in the training data.

Here, the program may further include the step of labeling the generated synthetic data by data type.

Here, the label may be either normal or one of at least one attack type.

According to the disclosed embodiments, the performance of AI-based network intrusion detection systems, which have recently come into wide use, can be maximized.

According to the disclosed embodiments, resolving the data imbalance problem with a BEGAN model that applies both reconstruction error and discrimination error makes it possible to detect potential security threats that may arise later.

According to the disclosed embodiments, placing an autoencoder model trained on the extended training data at the front of the intrusion detection model can maximize the performance of existing AI-based intrusion detection models.

FIG. 1 is an exemplary diagram of a generative adversarial network model according to an embodiment.
FIG. 2 is a structural diagram of an autoencoder according to an embodiment.
FIG. 3 is a flowchart illustrating a training data extension method according to an embodiment.
FIGS. 4 and 5 are diagrams illustrating synthetic data extension according to an embodiment.
FIG. 6 is a flowchart illustrating a method for training a network intrusion threat detection model based on extended training data according to an embodiment.
FIG. 7 is an exemplary diagram illustrating a process of training a network intrusion threat detection model based on extended training data according to an embodiment.
FIG. 8 is a diagram showing the configuration of a computer system according to an embodiment.

The advantages and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the present invention belongs, and the present invention is defined only by the scope of the claims. Like reference numerals designate like elements throughout the specification.

Although terms such as "first" and "second" are used to describe various components, the components are not limited by these terms. The terms are used only to distinguish one component from another. Therefore, a first component mentioned below may also be a second component within the technical spirit of the present invention.

The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, singular forms also include plural forms unless the context specifically states otherwise. As used in the specification, "comprises" or "comprising" means that a stated component or step does not preclude the presence or addition of one or more other components or steps.

Unless otherwise defined, all terms used in this specification may be interpreted with the meanings commonly understood by those of ordinary skill in the art to which the present invention belongs. In addition, terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless they are explicitly and specifically defined.

Hereinafter, an apparatus and method for training a network intrusion threat detection model based on extended training data according to an embodiment are described in detail with reference to FIGS. 1 to 8.

The disclosed embodiments propose a technique that resolves the imbalance in the training data used to train an AI-based network intrusion detection model, and thereby improves intrusion detection performance compared to existing deep learning models.

To this end, the disclosed embodiments propose a method of resolving the imbalance in the training data by artificially synthesizing scarce training data using a generative adversarial network (GAN). This is described in detail below with reference to FIGS. 3 to 5.

In addition, the disclosed embodiments propose a method of improving network anomaly detection performance by training an autoencoder model and a network intrusion threat detection model on training data extended with the artificially generated synthetic data. This is described in detail below with reference to FIGS. 6 and 7.

First, the generative adversarial network and the autoencoder used in the disclosed embodiments are reviewed.

FIG. 1 is an exemplary diagram of a generative adversarial network model according to an embodiment, and FIG. 2 is a structural diagram of an autoencoder.

Referring to FIG. 1, a generative adversarial network (hereinafter "GAN") model 100 may consist of a generator 110 and a discriminator 120 that have mutually adversarial objectives.

The generator 110 may receive a latent code and generate forged synthetic data. The discriminator 120 may compare the synthetic data with the training data 10 to determine whether the synthetic data is genuine.

That is, the generator 110 aims to deceive the discriminator 120, and the discriminator 120 aims to accurately distinguish whether its input data is real or synthetic. When the GAN model 100 has been trained properly, the generator 110 can generate synthetic data so similar to the actual training data that the discriminator 120 cannot tell whether it is genuine.

However, approaches based on such a GAN model cannot guarantee the performance of the generator 110 because of a problem known as mode collapse. Mode collapse is a problem in which the distribution of synthetic data produced by the generator 110 of the GAN model loses diversity and converges to a very small region. When mode collapse occurs, the generator 110 outputs only a specific type of data.

To solve this problem, as shown in FIG. 1, the embodiment uses a BEGAN (Boundary Equilibrium GAN) based on an autoencoder 120, and trains the generator 110 on the reconstruction error as well as the discrimination error during training, thereby ensuring the diversity of the synthetic data that the generator 110 produces.

Referring to FIG. 2, the autoencoder 120 may consist of an encoder 121 that extracts (summarizes) meaningful features from the training data and a decoder 122 that reconstructs data from the extracted (summarized) features.

Accordingly, in the BEGAN 100 used in the embodiment, as shown in FIG. 1, the discriminator 120 has the structure of an autoencoder and the generator 110 has the structure of an autoencoder's decoder. In addition, the BEGAN 100 is trained on the reconstruction error between the training data and the synthetic data, so the generator 110 generates data in a direction that minimizes the reconstruction error.
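The description does not spell out the loss functions. For reference, the objective from the original BEGAN formulation is reproduced below, as an assumption about what "reconstruction error" refers to here. The symbols $\mathcal{L}$, $k_t$, $\lambda_k$, $\gamma$, and $\eta$ come from that formulation and do not appear in this document.

```latex
\begin{aligned}
\mathcal{L}(v) &= \lvert v - D(v) \rvert^{\eta}, \quad \eta \in \{1, 2\}
  && \text{reconstruction error of the autoencoder discriminator } D \\
\mathcal{L}_D &= \mathcal{L}(x) - k_t \, \mathcal{L}(G(z))
  && \text{discriminator loss on real data } x \text{ and synthetic data } G(z) \\
\mathcal{L}_G &= \mathcal{L}(G(z))
  && \text{generator loss} \\
k_{t+1} &= k_t + \lambda_k \bigl( \gamma \, \mathcal{L}(x) - \mathcal{L}(G(z)) \bigr)
  && \text{balance term, } k_0 = 0,\ k_t \in [0, 1]
\end{aligned}
```

In that formulation, $\gamma \in [0, 1]$ is the diversity ratio: lower values favor fidelity, higher values favor diversity of the generated samples, which is the property the embodiment relies on to counter mode collapse.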

FIG. 3 is a flowchart illustrating a training data extension method according to an embodiment, FIG. 4 is a diagram illustrating generator training according to an embodiment, and FIG. 5 is a diagram illustrating synthetic data generation using the trained generators according to an embodiment.

Referring to FIG. 3, the training data extension method according to the embodiment may include dividing training data obtained in advance from network flows into a predetermined number of classes according to data type (S210), training a generative adversarial network model for each class with the training data of that class (S220), generating synthetic data using the generators included in the trained generative adversarial network models (S240 to S250), and merging the generated synthetic data into the training data (S270 to S280).

That is, in the dividing step (S210) according to the embodiment, as illustrated in FIG. 4, the network flow training data set 10 may be divided by class into CLASS_1 data 10-1, CLASS_2 data 10-2, ..., CLASS_N data 10-N.
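The dividing step (S210) can be sketched as a simple grouping by label. This is a minimal illustration only: the record layout (a feature list paired with a label string), the function name `split_by_class`, and the sample values are assumptions, not part of the disclosure.

```python
from collections import defaultdict


def split_by_class(records):
    """Group (features, label) pairs into one list of feature rows per label."""
    per_class = defaultdict(list)
    for features, label in records:
        per_class[label].append(features)
    return dict(per_class)


# Hypothetical network-flow records: (feature vector, class label).
flows = [
    ([0.1, 0.2], "normal"),
    ([0.9, 0.8], "dos"),
    ([0.2, 0.1], "normal"),
    ([0.5, 0.7], "probe"),
]

# One subset per class; each subset would train its own GAN model in S220.
subsets = split_by_class(flows)
```

Each entry of `subsets` plays the role of one of the CLASS_1 to CLASS_N data sets.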

Then, in the training step (S220), the first BEGAN model 200-1 may be trained with the CLASS_1 data 10-1, the second BEGAN model 200-2 with the CLASS_2 data 10-2, and the Nth BEGAN model 200-N with the CLASS_N data 10-N.

Here, the training step (S220) may be repeated until the reconstruction error and the discrimination error are less than or equal to predetermined thresholds.

Here, the threshold for the reconstruction error may be 0.05, and the threshold for the discrimination error may be 0.5.

That is, referring to FIG. 3, when the reconstruction error and the discrimination error are at or below the predetermined thresholds in S230, the generator included in the corresponding generative adversarial network model is extracted for use in generating synthetic data (S240). The generator extracted in S240 has been sufficiently trained and can generate synthetic data similar to real data.
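The check in S230 can be sketched as a loop that stops once both errors reach their thresholds. The thresholds 0.05 and 0.5 are the example values given in the description. The `train_step` callback, the `max_steps` safeguard, and the toy error curves are hypothetical stand-ins for a real BEGAN training step.

```python
RECON_THRESHOLD = 0.05  # example reconstruction error threshold from the description
DISC_THRESHOLD = 0.5    # example discrimination error threshold from the description


def train_until_converged(train_step, max_steps=10_000):
    """Run train_step() until both errors are at or below their thresholds.

    train_step is a hypothetical callback that performs one training update
    and returns (reconstruction_error, discrimination_error).
    Returns (steps_taken, converged).
    """
    for step in range(1, max_steps + 1):
        recon_err, disc_err = train_step()
        if recon_err <= RECON_THRESHOLD and disc_err <= DISC_THRESHOLD:
            return step, True
    return max_steps, False  # assumption: give up after max_steps


def make_toy_step():
    """Toy stand-in for a GAN update: both errors simply decay each call."""
    state = {"recon": 0.3, "disc": 0.9}

    def step():
        state["recon"] *= 0.5
        state["disc"] *= 0.8
        return state["recon"], state["disc"]

    return step


steps, converged = train_until_converged(make_toy_step())
```

Once the loop reports convergence, the generator of that model would be extracted (S240) and used for generation.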

Accordingly, in the step of generating synthetic data (S250), as illustrated in FIG. 5, the generator 110-1 may generate CLASS_1 synthetic data, the generator 110-2 may generate CLASS_2 synthetic data, and the generator 110-N may generate CLASS_N synthetic data.

Here, the step of generating synthetic data (S250) may adjust the proportion generated for each class according to the per-class distribution in the training data 10. This accounts for the imbalance of the training data: a smaller proportion is generated for data types that are already well represented in the training data, and a larger proportion is generated for scarce data types.
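One possible way to set the per-class generation amount is sketched below. The description only states that scarce classes receive a larger share, so the specific rule used here (top every class up to the size of the largest class) is an assumption, as are the function name `synthetic_counts` and the sample counts.

```python
def synthetic_counts(class_counts, target=None):
    """Return how many synthetic samples to generate for each class.

    Assumed rule: raise every class to `target` samples, where `target`
    defaults to the size of the largest class. Well-represented classes
    therefore get few or no synthetic samples and rare classes get many.
    """
    if target is None:
        target = max(class_counts.values())
    return {label: max(target - n, 0) for label, n in class_counts.items()}


# Hypothetical class distribution of the original training data.
counts = {"normal": 9000, "dos": 800, "probe": 150}
plan = synthetic_counts(counts)
```

With these example counts, no normal samples are generated, while the DoS and Probe generators are asked for 8200 and 8850 samples respectively.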

Here, the training data extension method according to the embodiment may further include labeling the generated synthetic data by data type (S260). That is, as shown in FIG. 5, the synthetic data produced by each of the extracted generators 110-1, 110-2, ..., 110-N belongs to a single type, and it is labeled so that it can later be used as training data for the detection model.

Here, the label may be either normal or one of at least one attack type. The attack types may include, for example, DoS and Probe.

Table 1 below shows an example of the one-hot encoding format.

     Normal  Type 1  Type 2  Type 3  Type 4
0    0       0       1       0       0
1    0       1       0       0       0
2    0       0       0       1       0
3    0       0       0       0       1
4    0       0       0       1       0
5    0       0       0       0       1
6    1       0       0       0       0
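The one-hot labeling of Table 1 can be reproduced with a few lines of code. This is a minimal sketch: the label strings and the function name `one_hot` are illustrative assumptions, and the column order follows the table (normal first, then the four attack types).

```python
# Column order as in Table 1: normal, then the attack types.
LABELS = ["normal", "type1", "type2", "type3", "type4"]


def one_hot(label, labels=LABELS):
    """Return a list with a 1 at the position of `label` and 0 elsewhere."""
    vec = [0] * len(labels)
    vec[labels.index(label)] = 1
    return vec


# Labels of the seven example rows (indices 0 to 6) in Table 1.
sample_labels = ["type2", "type1", "type3", "type4", "type3", "type4", "normal"]
rows = [one_hot(label) for label in sample_labels]
```

Because each generator produces data of a single type, every sample from one generator receives the same one-hot vector.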

Then, in the merging step (S270 to S280), outliers are removed from the labeled synthetic data (S270), and the result is merged with the training data (S280).
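The description does not say how outliers are identified. As one hedged possibility, the sketch below drops any synthetic row that has a feature more than three standard deviations away from the real data of the same class, then appends the remaining rows. The z-score rule, the cutoff `z_max`, the function name, and the sample values are all assumptions.

```python
from statistics import mean, pstdev


def remove_outliers(synthetic, real, z_max=3.0):
    """Keep synthetic rows whose every feature lies within z_max standard
    deviations of the real data's per-feature mean (assumed criterion)."""
    columns = list(zip(*real))
    mus = [mean(col) for col in columns]
    # Fall back to 1.0 when a feature has zero spread, to avoid dividing by zero.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    kept = []
    for row in synthetic:
        if all(abs(x - m) / s <= z_max for x, m, s in zip(row, mus, sigmas)):
            kept.append(row)
    return kept


# Hypothetical real and synthetic samples of a single class.
real = [[1.0, 10.0], [1.2, 11.0], [0.8, 9.0], [1.1, 10.5]]
synthetic = [[1.05, 10.2], [0.9, 9.5], [5.0, 10.0]]

clean = remove_outliers(synthetic, real)  # S270: drop outliers
extended = real + clean                   # S280: merge into the training data
```

Here the third synthetic row is discarded because its first feature is far outside the range of the real data, and the other two rows are merged.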

FIG. 6 is a flowchart illustrating a method for training a network intrusion threat detection model based on extended training data according to an embodiment, and FIG. 7 is an exemplary diagram illustrating a process of training a network intrusion threat detection model based on extended training data according to an embodiment.

Referring to FIGS. 6 and 7, the method for training a network intrusion threat detection model based on extended training data according to an embodiment may include training an autoencoder 410 on extended training data 20 to which synthetic data generated by a generative adversarial network has been added (S310 to S320), extracting the encoder 411 of the trained autoencoder 410 (S330), and training, with the extended training data 20, a deep learning-based detection model 421 that has the encoder 411 placed at its front end (S340 to S360).

Here, the deep learning-based detection model 421 may consist of machine learning and deep learning models.

Here, the training step (S360) may update only the parameters of the deep learning-based detection model 421.

That is, because training of the encoder 411 was completed in the step of training the autoencoder 410 (S320), the parameters of the encoder 411 are not updated in S360.
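The effect of keeping the encoder fixed in S360 can be illustrated with a deliberately tiny model. This is a conceptual sketch only: the scalar "encoder" and "detector", the squared-error loss, and the learning rate are assumptions chosen to keep the example self-contained, not the models of the embodiment.

```python
def encode(x, w_enc):
    """Stand-in for the pre-trained encoder: a single fixed weight."""
    return w_enc * x


def detect(h, w_det):
    """Stand-in for the detection model that sits behind the encoder."""
    return w_det * h


def train_detector(samples, w_enc, w_det, lr=0.1, epochs=50):
    """Gradient descent on squared error that updates only w_det."""
    for _ in range(epochs):
        for x, target in samples:
            h = encode(x, w_enc)               # forward pass through the frozen encoder
            y = detect(h, w_det)
            grad_det = 2.0 * (y - target) * h  # d(loss)/d(w_det)
            w_det -= lr * grad_det             # the detector is updated
            # w_enc is intentionally left untouched
    return w_enc, w_det


# Hypothetical (input, target) pairs.
samples = [(1.0, 1.0), (2.0, 2.0)]
w_enc, w_det = train_detector(samples, w_enc=0.5, w_det=0.0)
```

In a deep learning framework the same idea usually amounts to excluding the encoder's parameters from the optimizer so that only the detection model's weights change.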

As described above, the trained deep learning-based detection model 421 with the encoder 411 placed at its front end is used as the AI-based network intrusion threat detection model 420. According to the embodiment, applying the trained autoencoder to the detection model can maximize the performance of existing AI-based network intrusion detection systems.

FIG. 8 is a diagram showing the configuration of a computer system according to an embodiment.

The apparatus for training a network intrusion threat detection model based on extended training data according to an embodiment may be implemented in a computer system 1000 such as a computer-readable recording medium.

The computer system 1000 may include one or more processors 1010, a memory 1030, a user interface input device 1040, a user interface output device 1050, and a storage 1060, which communicate with each other through a bus 1020. The computer system 1000 may further include a network interface 1070 connected to a network 1080. The processor 1010 may be a central processing unit or a semiconductor device that executes programs or processing instructions stored in the memory 1030 or the storage 1060. The memory 1030 and the storage 1060 may be storage media including at least one of volatile media, non-volatile media, removable media, non-removable media, communication media, and information delivery media. For example, the memory 1030 may include a ROM 1031 or a RAM 1032.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention belongs will understand that the present invention may be implemented in other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

10: training data 100: generative adversarial network model
110: generator, decoder 120: discriminator, autoencoder
121: encoder 122: decoder
20: extended training data 410: autoencoder
411: encoder 421: network intrusion threat detection model

Claims

A method for extending training data, comprising:
dividing training data obtained in advance from network flows into a predetermined number of classes according to data type;
training, with the training data, a generative adversarial network model corresponding to each of the divided classes;
generating synthetic data using the generators included in the trained generative adversarial network models; and
merging the generated synthetic data into the training data.

The method of claim 1, wherein the generative adversarial network model comprises:
a generator that receives a latent code and generates forged synthetic data; and
a discriminator that compares the synthetic data with the training data to determine authenticity,
wherein the generator consists of a decoder that generates the synthetic data from the latent code, and
wherein the discriminator consists of an autoencoder including an encoder that extracts features from the training data and a decoder that reconstructs the training data from the extracted features.

The method of claim 2, wherein the training is repeated until a reconstruction error and a discrimination error are less than or equal to a predetermined threshold.

The method of claim 1, wherein generating the synthetic data comprises adjusting the generation ratio according to the per-class distribution of the training data.
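
As an illustration of adjusting the amount of generated data by class distribution, the following minimal Python sketch computes how many synthetic samples to request from each class's generator. The balancing rule (top every class up to the size of the largest class) and the names `generation_counts` and `target` are assumptions made for illustration; the claim does not fix a particular rule.

```python
from collections import Counter

def generation_counts(labels, target=None):
    """Return {class: number of synthetic samples to generate}.

    Assumption (not from the claim): each class is topped up to `target`
    samples, where `target` defaults to the size of the largest class.
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())
    # Classes already at or above the target get no synthetic samples.
    return {cls: max(target - n, 0) for cls, n in counts.items()}

# Hypothetical class labels for an imbalanced flow data set.
labels = ["normal"] * 6 + ["dos"] * 3 + ["probe"] * 1
plan = generation_counts(labels)
print(plan)
```

Under this assumed rule, minority attack classes receive most of the synthetic samples, while a class that already matches the target receives none.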

The method of claim 1, further comprising labeling the generated synthetic data by data type.

The method of claim 5, wherein the label is one of normal and at least one type of attack.

The method of claim 6, wherein the labeling is performed in a one-hot format.
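
One-hot labeling of synthetic data can be sketched as follows. The class names, the `one_hot` helper, and the idea that each synthetic sample takes the class of the generator that produced it are assumptions for illustration only.

```python
def one_hot(label, classes):
    """Encode `label` as a one-hot list over the ordered `classes`."""
    if label not in classes:
        raise ValueError(f"unknown label: {label!r}")
    return [1 if cls == label else 0 for cls in classes]

# Hypothetical label set: "normal" plus two attack types.
CLASSES = ["normal", "dos", "probe"]

# Assumption: each synthetic sample is labeled with the class of the
# per-class generator that produced it.
synthetic_labels = ["dos", "normal", "probe"]
encoded = [one_hot(lbl, CLASSES) for lbl in synthetic_labels]
print(encoded)
```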

The method of claim 1, wherein the merging comprises removing outliers from the synthetic data and then merging the remaining synthetic data into the training data.
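
The outlier-removal step before merging might look like the sketch below. The claim does not name an outlier criterion, so the per-feature z-score rule, the threshold `z_max`, and the function name `remove_outliers` are all assumptions.

```python
from statistics import mean, pstdev

def remove_outliers(real, synthetic, z_max=3.0):
    """Keep synthetic rows whose every feature lies within `z_max`
    standard deviations of the real data's per-feature mean.

    Assumption: a z-score rule; the claim does not name a criterion.
    """
    columns = list(zip(*real))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # guard zero variance
    kept = []
    for row in synthetic:
        if all(abs(x - mu) <= z_max * sd for x, mu, sd in zip(row, mus, sigmas)):
            kept.append(row)
    return kept

# Hypothetical two-feature samples.
real = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
synthetic = [[2.5, 11.0], [50.0, 12.0], [1.5, 13.0]]

# Merge only the synthetic rows that pass the filter.
extended = real + remove_outliers(real, synthetic)
print(len(extended))
```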

A method for training a network intrusion threat detection model based on extended training data, comprising:
training an autoencoder based on extended training data to which synthetic data generated by a generative adversarial network has been added;
extracting the encoder of the trained autoencoder; and
training, with the extended training data, a network intrusion threat detection model having the encoder placed at its front end.

The method of claim 7, wherein, in the training, only the parameters of the network intrusion threat detection model are updated.
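
Updating only the detection model's parameters while the front-end encoder stays fixed can be illustrated with a small pure-Python sketch. The linear encoder, the logistic-regression head standing in for the detection model, the learning rate, and the toy data are all assumptions; the source does not specify these details.

```python
import math

def encode(x, enc_w):
    """Frozen encoder: a fixed linear map standing in for the trained encoder."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in enc_w]

def predict(x, enc_w, head_w, head_b):
    """Probability of the attack class from the head applied to encoded x."""
    z = encode(x, enc_w)
    logit = sum(w * zi for w, zi in zip(head_w, z)) + head_b
    return 1.0 / (1.0 + math.exp(-logit))

def train_head(data, enc_w, head_w, head_b, lr=0.5, epochs=200):
    """Gradient descent on the logistic loss; only head_w and head_b change."""
    head_w = list(head_w)
    for _ in range(epochs):
        for x, y in data:
            z = encode(x, enc_w)
            err = predict(x, enc_w, head_w, head_b) - y  # dLoss/dlogit
            head_w = [w - lr * err * zi for w, zi in zip(head_w, z)]
            head_b -= lr * err
            # enc_w is deliberately never updated.
    return head_w, head_b

enc_w = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical pretrained encoder weights
data = [([0.0, 0.0], 0), ([0.1, 0.2], 0), ([1.0, 0.9], 1), ([0.9, 1.0], 1)]
enc_before = [row[:] for row in enc_w]
head_w, head_b = train_head(data, enc_w, [0.0, 0.0], 0.0)
print(predict([1.0, 0.9], enc_w, head_w, head_b) > 0.5)
```

In a deep learning framework the same effect is usually obtained by excluding the encoder's parameters from the optimizer or by disabling their gradients.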

An apparatus for training a network intrusion threat detection model based on extended training data, comprising:
a memory in which at least one program is recorded; and
a processor that executes the program,
wherein the program performs:
training an autoencoder based on extended training data to which synthetic data generated by a generative adversarial network has been added;
extracting the encoder of the trained autoencoder; and
training, with the extended training data, a network intrusion threat detection model having the encoder placed at its front end.

The apparatus of claim 11, wherein, in generating the synthetic data, the program performs:
dividing training data obtained in advance from network flows into a predetermined number of classes according to data type;
training, with the training data, a generative adversarial network model corresponding to each of the divided classes;
generating synthetic data using the generators included in the trained generative adversarial network models; and
merging the generated synthetic data into the training data.

The apparatus of claim 11, wherein the generative adversarial network model comprises:
a generator that receives a latent code and generates forged synthetic data; and
a discriminator that compares the synthetic data with the training data to determine authenticity,
wherein the generator consists of a decoder that generates the synthetic data from the latent code, and
wherein the discriminator consists of an autoencoder including an encoder that extracts features from the training data and a decoder that reconstructs the training data from the extracted features.

The apparatus of claim 12, wherein the program repeats the training of the generative adversarial network model until a reconstruction error and a discrimination error are less than or equal to a predetermined threshold.

The apparatus of claim 12, wherein, in generating the synthetic data, the program adjusts the generation ratio according to the per-class distribution of the training data.

The apparatus of claim 12, wherein the program further performs labeling the generated synthetic data by data type.

The apparatus of claim 16, wherein the label is one of normal and at least one type of attack.

The apparatus of claim 16, wherein the labeling is performed in a one-hot format.

The apparatus of claim 12, wherein the merging comprises removing outliers from the synthetic data and then merging the remaining synthetic data into the training data.

The apparatus of claim 11, wherein the training updates only the parameters of the network intrusion threat detection model.