KR102279983B1

KR102279983B1 - Network Intrusion Detection Method using unsupervised deep learning algorithms and Computer Readable Recording Medium on which program therefor is recorded

Info

Publication number: KR102279983B1
Application number: KR1020180171484A
Authority: KR
Inventors: 정윤경; 김동민
Original assignee: 성균관대학교산학협력단
Priority date: 2018-12-28
Filing date: 2018-12-28
Publication date: 2021-07-21
Also published as: KR20200087299A

Abstract

The present invention relates to an unsupervised network intrusion detection method using a deep learning algorithm and to a recording medium on which a program for executing the method is recorded.
An unsupervised network intrusion detection method using a deep learning algorithm according to an example of the present invention, and a recording medium on which a program for executing the method is recorded, includes: a preprocessing step of preprocessing a data packet input from the outside by one-hot encoding to generate an original vector expressed as vector values; a compression and restoration step of compressing the original vector to a lower dimension with an autoencoder and then restoring it to the original dimension to generate a restored vector expressed as vector values; a loss value calculation step of calculating a loss value from the difference between the original vector and the restored vector; and a determination step of determining the data packet to be abnormal data if the calculated loss value is greater than a threshold and to be normal data if the calculated loss value is less than the threshold. The threshold is determined using a receiver operating characteristic (ROC) curve, and a point on the ROC curve having a high true positive rate and a low false positive rate may be selected as the threshold.

Description

Network Intrusion Detection Method using unsupervised deep learning algorithms and Computer Readable Recording Medium on which program therefor is recorded

The present invention relates to an unsupervised network intrusion detection method using a deep learning algorithm and to a recording medium on which a program for executing the method is recorded.

Intrusion detection systems (IDS) are used to protect systems and information assets from various types of network attacks.

Typically, experts analyze the traffic and logs collected through an IDS to identify the exact cause and pattern of an attack. This analysis can take from several days to several months, depending on the type of threat, and a delayed response greatly increases the recovery cost.

Accordingly, research is under way on automating security threat detection with machine learning techniques to address the intrusion detection problem.

Machine learning techniques are divided into supervised learning and unsupervised learning.

When the data contains several classes to be distinguished, supervised learning algorithms learn classification patterns from pre-annotated class labels.

Unsupervised learning algorithms, on the other hand, learn classification patterns without class labels. IDS techniques based on machine learning use signature-based methods and anomaly-based methods.

Signature-based methods learn intrusion patterns with supervised learning algorithms (SVM, decision trees, neural networks, KNN, etc.).

Because labeled data sets such as KDD Cup 99 and Kyoto 2006+ are available, signature-based approaches have been studied extensively.

Roshan and Huang [2014] developed an intrusion detection system based on an incremental SVM algorithm. Mohammed et al. [2016] developed an IDS using a least-squares support vector machine (SVM) with a mutual-information-based algorithm.

Sahu and Mehtre [2015] analyzed the performance of an intrusion detection system based on the J48 decision tree algorithm.

Kazuya et al. [2011] proposed an intrusion detection system that combines multiple classifiers, each trained on a different traffic data set, to cope with dynamic changes in network traffic trends. Beaver et al. [2013] proposed an intrusion detection system based on adaptive boosting of multiple classifiers.

However, signature-based approaches are of limited practical use. First, it is difficult to detect new types of attacks whose patterns do not belong to the pre-trained attack classes. Second, updating the classifier requires labeled data, which in turn requires non-trivial manual work.

Many anomaly-based approaches have been proposed to address the shortcomings of the signature-based approach. Anomaly detection using one-class SVM and clustering methods treats the majority of the data, which lies densely within the normal region, as normal and treats the few outliers as abnormal.

Song et al. [2011b] proposed a one-class SVM and obtained a detection rate of 93.5% and a false positive rate of 7.33% when applying it to the Kyoto 2006+ data set.

Ishida et al. [2011] proposed an intrusion detection scheme that identifies attack traffic by combining OptiGrid clustering with a grid-based cluster labeling algorithm.

Hassen and Bourouis [2015] tested the KDD'99 and Kyoto 2006+ data sets with an online self-learning SVM and obtained an F1-score and an accuracy of 0.98.

Several papers have applied autoencoders to network anomaly detection and verified their effectiveness in terms of feature extraction and classification. This section first describes previous work that uses autoencoders and then introduces some work that uses deep learning algorithms for IDS.

Autoencoder algorithms have been used in IDS mainly to reduce the feature dimension for efficient computation. For example, Javaid et al. [2016] developed a network IDS by combining a sparse autoencoder with soft-max regression (SMR).

Aygun et al. [2017] proposed two IDS models for detecting unknown attack types. One uses an autoencoder and the other uses a denoising autoencoder. The models additionally include a novel stochastic method for determining the threshold, which is a key parameter for accurate detection. The models were evaluated on the NSL-KDD data set and achieved detection accuracies of 88.28% and 88.65%, respectively.

Li et al. [2015] proposed an IDS based on a combination of an autoencoder and a DBN. In this method, the autoencoder is used to reduce the data dimension, and a DBN composed of multiple RBM layers and an additional BP neural network layer is used as a classifier that separates the malicious class from the normal class. The proposed model was evaluated on the KDD Cup 99 data set, and the results showed that combining an autoencoder with a DBN achieves better detection accuracy than a DBN alone.

Tao et al. [2016] proposed a data fusion approach based on the Fisher score and a deep autoencoder. Its main purpose is to reduce the dimension of the data. The work confirmed that using a deep autoencoder as a feature extraction method can improve the accuracy of classification algorithms such as back-propagation neural networks and support vector machines.

Shone et al. [2018] proposed a non-symmetric deep autoencoder (NDAE) that provides unsupervised feature learning and dimensionality reduction. Based on the proposed NDAE, the authors developed a novel NIDS model that combines stacked NDAEs (for better feature learning) with a Random Forest (RF) (for lower analysis overhead). Evaluation on the KDD Cup 99 and NSL-KDD data sets showed that the proposed method achieves accuracies of 97.85% and 85.42%, respectively.

Mirsky et al. [2018] developed a NIDS model called Kitsune, which uses autoencoder-based deep learning to detect malicious traffic online and in an unsupervised manner.

Zhang et al. [2018] proposed a deep learning approach to network intrusion detection that uses a sparse stacked autoencoder and a binary tree ensemble method, and evaluated its performance on the NSL-KDD data set. The evaluation showed that the proposed approach achieved an average F1 score of 91.97%.

Diro et al. [2018] proposed a distributed NIDS that uses a stacked autoencoder for pre-training and soft-max regression for classification, particularly in fog-to-things computing environments.

Madani et al. [2018] proposed a framework for evaluating the robustness of machine-learning-based IDS against intentional data poisoning, and analyzed and compared the performance of an autoencoder-based model and a PCA-based model under adversarial contamination attacks. The results showed that the autoencoder-based IDS can provide more stable detection performance than the PCA-based IDS.

Li et al. [2017] studied a deep learning model for differentiating the traffic flows of mobile applications and proposed a traffic identification model based on a VEAN (Variational Autoencoder Network).

Alom et al. [2015] proposed using a DBN consisting of a stack of restricted Boltzmann machines (RBMs) for network intrusion detection. The proposed approach was evaluated on the NSL-KDD data set and, according to the reported results, achieved a detection accuracy of 97.5% for only 37 detection units.

Salama et al. [2011] proposed a hybrid IDS model that combines an RBM-based DBN with an SVM classifier. The DBN reduces the dimension of the data, and the SVM is then applied to classify it.

Kim et al. [2016] proposed an IDS based on recurrent neural networks (RNN) and applied a long short-term memory (LSTM) structure to the RNN.

Tang et al. [2016] designed an IDS based on deep neural networks (DNN) for flow-based anomaly detection in a software-defined networking environment.

As described above, identifying network threats with conventional machine learning techniques requires labels for normal and abnormal data. Because solving the problem with deep learning or machine learning requires a vast amount of data, labeling that data takes a great deal of effort and time.

The object of the present invention is to provide an unsupervised network intrusion detection method using a deep learning algorithm, and a recording medium on which a program for executing the method is recorded, that classify network data as normal or abnormal using an autoencoder, which is a deep learning algorithm, and that classify and collect normal data from a vast amount of training data so that the deep learning network can be trained using only normal data.

An unsupervised network intrusion detection method using a deep learning algorithm according to an example of the present invention, and a recording medium on which a program for executing the method is recorded, includes: a preprocessing step of preprocessing a data packet input from the outside by one-hot encoding to generate an original vector expressed as vector values; a compression and restoration step of compressing the original vector to a lower dimension with an autoencoder and then restoring it to the original dimension to generate a restored vector expressed as vector values; a loss value calculation step of calculating a loss value from the difference between the original vector and the restored vector; and a determination step of determining the data packet to be abnormal data if the calculated loss value is greater than a threshold and to be normal data if the calculated loss value is less than the threshold. The threshold is determined using a receiver operating characteristic (ROC) curve, and a point on the ROC curve having a high true positive rate and a low false positive rate may be selected as the threshold.

For example, in the determination step, the threshold may be a point on the receiver operating characteristic curve at which the true positive rate is 0.8 or more and the false positive rate is 0.3 or less.

In the preprocessing step, the data packet input from the outside may include non-numerical data such as natural language or characters. Before the data packet is one-hot encoded, the non-numerical data is converted to numerical data; the data packet in numerical form is then one-hot encoded, so that it is coded with 0s and 1s and expressed as vector values.

In the compression and restoration step, the autoencoder may include an encoder step of compressing the original vector to a lower dimension than the original vector, and a decoder step of restoring it to the same dimension as the original vector to generate the restored vector.

In the loss value calculation step, at least one of the cross-entropy error (CEE) and the root mean square error (RMSE) may be used to calculate the difference between the original vector and the restored vector.

The unsupervised network intrusion detection method using a deep learning algorithm according to the present invention, and the recording medium on which a program for executing the method is recorded, preprocess a data packet input from the outside into vector values, compress and restore the data packet with an autoencoder, and calculate a loss value from the difference. Whether the data packet is normal or abnormal can thus be determined, which reduces the effort and time required for data labeling.

FIG. 1 is a diagram for explaining the concept of an unsupervised network intrusion detection system using a deep learning algorithm according to an example of the present invention.
FIG. 2 is a diagram for explaining the concept of the autoencoder used in the compression and restoration step (S2) shown in FIG. 1.
FIG. 3 shows an example of the ROC curve for the SGD algorithm.
FIG. 4 shows an example of the ROC curve for the RMSPROP algorithm.
FIG. 5 shows an example of the ROC curve for the ADAM algorithm.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The present invention is not, however, limited to these embodiments and may be modified in various forms.

To describe the present invention clearly and concisely, parts unrelated to the description are omitted from the drawings, and the same reference numerals are used for identical or very similar parts throughout the specification. Thicknesses, widths, and the like are enlarged or reduced in the drawings for clarity, and the thicknesses, widths, and the like of the present invention are not limited to those shown in the drawings.

FIG. 1 is a diagram for explaining the concept of an unsupervised network intrusion detection system and method using a deep learning algorithm according to an example of the present invention.

As shown in FIG. 1, the unsupervised network intrusion detection system and method using a deep learning algorithm according to an example of the present invention may perform a preprocessing step (S1), a compression and restoration step (S2), a loss value calculation step (S3), and a determination step (S4).

In the preprocessing step (S1), a data packet input from the outside may be preprocessed by one-hot encoding to generate an original vector expressed as vector values.

In the preprocessing step (S1), the data packet input from the outside may include non-numerical data such as natural language or characters. Before the data packet is one-hot encoded, the non-numerical data is converted to numerical data; the data packet in numerical form is then one-hot encoded, so that it is coded with 0s and 1s and expressed as vector values.

In the compression and restoration step (S2), the original vector may be compressed to a lower dimension with an autoencoder and then restored to the original dimension to generate a restored vector expressed as vector values.

Here, the autoencoder may include an encoder step of compressing the original vector to a lower dimension than the original vector and a decoder step of restoring it to the same dimension as the original vector.
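The encoder and decoder steps can be illustrated with a minimal sketch. It assumes a single dense layer on each side, sigmoid activations, and randomly initialised, untrained weights; the layer sizes and the names `make_layer`, `encode`, and `decode` are hypothetical and are not specified in this description.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer; the values are random placeholders."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(vector, layer):
    """Apply one dense layer followed by a sigmoid activation."""
    weights, biases = layer
    return [sigmoid(sum(w * v for w, v in zip(row, vector)) + b)
            for row, b in zip(weights, biases)]

rng = random.Random(0)
INPUT_DIM, CODE_DIM = 8, 3  # assumed sizes, for illustration only
encoder_layer = make_layer(INPUT_DIM, CODE_DIM, rng)
decoder_layer = make_layer(CODE_DIM, INPUT_DIM, rng)

def encode(original):
    """Encoder step: compress to a lower dimension."""
    return dense(original, encoder_layer)

def decode(code):
    """Decoder step: restore to the original dimension."""
    return dense(code, decoder_layer)

original_vector = [1, 0, 0, 0, 1, 0, 0, 1]
code = encode(original_vector)
restored_vector = decode(code)
print(len(original_vector), len(code), len(restored_vector))  # 8 3 8
```

In practice the weights would be trained so that the restored vector is close to the original vector for normal traffic; the sketch only shows the change of dimension.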

In the loss value calculation step (S3), a loss value may be calculated from the difference between the original vector and the restored vector. The cross-entropy error (CEE) or the root mean square error (RMSE) may be used to calculate this difference.
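The two error measures can be sketched as follows. This is an illustration under assumptions: the cross-entropy is taken to be the element-wise binary cross-entropy averaged over the vector, which presumes original values in {0, 1} and restored values in (0, 1); the function names and sample vectors are hypothetical.

```python
import math

def rmse(original, restored):
    """Root mean square error between two vectors of equal length."""
    n = len(original)
    return math.sqrt(sum((o - r) ** 2 for o, r in zip(original, restored)) / n)

def cross_entropy(original, restored, eps=1e-12):
    """Mean binary cross-entropy; restored values are clipped away from 0 and 1."""
    total = 0.0
    for o, r in zip(original, restored):
        r = min(max(r, eps), 1.0 - eps)
        total -= o * math.log(r) + (1 - o) * math.log(1.0 - r)
    return total / len(original)

original = [1, 0, 0, 1]
good_restoration = [0.9, 0.1, 0.2, 0.8]  # close to the original
poor_restoration = [0.3, 0.7, 0.6, 0.2]  # far from the original

for name, restored in (("good", good_restoration), ("poor", poor_restoration)):
    print(name, round(rmse(original, restored), 4),
          round(cross_entropy(original, restored), 4))
```

Both measures grow as the restored vector moves away from the original, which is the property the determination step relies on.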

In the determination step (S4), the data packet may be determined to be abnormal data if the calculated loss value is greater than the threshold and to be normal data if the calculated loss value is less than the threshold.
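The decision rule amounts to a single comparison. In the sketch below, the function name `classify` and the sample threshold are hypothetical; the description does not say how a loss exactly equal to the threshold is handled, so the sketch assumes it is treated as normal.

```python
def classify(loss, threshold):
    """Return 'abnormal' if the loss exceeds the threshold, otherwise 'normal'.

    Assumption: a loss equal to the threshold is treated as normal.
    """
    return "abnormal" if loss > threshold else "normal"

THRESHOLD = 0.5  # placeholder value, for illustration only
for loss in (0.12, 0.5, 0.83):
    print(loss, classify(loss, THRESHOLD))
```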

Here, the threshold is determined using a receiver operating characteristic (ROC) curve, and a point on the ROC curve having a high true positive rate and a low false positive rate may be selected as the threshold.

For example, the threshold may be selected such that, on the receiver operating characteristic curve, the true positive rate is 0.8 or more and the false positive rate is 0.3 or less.
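One way to apply this criterion is to compute the true positive rate and false positive rate for each candidate threshold and keep the candidates that satisfy both bounds. The sketch below assumes that a small set of loss values with known labels is available for this purpose; the function names and the sample data are hypothetical.

```python
def roc_points(losses, labels):
    """Return (threshold, tpr, fpr) for each distinct loss used as a threshold.

    labels: 1 = abnormal (positive), 0 = normal (negative).
    A sample is flagged abnormal when its loss is greater than the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(losses)):
        tp = sum(1 for l, y in zip(losses, labels) if l > threshold and y == 1)
        fp = sum(1 for l, y in zip(losses, labels) if l > threshold and y == 0)
        points.append((threshold, tp / positives, fp / negatives))
    return points

def select_thresholds(points, min_tpr=0.8, max_fpr=0.3):
    """Keep the ROC points with TPR >= min_tpr and FPR <= max_fpr."""
    return [p for p in points if p[1] >= min_tpr and p[2] <= max_fpr]

# Synthetic reconstruction losses and labels, for illustration only.
losses = [0.05, 0.10, 0.12, 0.30, 0.45, 0.60, 0.75, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

print(select_thresholds(roc_points(losses, labels)))  # [(0.12, 1.0, 0.25)]
```

When several candidates satisfy the bounds, an additional rule (for example, the lowest false positive rate) would be needed to pick one; the description does not specify such a rule.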

Hereinafter, each step performed by the unsupervised network intrusion detection system using a deep learning algorithm according to an example of the present invention will be described in more detail.

To this end, the data sets are described first.

Data collected in the real world, including IDS data, is usually imbalanced between classes. Intrusion data is known to make up about 1% of all traffic data [Song et al., 2011a, Roshan and Huang, 2014].

However, KDD Cup 99 [2007] and Kyoto 2006+ [Song et al., 2011a], which are mainly used to evaluate machine-learning-based IDS performance, contain far less normal data than abnormal data. Because they were collected in environments designed specifically for gathering attack data, attack data accounts for about 74% of all data in KDD Cup 99 and about 95% of the Kyoto 2006+ data set.

Imbalanced data sets raise the following issue. When IDS performance is evaluated on a data set with unequal class sizes, it is essential to define evaluation metrics suited to the problem. For example, accuracy is not the best measure when most of the data belongs to one class. The present invention therefore presents all the main performance scores in the evaluation section.
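A small synthetic example shows why accuracy alone can mislead on imbalanced data. The sketch assumes a set of 100 samples with 95 attacks, echoing the approximate Kyoto 2006+ proportion mentioned above, and a trivial detector that flags every sample as an attack; the function name `scores` is hypothetical.

```python
def scores(y_true, y_pred):
    """Compute common binary metrics; 1 = attack (positive), 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

y_true = [1] * 95 + [0] * 5  # synthetic labels: 95 attacks, 5 normal
y_pred = [1] * 100           # trivial detector: everything is an attack

result = scores(y_true, y_pred)
print(result["accuracy"], result["false_positive_rate"])  # 0.95 1.0
```

The detector reaches 95% accuracy while misclassifying every normal sample, which is why several scores are reported together.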

The KDD Cup 99 data set [2007] is a large data set that is widely used to evaluate network IDS performance. However, it was collected in 1999 and may not contain recent network intrusion patterns. In addition, it was collected in a virtual network environment whose patterns differ from those observed in real network systems.

The Kyoto 2006+ data set [Song et al., 2006, Song et al., 2011a] contains network traffic data collected from November 2006 to December 2015. At 19.683 GB, the data set is also suitable for large-scale data analysis. The approach of the present invention was therefore applied to the Kyoto 2006+ data set.

The Kyoto 2006+ data set combines various network threats collected through a honeypot network with traffic from regular servers at Kyoto University. The data set consists of 24 features in total [Song et al., 2006].

Of these, 14 features are selected from the 41 features of the KDD Cup 99 data set: Duration, Service, Source bytes, Destination bytes, Count, Srv rate, Serror rate, Srv serror rate, Dst host count, Dst host srv count, Dst host same src port rate, Dst host serror rate, Dst host srv serror rate, and Flag.

Ten additional features were designated for verification purposes. Three of them (IDS detection, malware detection, and shellcode detection) indicate whether a packet was detected by a specific IDS mounted at the front end of the honeypot network. The data set also contains information such as the source IP address, source port number, destination IP address, destination port number, start time, and protocol.

The whole data set consists of three classes.

All packets entering the decoy servers, called honeypots, are regarded as attacks and are labeled 'known attack (-1)' or 'shellcode-based unknown attack (-2)', while all packets at the regular servers are regarded as 'normal (1)'.

Accordingly, the first stage of the preprocessing step (S1) may filter out noise. An instance labeled normal (1) is regarded as noise when its IDS Detection, Malware Detection, and Shell Code Detection fields are triggered. Because these features indicate whether an intrusion, malicious software, or shellcode was detected, respectively, such an instance should have been given the -1 (attack) label. The inconsistency arises from the assumption that every packet arriving at a regular server is normal.
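The noise filter can be sketched as a consistency check between the label and the detection fields. Everything concrete here is an assumption made for illustration: the dictionary keys, the convention that the string "0" means nothing was detected, the sample alert values, and the choice to treat a record as noise when any one of the three fields is triggered.

```python
DETECTION_FIELDS = ("ids_detection", "malware_detection", "shellcode_detection")

def is_noise(record):
    """True for a record labeled normal (1) although a detector fired.

    Assumptions: "0" means no detection; any single triggered field counts.
    """
    fired = any(record[field] != "0" for field in DETECTION_FIELDS)
    return record["label"] == 1 and fired

def filter_noise(records):
    """Drop records whose normal label contradicts the detection fields."""
    return [r for r in records if not is_noise(r)]

records = [
    {"label": 1, "ids_detection": "0", "malware_detection": "0",
     "shellcode_detection": "0"},
    {"label": 1, "ids_detection": "alert-a", "malware_detection": "0",
     "shellcode_detection": "0"},   # inconsistent: treated as noise
    {"label": -1, "ids_detection": "alert-b", "malware_detection": "0",
     "shellcode_detection": "0"},
]

clean = filter_noise(records)
print(len(clean))  # 2
```

Relabeling such records as -1 instead of dropping them would be an alternative reading of the passage above.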

The second step removes some features. IDS detection, malware detection, and shellcode detection are removed because they are strong attack indicators. The source and destination IP addresses are removed because these values were sanitized to random IPs during the initial data collection process. The start time is not significant and is also removed.

Third, non-numerical data are converted to numerical data before being input to the autoencoder. Service, Flag, and Protocol are categorical data. The Service feature has 85 distinct values and Flag has 13. Protocol denotes the network protocol, whose value is tcp, udp, or icmp. Each non-numerical value is first mapped to a number and later encoded with one-hot encoding. For such numeric category codes, the magnitude of the number does not represent any magnitude in the data itself.

Therefore, the data are encoded with one-hot encoding so that these numeric codes do not affect the weight training process. The Protocol feature is represented as a three-dimensional vector: [1,0,0] for tcp (0), [0,1,0] for udp (1), and [0,0,1] for icmp (2).
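As a minimal sketch of the one-hot encoding described above, the following uses the tcp, udp, icmp index order given in the text. The helper name `one_hot` is illustrative only.

```python
PROTOCOLS = ["tcp", "udp", "icmp"]  # index order from the text: tcp=0, udp=1, icmp=2


def one_hot(value, categories):
    """Return a vector with a single 1 at the position of `value`."""
    index = categories.index(value)  # raises ValueError for an unknown category
    vector = [0] * len(categories)
    vector[index] = 1
    return vector


tcp_vector = one_hot("tcp", PROTOCOLS)
udp_vector = one_hot("udp", PROTOCOLS)
icmp_vector = one_hot("icmp", PROTOCOLS)
```

The same helper would apply to the Service and Flag features once their category lists are known, which the text does not enumerate.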

Similarly, the categorical Service and Flag features are converted to 85-dimensional and 13-dimensional vectors, respectively. Source and destination port numbers range from 0 to 65535, so converting them to 65536-dimensional vectors would make the data too large.

Therefore, each port number is represented as a three-dimensional vector over three ranges: well-known ports (0 to 1023), registered ports (1024 to 49151), and dynamic ports (49152 to 65535).
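The port bucketing above can be sketched as follows. The position of each range within the vector is an assumption, since the text does not fix an order, and the upper bound is taken as 65535, the largest valid port number.

```python
def port_one_hot(port):
    """Map a port number to a 3-dimensional one-hot vector by port range.

    Assumed positions: [well-known, registered, dynamic].
    """
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    if port <= 1023:
        return [1, 0, 0]  # well-known ports
    if port <= 49151:
        return [0, 1, 0]  # registered ports
    return [0, 0, 1]      # dynamic ports


http_vector = port_one_hot(80)
registered_vector = port_one_hot(8080)
dynamic_vector = port_one_hot(50000)
```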

The last step normalizes the feature values to lie between 0 and 1. Categorical data are automatically converted to 0 and 1 by one-hot encoding. The Duration, Source bytes, and Destination bytes values may be normalized using MinMax scaling after statistical outliers are replaced by the threshold value for upper outliers.

The Dst host count and Dst host srv count features may be converted to values between 0 and 1 by dividing by the maximum value (100 in this case). The final features selected for evaluation may be as follows: Duration, Service, Source bytes, Destination bytes, Count, Same srv rate, Serror rate, Srv serror rate, Dst host count, Dst host srv count, Dst host same src port rate, Dst host serror rate, Dst host srv serror rate, Flag, Source Port Number, Destination Port Number, Protocol.
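The two normalizations described above can be sketched as follows. The text does not say how the threshold for upper outliers is computed, so this sketch assumes Tukey's upper fence (Q3 + 1.5 × IQR); the function names are illustrative.

```python
import statistics


def upper_fence(values):
    """Assumed outlier rule: Tukey's upper fence, Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 + 1.5 * (q3 - q1)


def clip_and_minmax(values):
    """Replace upper outliers by the fence value, then scale to [0, 1]."""
    cap = upper_fence(values)
    clipped = [min(v, cap) for v in values]
    low, high = min(clipped), max(clipped)
    if high == low:
        return [0.0 for _ in clipped]  # constant column: nothing to scale
    return [(v - low) / (high - low) for v in clipped]


def scale_by_max(values, max_value=100):
    """Divide by the known maximum (100 for the Dst host count features)."""
    return [v / max_value for v in values]


durations = [0, 1, 2, 2, 3, 4, 5, 1000]  # 1000 is an upper outlier
scaled_durations = clip_and_minmax(durations)
scaled_counts = scale_by_max([0, 25, 100])
```

Clipping before scaling keeps a single extreme value from compressing all other values toward 0.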

These features are expressed in 119 dimensions by combining 12 ordinary numeric features with 107 one-hot encoded categorical dimensions. Finally, duplicates in the normalized data set may be removed using drop_duplicates from the Python Pandas library.
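The dimension count can be checked with a short calculation. The breakdown of the 107 categorical dimensions below is a reading of the preceding text (85 for Service, 13 for Flag, and 3 each for Protocol and the two port numbers), and the snake_case feature names are illustrative. The duplicate removal is written in plain Python as a stand-in for the Pandas call named in the text.

```python
# One-hot widths as read from the text (assumed breakdown of the 107 dimensions).
ONE_HOT_WIDTHS = {
    "service": 85,
    "flag": 13,
    "protocol": 3,
    "source_port": 3,
    "destination_port": 3,
}

# The remaining 12 features of the final list are numeric.
NUMERIC_FEATURES = [
    "duration", "source_bytes", "destination_bytes", "count",
    "same_srv_rate", "serror_rate", "srv_serror_rate",
    "dst_host_count", "dst_host_srv_count", "dst_host_same_src_port_rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate",
]

one_hot_dims = sum(ONE_HOT_WIDTHS.values())
total_dims = len(NUMERIC_FEATURES) + one_hot_dims


def drop_duplicate_rows(rows):
    """Remove repeated rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


deduplicated = drop_duplicate_rows([[1, 0], [1, 0], [0, 1]])
```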

Then, Known Attack (-1) and Unknown Attack (-2) may be aggregated into a single abnormal class, so that the data set consists of two classes: normal (1) and abnormal (-1).
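The label aggregation above amounts to a simple mapping, sketched here with an illustrative function name.

```python
def to_binary_label(label):
    """Merge known (-1) and unknown (-2) attacks into one abnormal class (-1)."""
    if label == 1:
        return 1
    if label in (-1, -2):
        return -1
    raise ValueError(f"unexpected label: {label}")


binary_labels = [to_binary_label(label) for label in [1, -1, -2, 1]]
```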

FIG. 2 is a diagram for explaining the concept of the autoencoder used in the compression and restoration step (S2) shown in FIG. 1.

In FIG. 2, white circles denote biases and black circles denote network nodes.

The autoencoder may be a deep learning algorithm composed of an encoding network that compresses the input and a decoding network that reconstructs the compressed vector to the original vector size.

More specifically, as shown in FIG. 2, when a network log represented as a 119-dimensional vector is given to a trained autoencoder, the output may be generated through compression and reconstruction.

In the compression and restoration step (S2), the autoencoder method may include an encoder step of compressing the original vector to a lower dimension than the original vector, and a decoder step of restoring it to the same dimension as the original vector to generate the restored vector.

In the encoder step, as shown in FIG. 2, the encoder network may first convert the 119-dimensional original vector into a 512-dimensional value and then compress it successively to 256, 128, and 64 dimensions, yielding a lower dimension than the original vector. In the decoder step, the decoding network may restore the 64 dimensions successively to 128, 256, 512, and 119 dimensions, so that the result has the same dimension as the original vector.
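The layer widths described above can be illustrated with an untrained forward pass in plain Python. Only the widths and the zero bias initialization come from the text; the weight initializer, the ReLU hidden activation, and the sigmoid output activation are assumptions made for this sketch.

```python
import math
import random

# Layer widths from the text: encoder 119 -> 512 -> 256 -> 128 -> 64,
# decoder 64 -> 128 -> 256 -> 512 -> 119.
LAYER_SIZES = [119, 512, 256, 128, 64, 128, 256, 512, 119]


def init_layer(n_in, n_out, rng):
    """Random weights (assumed uniform initializer) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(vector, layers):
    """Return the activation of every layer, input included."""
    activations = [vector]
    for index, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, vector)) + b
               for row, b in zip(weights, biases)]
        if index == len(layers) - 1:
            vector = [1.0 / (1.0 + math.exp(-v)) for v in pre]  # assumed sigmoid output
        else:
            vector = [max(0.0, v) for v in pre]  # assumed ReLU hidden units
        activations.append(vector)
    return activations


rng = random.Random(0)
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]
original = [1.0 if i % 7 == 0 else 0.0 for i in range(119)]
activations = forward(original, layers)
widths = [len(a) for a in activations]  # one width per layer, bottleneck of 64
```

With a sigmoid output, every restored value lies in [0, 1], which matches the normalized range of the input features.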

Thereafter, in the loss value calculation step (S3), the loss value may be calculated using the cross-entropy error (CEE) or the root mean square error (RMSE) as the measure of the difference between the original vector and the restored vector.
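The loss measures named above can be sketched as follows. The text does not state whether the error is summed or averaged over the vector, so averaging over dimensions is an assumption here, as is the clipping constant `eps` that keeps the logarithm finite.

```python
import math


def mse(original, restored):
    """Mean squared error between two equal-length vectors."""
    return sum((o - r) ** 2 for o, r in zip(original, restored)) / len(original)


def rmse(original, restored):
    """Root mean square error: the square root of the MSE."""
    return math.sqrt(mse(original, restored))


def cross_entropy(original, restored, eps=1e-12):
    """Binary cross-entropy averaged over dimensions (assumed reduction)."""
    total = 0.0
    for o, r in zip(original, restored):
        r = min(max(r, eps), 1.0 - eps)  # keep log() arguments strictly positive
        total -= o * math.log(r) + (1.0 - o) * math.log(1.0 - r)
    return total / len(original)


original = [1.0, 0.0]
good_loss = cross_entropy(original, [0.99, 0.01])  # close reconstruction
poor_loss = cross_entropy(original, [0.5, 0.5])    # uninformative reconstruction
```

A well-reconstructed vector yields a small loss and a poorly reconstructed one a large loss, which is the property the determination step relies on.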

As an example of an evaluation for this purpose, a total of two data sets were created.

First, the training set was randomly undersampled from the normal data set.

Second, to avoid class imbalance, the test set was randomly drawn from a data set consisting of equal proportions of normal and abnormal data.
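The construction of the two sets described above can be sketched as follows. The text does not say whether the normal test samples are disjoint from the training samples; this sketch assumes they are. The function name and parameters are illustrative.

```python
import random


def make_splits(normal, abnormal, n_train, n_test_per_class, seed=0):
    """Build a normal-only training set and a class-balanced test set."""
    rng = random.Random(seed)
    normal = list(normal)
    abnormal = list(abnormal)
    if len(normal) < n_train + n_test_per_class or len(abnormal) < n_test_per_class:
        raise ValueError("not enough samples for the requested split")
    rng.shuffle(normal)
    train = normal[:n_train]                              # normal data only
    test_normal = normal[n_train:n_train + n_test_per_class]
    test_abnormal = rng.sample(abnormal, n_test_per_class)
    test = [(x, 1) for x in test_normal] + [(x, -1) for x in test_abnormal]
    rng.shuffle(test)
    return train, test


# Toy identifiers stand in for preprocessed log vectors.
train_set, test_set = make_splits(range(100), range(100, 140), 50, 20)
```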

Table 1 below shows the number of logs in each data set. Note that the training set contains no abnormal data.

Table 1 lists the number of logs in the normal and abnormal classes for each evaluation data set.

Dataset        Normal class   Abnormal class   Total
Training set   50284          0                50284
Test set       95752          95752            191504

Table 2 shows the training configuration. For the evaluation, the weights were initialized according to a weight distribution and the bias weights were initialized to 0. RMSPROP, SGD, and ADAM were used as optimization algorithms, and the root mean square error (RMSE) and the cross-entropy error (CEE) were used as loss functions. The number of epochs was set to 6, and batch sizes of 32 and 64 were each tested.

Parameter                Value
Optimization algorithm   Rmsprop, SGD, Adam
Loss function            MSE, Cross entropy error
Epochs                   6
Batch size               32, 64
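The training configurations in Table 2 can be enumerated as a grid, sketched below. The dictionary keys are illustrative, and the sketch lists every combination without asserting that each one was actually evaluated.

```python
from itertools import product

OPTIMIZERS = ["rmsprop", "sgd", "adam"]
LOSSES = ["mse", "cross_entropy"]
BATCH_SIZES = [32, 64]
EPOCHS = 6

# One entry per (optimizer, loss, batch size) combination from Table 2.
configs = [
    {"optimizer": opt, "loss": loss, "batch_size": size, "epochs": EPOCHS}
    for opt, loss, size in product(OPTIMIZERS, LOSSES, BATCH_SIZES)
]
```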

Thereafter, receiver operating characteristic (ROC) curves are drawn for the various parameters, and the loss value that best separates the normal and abnormal (i.e., intrusion) classes may be selected as the threshold.

FIG. 3 shows an example of ROC curves for the SGD algorithm, where (a) and (b) are the results for batch sizes 32 and 64 with the MSE loss function, and (c) and (d) are the results with the cross-entropy loss function.

As shown in (c) and (d) of FIG. 3, cross-entropy as the loss function performs better than the MSE shown in (a) and (b) of FIG. 3, and the difference in batch size matters only for MSE. Although there are slight differences depending on the loss function and batch size, the overall performance is good, since the false positive rate rises only as the true positive rate nears 90%.

FIG. 4 shows an example of ROC curves for the RMSPROP algorithm.

Cross-entropy could not be applied, so only MSE was evaluated. There are slight differences depending on batch size, but the performance is poor.

FIG. 5 shows an example of ROC curves for the ADAM algorithm.

In FIG. 5, (a) and (b) are the results for batch sizes 32 and 64 with the MSE loss function, and (c) and (d) are the results with the cross-entropy loss function. MSE is the better loss function here, but ADAM shows the worst performance of the three algorithms used.

The cross-entropy error (CEE) and the root mean square error (RMSE) between the input data and the output data reconstructed by the trained autoencoder model were calculated.

To classify an input log as normal or abnormal, threshold values of the cross-entropy error (CEE) and the root mean square error (RMSE) that best separate the normal and abnormal classes were determined. First, ROC curves are drawn for the various parameters to check whether each can serve as a good classification parameter. The curves of FIG. 3, which show the best performance, confirm that cross-entropy performs better than the root mean square error (RMSE).

However, which loss function should be used as the classification feature to improve performance depends on the optimization algorithm used.

In the determination step (S4), using the ROC curve described above, the data packet may be determined to be abnormal data if the calculated loss value is greater than the threshold, and normal data if the calculated loss value is less than the threshold.

Here, a point on the ROC curve at which the true positive rate is high and the false positive rate is low may be selected as the threshold.
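The threshold selection and the determination rule above can be sketched as follows. The text only asks for a point with a high true positive rate and a low false positive rate; this sketch assumes that, among points with a true positive rate of at least 0.8 and a false positive rate of at most 0.3 (the bounds stated in claim 1), the one maximizing their difference is chosen. Treating a loss equal to the threshold as normal is also an assumption.

```python
def roc_points(losses, labels):
    """Return (threshold, tpr, fpr) for each distinct loss value.

    Labels: -1 is abnormal (the positive class), 1 is normal.
    A sample is predicted abnormal when its loss exceeds the threshold.
    """
    n_pos = sum(1 for y in labels if y == -1)
    n_neg = sum(1 for y in labels if y == 1)
    points = []
    for threshold in sorted(set(losses)):
        tp = sum(1 for s, y in zip(losses, labels) if s > threshold and y == -1)
        fp = sum(1 for s, y in zip(losses, labels) if s > threshold and y == 1)
        points.append((threshold, tp / n_pos, fp / n_neg))
    return points


def select_threshold(points, min_tpr=0.8, max_fpr=0.3):
    """Pick the admissible point with the largest TPR - FPR (assumed criterion)."""
    candidates = [p for p in points if p[1] >= min_tpr and p[2] <= max_fpr]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p[1] - p[2])[0]


def classify(loss, threshold):
    """Return -1 (abnormal) if the loss exceeds the threshold, else 1 (normal)."""
    return -1 if loss > threshold else 1


# Toy reconstruction losses: normal logs reconstruct well, attacks mostly do not.
losses = [0.1, 0.2, 0.3, 0.4, 2.0, 3.0, 0.35, 5.0]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
threshold = select_threshold(roc_points(losses, labels))
```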

Tables 3 to 5 below show the precision, specificity (true negative rate), false positive rate, threshold, and F1 measure for predicting the abnormal class (i.e., network intrusion) at recall values from 0.50 to 0.95, including the point at which the F1 measure reaches its maximum.

Recall   Precision   Specificity   False Positive   Threshold   F1
0.50     1.00        1.00          0.00             11.79       0.67
0.55     1.00        1.00          0.00             11.26       0.71
0.60     1.00        1.00          0.00             10.66       0.75
0.65     1.00        0.99          0.01             10.20       0.79
0.70     1.00        0.99          0.01             10.08       0.82
0.75     1.00        0.99          0.01             9.56        0.86
0.80     1.00        0.89          0.01             9.03        0.89
0.85     1.00        0.98          0.02             8.29        0.92
0.90     0.99        0.88          0.12             4.45        0.94
0.95     0.98        0.62          0.38             2.22        0.97

Table 3 shows the evaluation results for SGD.

Recall   Precision   Specificity   False Positive   Threshold   F1
0.50     1.00        1.00          0.00             8.94        0.67
0.55     1.00        1.00          0.00             8.00        0.71
0.60     1.00        0.99          0.01             6.77        0.75
0.65     1.00        0.99          0.01             5.36        0.79
0.70     1.00        0.99          0.01             4.61        0.82
0.75     1.00        0.98          0.02             3.88        0.86
0.80     1.00        0.92          0.08             3.08        0.89
0.85     0.99        0.86          0.14             2.45        0.92
0.90     0.98        0.70          0.30             1.68        0.94
0.95     0.97        0.37          0.63             1.20        0.96

Table 4 shows the evaluation results for RMSPROP.

Recall   Precision   Specificity   False Positive   Threshold   F1
0.50     1.00        1.00          0.00             7.34        0.67
0.55     1.00        1.00          0.00             6.08        0.71
0.60     1.00        0.99          0.01             5.20        0.75
0.65     1.00        0.98          0.02             4.09        0.79
0.70     1.00        0.97          0.03             3.53        0.83
0.75     1.00        0.92          0.08             3.09        0.86
0.80     0.99        0.88          0.12             2.49        0.89
0.85     0.99        0.81          0.19             2.22        0.92
0.90     0.99        0.74          0.23             1.91        0.94
0.95     0.99        0.44          0.56             1.39        0.96

Table 5 shows the evaluation results for ADAM. Tables 3 to 5 show the best parameter values obtained from the ROC curves for each optimization algorithm. At a recall of 0.95, the F1 measure is 0.97 with the SGD algorithm, the highest of the three. The difference in F1 measure from the other two algorithms is small, but the false positive rate shows that the SGD algorithm differs from the others by a much larger margin (0.38, against 0.63 for RMSPROP and 0.56 for ADAM).
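The columns of Tables 3 to 5 can be computed from predictions as sketched below, with the abnormal class (-1) treated as the positive class. The function names are illustrative, and the sample predictions are toy data rather than results from the evaluation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, actual):
    """Recall, precision, specificity, false positive rate, and F1.

    The abnormal class (-1) is the positive class; normal is 1.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == -1 and a == -1)
    fp = sum(1 for p, a in pairs if p == -1 and a == 1)
    tn = sum(1 for p, a in pairs if p == 1 and a == 1)
    fn = sum(1 for p, a in pairs if p == 1 and a == -1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "false_positive_rate": 1.0 - specificity if tn + fp else 0.0,
        "f1": f1_score(precision, recall),
    }


actual = [-1, -1, -1, -1, 1, 1, 1, 1]
predicted = [-1, -1, -1, 1, 1, 1, 1, -1]
metrics = evaluate(predicted, actual)
```

As a consistency check, `f1_score(1.00, 0.50)` rounds to 0.67, which agrees with the first row of each table.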

As described above, the present invention proposes an unsupervised deep learning approach for detecting network intrusions. The autoencoder network compresses its input and reconstructs it to the original size as learned from the given training set. The present invention assumes that, when the network is trained on normal data, an attack is unlikely to be restored to its input values. An experiment with a training set consisting of 100% normal data confirms that the approach of the present invention is effective for intrusion detection (F-score 0.97).

The approach proposed according to the present invention offers several advantages. First, because the learning method uses only normal data, little effort is required to label the training data. Second, it can address zero-day attacks, for which the intrusion pattern has not yet been discovered.

The features, structures, effects, and the like described above are included in at least one embodiment of the present invention and are not necessarily limited to a single embodiment. Furthermore, the features, structures, effects, and the like illustrated in each embodiment may be combined or modified for other embodiments by those of ordinary skill in the art to which the embodiments belong. Accordingly, matters relating to such combinations and modifications should be construed as falling within the scope of the present invention.

Claims

An unsupervised network intrusion detection method using a deep learning algorithm, in which each step is performed by a computing system, the method comprising:
a preprocessing step of preprocessing an externally input data packet by one-hot encoding to generate an original vector expressed as vector values;
a compression and restoration step of compressing the original vector to a lower dimension using an autoencoder method and then restoring it to the original dimension to generate a restored vector expressed as vector values;
a loss value calculation step of calculating a loss value by calculating a difference value between the original vector and the restored vector; and
a determination step of determining the data packet to be abnormal data if the calculated loss value is greater than a threshold, and determining the data packet to be normal data if the calculated loss value is less than the threshold,
wherein the threshold is determined using a receiver operating characteristic (ROC) curve, and a point on the ROC curve at which the true positive rate is 0.8 or more and the false positive rate is 0.3 or less is selected as the threshold.

delete

The method according to claim 1,
wherein, in the preprocessing step,
the externally input data packet includes natural language or characters that are non-numerical data, the non-numerical data are converted to numerical data before the data packet is preprocessed by the one-hot encoding method, and the data packet in numerical form is then preprocessed by the one-hot encoding method, coded as 0 and 1, and expressed as vector values.

The method according to claim 1,
wherein, in the compression and restoration step,
the autoencoder method comprises:
an encoder step of compressing the original vector to a lower dimension than the original vector; and
a decoder step of restoring it to the same dimension as the original vector to generate the restored vector.

The method according to claim 1,
wherein, in the loss value calculation step,
at least one of a cross-entropy error (CEE) and a root mean square error (RMSE) is used to calculate the difference value between the original vector and the restored vector.

A computer-readable recording medium in which a computer program for executing the unsupervised network intrusion detection method using the deep learning algorithm according to any one of claims 1 to 5 by a computer is recorded.