KR20230086976A

KR20230086976A - Improved network intrusion detection method and system through hybrid feature selection and data balancing

Info

Publication number: KR20230086976A
Application number: KR1020210175510A
Authority: KR
Inventors: 박정찬; 신동일; 신동규; 김진국; 민병준
Original assignee: 국방과학연구소; 세종대학교산학협력단
Priority date: 2021-12-09
Filing date: 2021-12-09
Publication date: 2023-06-16

Abstract

The present invention relates to a network intrusion detection method performed by a network intrusion detection system. The method includes the steps of: constructing a deep neural network-based network intrusion detection model through feature selection and data balancing; receiving test data for network intrusion detection as input to the constructed model; and detecting network intrusion in the test data using the model.

Description

Efficient network detection method and system through hybrid feature selection and data balancing

The present invention relates to network intrusion detection technology, and more particularly to an efficient network detection method and system that use hybrid feature selection and data balancing to perform network intrusion detection with a machine learning-based network detection model.

A network-based intrusion detection system (NIDS) restricts intrusion by unauthorized users and monitors traffic to determine whether an attack is under way.

Network intrusion detection systems in wide use today rely on signature-based analysis. In this approach, experts analyze the characteristics of an attack and turn them into patterns, which are then matched against incoming network packets in real time.

However, rapidly changing attacks such as Advanced Persistent Threat (APT) attacks have recently become frequent, so analyzing traffic and logs for each new attack is costly and time-consuming.

Because signature patterns do not guarantee generalized performance even against similar variants of rapidly changing attacks, the limitations of existing signature-based systems are becoming clear. To address this, machine learning-based detection systems are being actively studied.

A machine learning model learns from data the rules for deciding whether an intrusion has occurred. If an automated intrusion detection system can be built this way, the time and cost problems mentioned above can be resolved.

Such a model can also provide generalized performance against new attack patterns. However, feeding every available attribute to the model can degrade its performance and waste training time.

Therefore, for real-time detection, selecting the features that are relevant to learning from among the many available attributes is also treated as an important problem.

In addition, most data collected in the real world is not perfectly balanced across classes. In intrusion detection in particular, intrusion data is known to make up only about 1% of all data.

It is very difficult for a machine learning model to train properly on such a small amount of intrusion data. Oversampling and undersampling techniques can be used to overcome this.

Korean Patent Registration No. 10-2083028 (Publication date: 2020. 02. 28.)
Korean Patent Application Publication No. 10-2019-0083458 (Publication date: 2019. 07. 12.)

The present invention proposes a Hybrid Feature Selection (HFS) technique that removes unnecessary and overlapping attributes from the input of a machine learning model, and proposes an HFS-DNN (Deep Neural Network) model that applies the hybrid feature selection technique to a deep neural network.

Through the hybrid feature selection technique, the present invention aims to preserve learning performance while using only a small number of inputs. It also aims to resolve the class imbalance of the NSL-KDD data set used for training, which causes poor detection rates for minority classes, by using the SMOTE (Synthetic Minority Over-sampling Technique) and RUS (Random Under Sampling) techniques.

According to one aspect, the present invention provides a network intrusion detection method performed by a network intrusion detection system, the method comprising: constructing a deep neural network-based network intrusion detection model through feature selection and data balancing; receiving test data for network intrusion detection as input to the constructed network intrusion detection model; and detecting network intrusion in the test data using the network intrusion detection model.

The constructing step may include receiving training data as input to the deep neural network-based network intrusion detection model and training the model so that unnecessary attributes are removed through feature selection and data balancing applied to the received training data.

The receiving step may include changing attribute values to values between 0 and 1 through a normalization process, in order to avoid becoming trapped in a local optimum.

The normalization process differs according to the data type, which is nominal, numeric, or binary. Nominal data, which is categorical character data, is encoded as integers, and the encoded integers are converted into one-hot vectors. Numeric data undergoes min-max normalization, which brings the attributes to a common scale without distorting differences in the ranges of their values. Binary data is not normalized.

The detecting step may include removing overlapping features (redundant features) and features irrelevant to learning (irrelevant features) using Hybrid Feature Selection (HFS), which takes the intersection of the attribute subsets output by the single feature selection algorithms.

The detecting step may include removing overlapping features using the Pearson Based Feature Selection method, and removing features irrelevant to learning using the Feature Importance Based Feature Selection method and the Attribute Ratio Based Feature Selection method.

The detecting step may include treating two features whose Pearson correlation coefficient is at or above a preset criterion as overlapping and selecting one of the two. The Pearson correlation coefficient is defined by the following equation and takes a value between -1 and 1:

[Equation]

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

In the equation, cov is the covariance, $\sigma_X$ is the standard deviation of population X, and $\sigma_Y$ is the standard deviation of population Y. A value closer to 1 means the two features are positively correlated, and a value closer to -1 means they are negatively correlated.

The detecting step may include training a decision tree model and then determining the importance of each feature used in training on the basis of information gain, wherein the decision tree preferentially splits nodes on the feature that maximizes the information gain.

The detecting step may include extracting feature importances through a random forest learning model, sorting the extracted feature importances according to a preset criterion, and selecting the features whose importance is at or above a threshold.

The detecting step may include selecting features whose importance, calculated from the frequency and mean value of the feature, is at or above a threshold.

The detecting step may include applying oversampling and undersampling: the number of samples in the majority class, which accounts for half of the input test data, is reduced through the RUS (Random Under Sampling) technique, and the number of samples in the minority classes Probe, U2R, and R2L is increased through the SMOTE (Synthetic Minority Over-sampling Technique) technique.

According to another aspect, the present invention provides a computer-readable recording medium storing a computer program that, when executed by a processor, performs the steps of: constructing a deep neural network-based network intrusion detection model through feature selection and data balancing; receiving test data for network intrusion detection as input to the constructed network intrusion detection model; and detecting network intrusion in the test data using the network intrusion detection model.

According to yet another aspect, the present invention provides a computer program stored in a computer-readable recording medium, the computer program including instructions that, when executed by a processor, cause the processor to perform operations comprising: constructing a deep neural network-based network intrusion detection model through feature selection and data balancing; receiving test data for network intrusion detection as input to the constructed network intrusion detection model; and detecting network intrusion in the test data using the network intrusion detection model.

According to yet another aspect, the present invention provides a network intrusion detection system comprising: a model construction unit that constructs a deep neural network-based network intrusion detection model through feature selection and data balancing; a data input unit that receives test data for network intrusion detection as input to the constructed network intrusion detection model; and an intrusion detection unit that detects network intrusion in the test data using the network intrusion detection model.

According to an embodiment of the present invention, network intrusion detection performance can be improved without distorting the performance of the learning model.

According to an embodiment of the present invention, hybrid feature selection reduces the input dimension while maintaining performance, and oversampling improves the detection rate of minority classes.

FIG. 1 is a diagram for explaining a data set in an embodiment of the present invention.
FIG. 2 is a diagram for explaining the model structure of a network intrusion detection system according to an embodiment of the present invention.
FIG. 3 is a table showing features having a Pearson correlation of 0.9 or more in an embodiment of the present invention.
FIG. 4 is a table showing the feature set selected through HFS in an embodiment of the present invention.
FIG. 5 is a table showing the percentage and number of samples of the NSL-KDD training data in an embodiment of the present invention.
FIG. 6 is a table showing an example of a neural network configuration in an embodiment of the present invention.
FIG. 7 is a block diagram for explaining the configuration of a network intrusion detection system according to an embodiment of the present invention.
FIG. 8 is a flowchart for explaining a network intrusion detection method in a network intrusion detection system according to an embodiment of the present invention.

An embodiment of the present invention proposes a Hybrid Feature Selection technique to improve network intrusion detection performance, and describes how network intrusion detection is performed with an HFS-DNN (Deep Neural Network) model that applies the hybrid feature selection technique to a deep neural network.

FIG. 1 is a diagram for explaining a data set in an embodiment of the present invention.

Referring to FIG. 1, the network intrusion detection system feeds a training data set to the deep neural network-based network intrusion detection model built on feature selection and data balancing (the HFS-DNN model), and trains the HFS-DNN model so that unnecessary attributes are removed through feature selection and data balancing applied to the training data.

The network intrusion detection system then receives test data for network intrusion detection as input to the trained model and uses the model to detect network intrusion in the test data.

In this embodiment, the NSL-KDD data set may be used. The NSL-KDD data set used for the HFS-DNN model is an improved version of the KDD CUP 99 data set, which was created in 1999 through the DARPA intrusion detection evaluation program by modeling a U.S. Air Force network and simulating 38 kinds of network intrusion attacks.

The KDD CUP 99 data set is excessively large and contains many duplicate records, which means the data is heavily skewed toward frequently occurring attacks. The NSL-KDD data set resolves these problems, but class imbalance remains.

The NSL-KDD data set consists of 42 attributes including the label. Rather than classifying all 38 individual attacks, the goal may be to classify five classes in total: the four attack types shown in FIG. 2 and one normal state.

The training data set and the test data set are provided separately. The training data set contains only 24 attack types, so there is a large gap between the two data sets.

FIG. 2 is a diagram for explaining the model structure of a network intrusion detection system according to an embodiment of the present invention.

FIG. 2 shows the structure of the HFS-DNN model, which may include a data preprocessing unit (Data Preprocessing) 110, a hybrid feature selection unit (Hybrid Feature Selection) 120, a data balancing unit (Data Balancing) 130, and a deep neural network (DNN) 140.

The data preprocessing unit 110 may apply a different normalization process to each data type. In deep neural networks, normalizing the input data is known to speed up training and to help avoid becoming trapped in a local optimum.

The NSL-KDD data set is likewise normalized before training, so that all attribute values are changed to values between 0 and 1. The normalization process depends on the data type; the NSL-KDD data set contains three types of data: nominal, numeric, and binary.

Nominal data is categorical character data and cannot be used directly as neural network input. It is therefore encoded as integers and then converted into one-hot vectors.

Numeric data may undergo min-max normalization, which brings the attributes to a common scale without distorting differences in the ranges of their values. Binary data consists only of 0s and 1s, so no further preprocessing is performed. Through this process the 41 input dimensions become 122 input dimensions; the increase comes from the one-hot representation of the nominal data.

Inspection of the data then shows that the num_outbound_cmds feature has a standard deviation of 0, that is, the same value in every record. It is judged unnecessary for learning and may be removed in advance, which leaves a final total of 121 input dimensions.
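The three type-dependent preprocessing rules above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the implementation of the embodiment; the helper names (`one_hot`, `min_max`, `is_constant`) and the toy values are assumptions, and only the column names come from NSL-KDD.

```python
from typing import List, Sequence


def one_hot(values: Sequence[str]) -> List[List[int]]:
    """Integer-encode categorical strings, then expand each to a one-hot vector."""
    categories = sorted(set(values))                 # stable category order
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale a numeric column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def is_constant(values: Sequence[float]) -> bool:
    """True when the column has zero standard deviation (all values equal)."""
    return len(set(values)) <= 1


# Toy records: column names follow NSL-KDD, values are made up.
protocol_type = ["tcp", "udp", "icmp", "tcp"]   # nominal
src_bytes = [0.0, 500.0, 1000.0, 250.0]         # numeric
logged_in = [0, 1, 1, 0]                        # binary: left untouched
num_outbound_cmds = [0.0, 0.0, 0.0, 0.0]        # constant column

encoded_protocol = one_hot(protocol_type)
scaled_bytes = min_max(src_bytes)
drop_outbound = is_constant(num_outbound_cmds)
```

One nominal column with three categories becomes three input dimensions, which is how the 41 original attributes grow to 122 before the constant column is dropped.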

The hybrid feature selection unit 120 may include the three single feature selection techniques used in Hybrid Feature Selection (HFS) and may ensemble all of them.

Real-time detection on high-volume network traffic requires finding a smaller attribute subset that still guarantees learning performance.

This embodiment proposes a hybrid feature selection technique and shows that it maintains the accuracy of the learning model with a smaller attribute subset than single feature selection algorithms. The technique obtains the attribute subset output by each single feature selection algorithm and then uses the intersection of these subsets, which improves efficiency in a relatively simple way.

The hybrid feature selection technique groups three feature selection algorithms (Pearson Based Feature Selection, Feature Importance Based Feature Selection, and Attribute Ratio Based Feature Selection) according to two purposes: removing overlapping features (redundant features) and removing features irrelevant to learning (irrelevant features).

FIG. 2 shows an example in which the three feature selection algorithms are grouped within the hybrid feature selection unit 120.

a) Pearson Based Feature Selection

Feature selection using the Pearson correlation coefficient treats two features with a high correlation coefficient as overlapping and uses only one of them. The Pearson correlation coefficient is defined by Equation 1 below and takes a value between -1 and 1.

[Equation 1]

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

In Equation 1, cov denotes the covariance, $\sigma_X$ denotes the standard deviation of population X, and $\sigma_Y$ denotes the standard deviation of population Y. A value closer to 1 indicates a positive correlation, and a value closer to -1 indicates a negative correlation. A value near 0 means there is no correlation.
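As a worked illustration of Equation 1, the sketch below computes the population Pearson coefficient directly from its definition. The function name `pearson` and the sample values are assumptions made for the example.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Population Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (sigma_x * sigma_y)


r_pos = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # the columns rise together
r_neg = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # one rises while the other falls
```

The constant-column guard matters in practice: a feature such as num_outbound_cmds, whose standard deviation is 0, has no defined correlation and would otherwise cause a division by zero.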

Because the Pearson correlation coefficient expresses the correlation between continuous variables, feature selection with it is applied only to the numeric attributes of the NSL-KDD data set. The relationships among features were analyzed using a threshold (for example, 0.9) and can be represented as an undirected graph. The threshold may be set by a user or by a computer.

Selecting the smallest sub-feature set within this graph minimizes the size of the input feature set, which reduces to the minimum dominating set problem.

The minimum dominating set problem is NP-hard. In the embodiment, however, the result consists of four complete graphs, as shown in FIG. 3, so keeping any single attribute from each graph is guaranteed to give the minimum.

In this embodiment, only the first attribute in each of the four rows of FIG. 3 is used, and the remaining overlapping features are removed. This leaves 113 features. FIG. 3 is a table of the features whose Pearson correlation is 0.9 or more.
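The graph-based pruning described above can be sketched as follows. The sketch assumes, as the embodiment reports, that every group of correlated features forms a complete graph, so keeping one feature per connected component is enough; on a general graph this shortcut would not solve the minimum dominating set problem. The function name, the placeholder feature names `a` to `e`, and the correlation values are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def prune_correlated(features: List[str],
                     corr: Dict[FrozenSet[str], float],
                     threshold: float = 0.9) -> Tuple[List[str], List[str]]:
    """Keep one feature per group of mutually correlated features."""
    # Undirected graph: an edge joins two features whose correlation >= threshold.
    neighbours: Dict[str, Set[str]] = {f: set() for f in features}
    for pair, r in corr.items():
        if r >= threshold:
            a, b = tuple(pair)
            neighbours[a].add(b)
            neighbours[b].add(a)

    kept: List[str] = []
    dropped: List[str] = []
    seen: Set[str] = set()
    for f in features:                       # list order decides the representative
        if f in seen:
            continue
        # Collect the connected component that contains f.
        component, stack = {f}, [f]
        while stack:
            for n in neighbours[stack.pop()]:
                if n not in component:
                    component.add(n)
                    stack.append(n)
        seen |= component
        kept.append(f)
        dropped.extend(x for x in features if x in component and x != f)
    return kept, dropped


# Placeholder features and made-up correlations.
features = ["a", "b", "c", "d", "e"]
corr = {
    frozenset({"a", "b"}): 0.95,
    frozenset({"a", "c"}): 0.93,
    frozenset({"b", "c"}): 0.97,
    frozenset({"d", "e"}): 0.40,
}
kept, dropped = prune_correlated(features, corr)
```

Here `a`, `b`, and `c` form one complete graph, so only `a` survives, while `d` and `e` stay because their correlation is below the threshold.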

b) Feature Importance Based Feature Selection

Feature selection using feature importance trains a decision tree model and then determines the importance of each feature used in training on the basis of information gain.

A decision tree splits nodes first on the feature that maximizes information gain. A larger importance value for a node therefore means a larger reduction in impurity at that node.
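To make the notion of information gain concrete, the sketch below computes the entropy reduction for a single split on a categorical feature. It illustrates the quantity a decision tree maximizes; it is not the random forest procedure of the embodiment, and the function names and toy labels are assumptions.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature: Sequence[int], labels: Sequence[str]) -> float:
    """Entropy reduction obtained by splitting on a categorical feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


labels = ["normal", "normal", "attack", "attack"]
gain_informative = information_gain([0, 0, 1, 1], labels)   # separates the classes
gain_useless = information_gain([0, 1, 0, 1], labels)       # carries no class signal
```

A feature that separates the classes perfectly removes all of the label entropy, while a feature unrelated to the label removes none, which is why the former is split on first and ranks higher in importance.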

In this embodiment, feature importances are extracted through a random forest learning model and sorted, and features are selected using a threshold (for example, 0.0001). The threshold may be set by a user or by a computer. For example, the top 55 features may be selected. FIG. 4 shows an example of the feature set selected through hybrid feature selection.

c) Attribute Ratio Based Feature Selection

Attribute Ratio (AR) based feature selection is a newer, less common approach than the two methods above (Pearson based and feature importance based feature selection). It calculates feature importance from the frequency and mean value of each feature, and it trains well on the NSL-KDD data set even with very few features.

In the embodiment, features may be selected using a threshold (for example, 0.1). The threshold may be set by a user or by a computer. For example, the top 50 features may be selected.
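The text does not give the formula for the attribute ratio, so the sketch below rests on an assumed reading of "frequency and mean value": for a numeric attribute, the largest ratio of a class mean to the overall mean; for a binary attribute, the largest per-class ratio of the count of 1s to the count of 0s. The function names, feature names, and values are hypothetical, and the formula of the embodiment may differ.

```python
from typing import Dict, List, Sequence


def class_ratio_numeric(values: Sequence[float], labels: Sequence[str]) -> float:
    """Assumed AR for a numeric attribute: max over classes of class mean / overall mean."""
    overall = sum(values) / len(values)
    if overall == 0:
        return 0.0
    by_class: Dict[str, List[float]] = {}
    for v, l in zip(values, labels):
        by_class.setdefault(l, []).append(v)
    return max(sum(vs) / len(vs) / overall for vs in by_class.values())


def class_ratio_binary(values: Sequence[int], labels: Sequence[str]) -> float:
    """Assumed AR for a binary attribute: max over classes of count(1) / count(0)."""
    by_class: Dict[str, List[int]] = {}
    for v, l in zip(values, labels):
        by_class.setdefault(l, []).append(v)
    ratios = []
    for vs in by_class.values():
        ones = sum(vs)
        zeros = len(vs) - ones
        ratios.append(ones / zeros if zeros else float("inf"))
    return max(ratios)


labels = ["normal", "normal", "attack", "attack"]
scores = {
    "numeric_feat": class_ratio_numeric([0.1, 0.1, 0.9, 0.9], labels),
    "binary_feat": class_ratio_binary([0, 0, 0, 1], labels),
    "flat_feat": class_ratio_binary([0, 0, 0, 0], labels),
}
threshold = 0.1                               # example threshold quoted in the text
selected = [name for name, s in scores.items() if s >= threshold]
```

Under this reading, a feature whose values concentrate in one class scores high, while a feature that is almost always 0 in every class scores near 0 and falls below the threshold.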

FIG. 3 shows the features related by a Pearson correlation coefficient of 0.9 or more.

Next to each feature is its feature importance rank obtained from the random forest learning model. The ranks show that highly correlated features all have similar ranks and that all of them are among the top 55 features chosen by feature importance based selection. This means that feature selection by feature importance alone cannot take such overlapping features into account.

It follows that 8 more features can be removed from the set of 55 features. The result is the intersection of the subsets produced by these feature selection techniques.

The hybrid feature selection presented in this embodiment filters out features from both of these perspectives. By taking the intersection of the three feature selection techniques presented above, it obtains an input feature set that is smaller than that of any single technique without impairing learning performance. The 39 features shown in FIG. 4 are finally used, which is about 32% of the original 121 input dimensions.
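The intersection step is simple enough to show directly. In the sketch below, the feature names `f0` to `f8` and the three selections are invented for illustration; only the figures 39 and 121 are taken from the text.

```python
from typing import Set


def hybrid_select(*subsets: Set[str]) -> Set[str]:
    """Return the features chosen by every single selector (set intersection)."""
    if not subsets:
        return set()
    result = set(subsets[0])
    for s in subsets[1:]:
        result &= s
    return result


# Invented selections from the three single techniques.
pearson_kept = {"f0", "f1", "f2", "f3", "f5", "f6", "f8"}
importance_kept = {"f0", "f1", "f2", "f4", "f5", "f6"}
attribute_ratio_kept = {"f0", "f2", "f5", "f6", "f7"}

selected = sorted(hybrid_select(pearson_kept, importance_kept, attribute_ratio_kept))
share_percent = round(39 / 121 * 100)   # 39 of 121 input dimensions, as quoted above
```

A feature survives only if all three techniques keep it, so the result can never be larger than the smallest single selection.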

The data balancing unit 130 may resolve the imbalance problem of the NSL-KDD data set used for training. Imbalanced data refers to data in which the number of samples in a minority class is significantly smaller than the number of samples in a majority class.

The performance of a machine learning model depends on its data, and applying imbalanced data to a learning model without resolving the imbalance may degrade classification performance. In particular, the detection rate of minority classes drops sharply, because the region of a minority class is encroached upon by the majority class.

The NSL-KDD data set used for training in this embodiment also shows a very large difference in the number of samples between classes, as shown in FIG. 5. To solve this problem, the present embodiment may resolve the imbalance through an over-sampling technique and an under-sampling technique.

The Normal class, which accounts for half of the data, is the majority class, and its number of samples is reduced through the RUS (Random Under Sampling) technique. The Probe, U2R, and R2L classes, which are relatively minor, may have their sample counts increased to a similar level through the SMOTE (Synthetic Minority Over Sampling Technique) technique.

SMOTE is a technique for resolving imbalanced data. Centered on a randomly chosen sample of a minority class, it uses the KNN algorithm to gather a plurality of (e.g., k, where k is a natural number) neighboring samples and generates new synthetic data between those k samples. Referring to FIG. 5, the number and ratio of samples in the data set after the imbalance is resolved can be checked, and an improvement in the detection rate of the minority classes of the classification model can be expected.
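The two sampling steps can be illustrated with a small standard-library sketch. It is not the embodiment's implementation: the helper names, the two-dimensional toy points, the target size of 30, and k = 3 are all assumptions chosen for the example.

```python
import math
import random


def random_under_sample(samples, target, rng):
    """RUS: keep a random subset of at most `target` majority-class samples."""
    return rng.sample(samples, min(target, len(samples)))


def smote(samples, target, k, rng):
    """SMOTE-style over-sampling: interpolate between a minority sample
    and one of its k nearest neighbours until `target` samples exist."""
    if len(samples) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    synthetic = []
    while len(samples) + len(synthetic) < target:
        base = rng.choice(samples)
        # k nearest neighbours of `base` by Euclidean distance (excluding itself)
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return samples + synthetic


rng = random.Random(0)
# Toy data: 100 "Normal" points and 5 "U2R" points (hypothetical values).
normal = [(rng.random(), rng.random()) for _ in range(100)]
u2r = [(5 + rng.random(), 5 + rng.random()) for _ in range(5)]

balanced_normal = random_under_sample(normal, 30, rng)
balanced_u2r = smote(u2r, 30, k=3, rng=rng)
```

Because each synthetic point lies on a segment between two existing minority samples, the new data stay inside the region already occupied by that class.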

A deep neural network 140 may be used as the network intrusion detection classification model. FIG. 6 is a table showing an example of the neural network configuration. The structure of the proposed deep neural network can be confirmed in FIG. 6, and ReLU may be used as the activation function of the hidden layers.

Initial weight setting plays a very important role in neural network training, because a poor choice can lead to problems such as vanishing gradients. The widely used and frequently cited Xavier initializer, when used together with the ReLU function, has the problem that output values approach 0 as the layers become deeper.

Therefore, the initial values of the neural network are set using the He initializer, which addresses this problem. In addition, L2 regularization is used to prevent overfitting to the training data.
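As a rough illustration of the difference between the two initializers, the sketch below draws weights from a normal distribution with the commonly used standard deviations, sqrt(2 / fan_in) for He and sqrt(2 / (fan_in + fan_out)) for Xavier, and computes an L2 penalty term. The layer size (39 inputs, 64 units) and the regularization coefficient are assumptions for the example; the actual layer sizes are those of FIG. 6.

```python
import math
import random


def he_normal(fan_in, fan_out, rng):
    """He initialization: zero-mean normal with std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def xavier_normal(fan_in, fan_out, rng):
    """Xavier (Glorot) initialization: std = sqrt(2 / (fan_in + fan_out))."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def l2_penalty(weights, lam):
    """L2 regularization term: lam times the sum of squared weights."""
    return lam * sum(w * w for row in weights for w in row)


rng = random.Random(0)
# Hypothetical first hidden layer: 39 selected features -> 64 units.
weights = he_normal(39, 64, rng)
penalty = l2_penalty(weights, lam=0.01)  # added to the training loss
```

The He variant uses a larger variance than Xavier for the same layer, which compensates for ReLU zeroing out roughly half of the activations.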

FIG. 7 is a block diagram illustrating the configuration of a network intrusion detection system according to an embodiment, and FIG. 8 is a flowchart illustrating a network intrusion detection method performed in the network intrusion detection system according to an embodiment.

The processor of the network intrusion detection system 100 may include a model construction unit 710, a data input unit 720, and an intrusion detection unit 730.

These components of the processor may be representations of different functions performed by the processor according to control instructions provided by program code stored in the network intrusion detection system.

The processor and the components of the processor may control the network intrusion detection system to perform steps 810 to 830 included in the network intrusion detection method of FIG. 8. Here, the processor and the components of the processor may be implemented to execute instructions according to the code of an operating system and the code of at least one program contained in a memory.

The processor may load, into the memory, program code stored in a file of a program for the network intrusion detection method. For example, when the program is executed in the network intrusion detection system, the processor may control the network intrusion detection system to load the program code from the program file into the memory under the control of the operating system. Here, the processor and each of the model construction unit 710, the data input unit 720, and the intrusion detection unit 730 included in the processor may be different functional representations of the processor for executing the subsequent steps 810 to 830 by executing the instructions of the corresponding portion of the program code loaded into the memory.

In step 810, the model construction unit 710 may construct a deep neural network-based network intrusion detection model through feature selection and data balancing. The model construction unit 710 may receive training data as input to this model and may train the model so that unnecessary attributes are removed through feature selection and data balancing on the received training data.

In step 820, the data input unit 720 may receive test data for network intrusion detection as input to the constructed network intrusion detection model. The data input unit 720 may change attribute values to values between 0 and 1 through a normalization process in order to prevent the received test data from being excluded from the local optimum.

Here, the normalization process proceeds differently depending on the data type, which includes nominal, numeric, and binary. For the nominal type, categorical character data are encoded into integer data, and the encoded integer data are converted into one-hot vectors. For the numeric type, min-max normalization is performed to bring numeric data onto a common scale without distorting differences in the ranges of attribute values. For the binary type, no normalization is performed.
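The type-dependent normalization described above could look roughly like the following. The helper names and sample values are assumptions for illustration. In a real pipeline the category list and the minimum and maximum would normally be fixed from the training data and reused for the test data.

```python
def one_hot(values, categories=None):
    """Nominal data: integer-encode each category, then expand to one-hot vectors."""
    categories = sorted(set(values)) if categories is None else list(categories)
    index = {c: i for i, c in enumerate(categories)}  # integer encoding
    return [[1 if index[v] == i else 0 for i in range(len(categories))] for v in values]


def min_max(values):
    """Numeric data: rescale to [0, 1] while preserving relative differences."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample columns.
protocol_type = ["tcp", "udp", "icmp", "tcp"]   # nominal
src_bytes = [0, 50, 100]                        # numeric
land = [0, 1, 0]                                # binary: left unchanged

protocol_encoded = one_hot(protocol_type)       # categories sorted: icmp, tcp, udp
src_bytes_scaled = min_max(src_bytes)
```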

In step 830, the intrusion detection unit 730 may detect a network intrusion in the test data for network intrusion detection using the network intrusion detection model.

The intrusion detection unit 730 may remove overlapping features (redundant features) and features irrelevant to learning (irrelevant features) using a hybrid feature selection (HFS) technique that uses the intersection of the attribute subsets output by the individual feature selection algorithms.

The intrusion detection unit 730 may remove overlapping features using the Pearson-based feature selection method, and may remove features irrelevant to learning using the feature-importance-based feature selection method and the attribute-ratio-based feature selection method.

The intrusion detection unit 730 may regard two features having a Pearson correlation coefficient greater than or equal to a preset criterion as being in an overlapping relationship, and may select one of the two features so regarded. Here, the Pearson correlation coefficient is defined by Equation 1 described above and takes a value between -1 and 1. In Equation 1, cov denotes the covariance, σ_X denotes the standard deviation of population X, and σ_Y denotes the standard deviation of population Y. A value closer to 1 means that the two features are positively correlated, and a value closer to -1 means that the two features are negatively correlated.
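The overlap filter can be sketched with the standard definition of the Pearson coefficient, cov(X, Y) / (σ_X σ_Y). The function names and toy columns are assumptions. Keeping the first feature of each overlapping pair is also an assumption, since the text only says that one of the two is selected.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sigma_x == 0 or sigma_y == 0:  # constant column: correlation undefined
        return 0.0
    return cov / (sigma_x * sigma_y)


def drop_overlapping(columns, threshold=0.9):
    """Keep a feature only if its correlation with every kept feature is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(pearson(values, columns[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns: "b" is almost a multiple of "a".
columns = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.1],
    "c": [4, 1, 3, 2],
}
kept = drop_overlapping(columns, threshold=0.9)
```

Here "a" and "b" correlate at nearly 1, so only "a" survives, while "c" (correlation -0.4 with "a") is kept.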

After training a decision tree model, the intrusion detection unit 730 may determine the importance of each feature used for training based on information gain. Here, the decision tree preferentially splits nodes on the feature that maximizes the information gain.
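Information gain for a candidate split can be computed as the entropy of the labels minus the weighted entropy of the groups produced by the split. The sketch below uses Shannon entropy, which is one common choice; the text does not state which impurity measure is used. The labels and feature values are toy assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting `labels` on `feature_values`."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy data: feature f1 separates the classes perfectly, f2 not at all.
labels = ["normal", "normal", "attack", "attack"]
features = {"f1": [0, 0, 1, 1], "f2": [0, 1, 0, 1]}

gains = {name: information_gain(values, labels) for name, values in features.items()}
best_feature = max(gains, key=gains.get)  # feature the tree would split on first
```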

The intrusion detection unit 730 may extract feature importances through a random forest learning model, sort the extracted feature importances according to a preset criterion, and select the features whose importance is greater than or equal to a threshold value.
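Once importance scores are available, the sort-and-threshold step is straightforward. In the sketch below the scores are made-up numbers standing in for the output of a trained random forest, and descending order is assumed as the sorting criterion.

```python
def select_by_importance(importances, threshold):
    """Sort features by importance (descending) and keep those at or above threshold."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


# Hypothetical importance scores (not taken from the embodiment).
importances = {"land": 0.01, "dst_bytes": 0.25, "src_bytes": 0.30, "flag_SF": 0.12}
selected_features = select_by_importance(importances, threshold=0.1)
```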

The intrusion detection unit 730 may select features whose feature importance, calculated from the frequency and the mean value of the feature, is greater than or equal to a threshold value. Using an over-sampling technique and an under-sampling technique, the intrusion detection unit 730 may reduce the number of samples of the majority class, which accounts for half of the received test data, through the RUS (Random Under Sampling) technique, and may increase the number of samples of the Probe, U2R, and R2L classes, which are minority classes, through the SMOTE (Synthetic Minority Over Sampling Technique) technique.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions.

A processing device may run an operating system (OS) and one or more software applications executed on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described as being used, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to limited embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

110: data pre-processing unit
120: hybrid feature selection unit
130: data balancing unit
140: deep neural network
710: model construction unit
720: data input unit
730: intrusion detection unit

Claims

A network intrusion detection method performed by a network intrusion detection system, the method comprising:
constructing a deep neural network-based network intrusion detection model through feature selection and data balancing;
receiving test data for network intrusion detection as input to the constructed network intrusion detection model; and
detecting a network intrusion in the test data using the network intrusion detection model.

The network intrusion detection method according to claim 1,
wherein the constructing comprises:
receiving training data as input to the deep neural network-based network intrusion detection model, and training the model so that unnecessary attributes are removed through feature selection and data balancing on the received training data.

The network intrusion detection method according to claim 1,
wherein the receiving comprises:
changing attribute values to values between 0 and 1 through a normalization process in order to prevent the test data from being excluded from the local optimum.

The network intrusion detection method according to claim 3,
wherein the normalization process
proceeds differently depending on the data type, which includes nominal, numeric, and binary,
wherein, for the nominal data type, categorical character data are encoded into integer data and the encoded integer data are converted into one-hot vectors,
wherein, for the numeric data type, min-max normalization is performed to bring numeric data onto a common scale without distorting differences in the ranges of attribute values, and
wherein, for the binary data type, no normalization is performed on the binary data.

The network intrusion detection method according to claim 1,
wherein the detecting comprises:
removing overlapping (redundant) features and features irrelevant to learning using hybrid feature selection (HFS), which uses the intersection of the attribute subsets output by the individual feature selection algorithms.

The network intrusion detection method according to claim 5,
wherein the detecting comprises:
removing overlapping features using a Pearson-based feature selection method, and removing features irrelevant to learning using a feature-importance-based feature selection method and an attribute-ratio-based feature selection method.

The network intrusion detection method according to claim 6,
wherein the detecting comprises:
regarding two features having a Pearson correlation coefficient greater than or equal to a preset criterion as being in an overlapping relationship, and selecting one of the two features so regarded,
wherein the Pearson correlation coefficient is defined by Equation 1 and takes a value between -1 and 1,
[Equation 1]
\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}

and wherein, in Equation 1, cov is the covariance, \sigma_X is the standard deviation of population X, and \sigma_Y is the standard deviation of population Y, a value closer to 1 meaning that the two features are positively correlated and a value closer to -1 meaning that the two features are negatively correlated.

The network intrusion detection method according to claim 6,
wherein the detecting comprises:
after training a decision tree model, determining the importance of each feature used for training based on information gain,
wherein the decision tree preferentially splits nodes on the feature that maximizes the information gain.

The network intrusion detection method according to claim 8,
wherein the detecting comprises:
extracting feature importances through a random forest learning model, sorting the extracted feature importances according to a preset criterion, and selecting the features whose importance is greater than or equal to a threshold value.

The network intrusion detection method according to claim 6,
wherein the detecting comprises:
selecting features whose feature importance, calculated from the frequency and the mean value of the feature, is greater than or equal to a threshold value.

The network intrusion detection method according to claim 5,
wherein the detecting comprises:
using an over-sampling technique and an under-sampling technique, reducing the number of samples of the majority class, which accounts for half of the received test data, through the RUS (Random Under Sampling) technique, and increasing the number of samples of the Probe, U2R, and R2L classes, which are minority classes, through the SMOTE (Synthetic Minority Over Sampling Technique) technique.

A computer-readable recording medium storing a computer program,
wherein the computer program, when executed by a processor, comprises instructions for causing the processor to perform a method comprising:
constructing a deep neural network-based network intrusion detection model through feature selection and data balancing;
receiving test data for network intrusion detection as input to the constructed network intrusion detection model; and
detecting a network intrusion in the test data using the network intrusion detection model.

A computer program stored on a computer-readable recording medium,
wherein the computer program, when executed by a processor, comprises instructions for causing the processor to perform a method comprising:
constructing a deep neural network-based network intrusion detection model through feature selection and data balancing;
receiving test data for network intrusion detection as input to the constructed network intrusion detection model; and
detecting a network intrusion in the test data using the network intrusion detection model.

A network intrusion detection system comprising:
a model construction unit that constructs a deep neural network-based network intrusion detection model through feature selection and data balancing;
a data input unit that receives test data for network intrusion detection as input to the constructed network intrusion detection model; and
an intrusion detection unit that detects a network intrusion in the test data using the network intrusion detection model.