KR100686399B1

KR100686399B1 - Lightweight intrusion detection method through correlation based hybrid feature selection

Info

Publication number: KR100686399B1
Application number: KR1020060021669A
Authority: KR
Inventors: 박종서; 김동성; 모함메드 사자드 카자; 노봉남
Original assignee: 전남대학교산학협력단; 박종서; 김동성; 모함메드 사자드 카자
Priority date: 2006-03-08
Filing date: 2006-03-08
Publication date: 2007-02-26

Abstract

A lightweight intrusion detection method using correlation-based hybrid feature selection on a computer is provided. It markedly reduces training and testing time while keeping the feature selection result stable, the false positive rate low, and the detection rate high. Audit data is preprocessed and divided into a training dataset and a testing dataset. The training dataset is further divided into a feature selection dataset, which is processed by the correlation-based hybrid feature selection process to yield a set of selected features; a model building dataset, which is used to construct an intrusion detection model from the selected features; and a validation dataset. The intrusion detection model is validated with the validation dataset and then tested with the testing dataset. The output of the correlation-based hybrid feature selection process is passed through an adder to produce a modeling dataset with reduced features and a validation dataset with reduced features, and these two datasets yield a classification model through machine learning and validation.

Description

Lightweight Intrusion Detection Method Through Correlation Based Hybrid Feature Selection

FIG. 1 is an overall configuration diagram of the lightweight intrusion detection system using correlation-based hybrid feature selection according to the present invention.

FIG. 2 illustrates a chromosome representing a feature vector, that is, a set of selected features.

FIG. 3 is a flowchart of the correlation-based hybrid feature selection algorithm of the present invention.

FIG. 4 plots model building time against dataset index.

FIG. 5 plots model testing time against dataset index.

FIG. 6 plots detection rate against dataset index.

FIG. 7 plots false positive rate against dataset index.

* Explanation of reference numerals for the main parts of the drawings *

10: audit data    11: preprocessed audit data

20: training dataset    30: testing dataset

40: modeling dataset    50: validation dataset

The present invention relates to a lightweight intrusion detection method using correlation-based hybrid feature selection on a computer, and more particularly to a new approach for modeling a lightweight intrusion detection system (IDS) built on a new feature selection technique called correlation-based hybrid feature selection.

Existing approaches to modeling intrusion detection systems (IDS) have focused on new techniques that improve existing systems through classification algorithms and feature selection. With respect to modeling, many studies have proposed intrusion detection models based on classification algorithms, such as clustering algorithms, and on various soft computing approaches, including artificial neural networks (ANN), hidden Markov models (HMM), support vector machines (SVM), k-means clustering, and fuzzy techniques.

Multi-stage intrusions have been modeled with hidden Markov models (HMM), and that work showed that HMMs outperform classical machine learning methods such as decision trees and artificial neural networks. The main drawback of modeling an intrusion detection system with an HMM is that it requires a large amount of computational resources.

Support vector machine (SVM) techniques have also been applied to intrusion detection systems, where they are used for anomaly detection. Several kinds of SVM have been shown to distinguish normal network traffic from abnormal network traffic. In that work, anomalous behavior is predicted by training an unsupervised SVM on normal, attack-free data. Another line of work uses a robust SVM for intrusion detection and compares it with a conventional SVM and with the k-nearest neighbor technique.

Meanwhile, efforts have been made to identify important features or feature sets in order to minimize the load on the detection model while maximizing the detection rate. Feature selection, in pattern recognition, means choosing from among the features extracted from a target pattern those most useful for discrimination when the features are redundant. Methods for identifying important features have been proposed following the wrapper and filter approaches.

The wrapper method applies a machine learning algorithm to evaluate the quality of a feature or feature set. It uses the classification rate of the learning algorithm as its evaluation measure and tends to select more suitable features. The filter method, in contrast, does not use any machine learning algorithm for feature selection. Instead it relies on the underlying characteristics of the training data to evaluate the relevance of a feature or feature set through independent measures such as distance measures, correlation measures, and consistency measures.

Although several feature selection techniques are used in web mining, data mining, and speech recognition, very little related work exists in the field of intrusion detection.

Some work has sought to identify and rank features by importance in order to detect specific attack types such as Probe, DoS (Denial of Service), R2L (Remote-to-Local), and U2R (User-to-Root). It used backward selection as the feature selection algorithm together with support vector machines and neural networks (NN), and proposed rules for ranking features according to their weights with respect to classification accuracy and training and testing rates. A feature selection method based on a genetic algorithm (GA) has also been proposed. In addition, several PCA (Principal Component Analysis) and ICA (Independent Component Analysis) approaches have been proposed to reduce the load on intrusion detection systems and increase the detection rate.

However, these approaches still carry a heavy load, because the classification and cross-validation needed to minimize generalization error require a large amount of computation. They tried to improve system performance merely by improving the classification algorithm through changes to parameters and initial conditions. Because this handles all of the given features, relevant or not, it incurs even more computation. Moreover, since the feature set chosen by feature selection varies with the detection model, these approaches are poorly suited to implementing a lightweight intrusion detection system.

As described above, there are two ways to improve the performance of a conventional intrusion detection system: tuning the parameters of the intrusion detection model, and selecting features of the audit data. The former, which optimizes the parameters for each new detection algorithm, is a nearly mature technology. The latter is key to raising the detection rate while lowering the processing load on the audit data, and it still needs much more research than the former.

Audit data features can be handled by a wrapper approach or a filter approach. The wrapper approach removes one feature from the audit data and then repeatedly trains and tests the intrusion detection algorithm to judge how important that feature is. With a good detection algorithm it can select good features, but the training and testing needed to find them take a long time, the features are poorly discriminated from one another, and the result depends on both the performance of the detection algorithm and the characteristics of the data.

Unlike the wrapper approach, the filter approach finds important features independently of the intrusion detection algorithm, based on the correlations among the audit data features and the correlations between those features and the class of the audit data. It is faster than the wrapper approach, but because it ignores the detection algorithm, the intrusion detection rate tends to decrease.

An object of the present invention is to provide a lightweight intrusion detection system using correlation-based hybrid feature selection. The correlation-based hybrid feature selection method markedly reduces training and testing time while maintaining stable feature selection results, a low false positive rate, and a high detection rate. Experimental results on the KDD (Knowledge Discovery in Databases) 1999 intrusion detection dataset, for example, demonstrate the feasibility of the approach for modeling a lightweight intrusion detection system.

To achieve this object, the present invention comprises the steps of: dividing preprocessed audit data, obtained from audit data, into a training dataset and a testing dataset; further dividing the training dataset into a feature selection dataset, which is processed by a correlation-based hybrid feature selection process that yields a set of selected features, a model building dataset, which is used to build an intrusion detection model from the selected features, and a validation dataset; and validating the intrusion detection model with the validation dataset and then testing it with the testing dataset.

In another specific aspect of the present invention, a genetic algorithm generates an initial population of feature subsets; each subset is evaluated by the criterion of the correlation-based feature selection algorithm and the best subset is selected; the intrusion detection rate of this feature subset is obtained with a support vector machine and recorded as the best detection rate; and this procedure is repeated until no better subset is selected in the next generation or the maximum number of generations of the genetic algorithm is reached. As in all stochastic algorithms, the initial population of chromosomes is generated at random. The Merit of each chromosome is computed by correlation-based feature selection, and the chromosome with the highest Merit represents the best feature subset Sbest in the population. This subset is evaluated by the support vector machine classification algorithm, and the result is stored in θbest, which holds the evaluation metric. Genetic operators such as selection, crossover, and mutation are then applied to create a new population of chromosomes. In each generation the best chromosome, that is, the best feature subset, is compared with the previous best subset Sbest; if the new subset is better, it becomes the best subset and is evaluated by the support vector machine. If the new detection rate is higher than the previous one, that value becomes θbest and the algorithm continues; if it is not higher, Sbest is taken as the optimal feature subset. The algorithm terminates when no better subset is found in the next generation or when the maximum number of generations is reached.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

The present invention provides a new approach for modeling a lightweight intrusion detection system (IDS) built on a new feature selection technique called correlation-based hybrid feature selection. This technique markedly reduces training and testing time while maintaining stable feature selection results, a low false positive rate, and a high detection rate. Experimental results on the KDD 1999 intrusion detection dataset, for example, demonstrate the feasibility of the approach for modeling a lightweight intrusion detection system.

FIG. 1 is an overall configuration diagram of the approach developed in the present invention. Preprocessed audit data (11), obtained from audit data (10), is divided into two datasets: a training dataset (20) and a testing dataset (30). Here, an audit is a review or close examination of the records and activities of an operational data processing system, carried out to test the adequacy and effectiveness of its data security and data integrity measures.

The training dataset (20) is further divided into three sets: a feature selection dataset, a model building dataset, and a validation dataset. Feature selection, in pattern recognition, means choosing from among the features extracted from a target pattern those most useful for discrimination when the features are redundant.

First, the feature selection dataset is processed by the correlation-based hybrid feature selection process (25) of the invention, which yields a set of selected features. The result is passed through an adder (27) to produce a modeling dataset (40) with reduced features and a validation dataset (50) with reduced features. The modeling dataset (40) and the validation dataset (50) then yield a classification model (70) through machine learning and validation (52).

The model building dataset is used to build an intrusion detection model from the selected features, and the intrusion detection model is validated with the validation dataset. The model is then tested with the testing dataset (30): the unselected features are cut off (31) to give a testing dataset (60) with reduced features, which is passed to the classification model (70) described above and assigned to the respective class labels (80).
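The data flow of FIG. 1 can be sketched roughly as follows. This is only an illustration: the function names, the 30% test fraction, and the equal three-way split of the training data are assumptions, since the text gives no split ratios.

```python
import random

def split_audit_data(records, test_fraction=0.3, seed=0):
    """Split preprocessed audit records into the four datasets named in the text.

    The fractions are illustrative assumptions; the source gives no ratios.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    testing = shuffled[:n_test]
    training = shuffled[n_test:]

    # The training dataset is divided again into three parts.
    third = len(training) // 3
    return {
        "feature_selection": training[:third],
        "model_building": training[third:2 * third],
        "validation": training[2 * third:],
        "testing": testing,
    }

def keep_selected(record, selected):
    """Cut off unselected features: keep only the selected column indices."""
    return [record[i] for i in selected]
```

Applying keep_selected with the same index list to the model building, validation, and testing records gives the reduced-feature datasets (40), (50), and (60).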

For the feature selection dataset of the training dataset (20), correlation-based feature selection and correlation-based hybrid feature selection are described in turn below.

First, correlation-based feature selection is a conventional filter method. It searches for the feature subset that is most relevant to the class and contains no redundant features. The method evaluates the quality of a feature subset on the hypothesis that "a good feature subset contains features highly correlated with the class, yet uncorrelated with each other."

Two correlations are involved: the feature-class correlation and the feature-feature correlation. The feature-class correlation indicates how strongly a feature is correlated with a given class, and the feature-feature correlation is the correlation between two features. Equation (1), known as Pearson's correlation equation, gives the merit of a feature subset containing k features.

$$\mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$ --- (1)

In Equation (1), S is a feature subset containing k features, $\overline{r_{cf}}$ is the average feature-class correlation, and $\overline{r_{ff}}$ is the average feature-feature correlation.

To evaluate $\overline{r_{cf}}$ and $\overline{r_{ff}}$, the correlations between features and between features and the class must be computed. For discrete class problems, correlation-based feature selection first discretizes numeric features using the technique of Fayyad and Irani and then uses symmetrical uncertainty (a modified information gain measure) to estimate the degree of association between discrete features.
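As a minimal sketch of how the Merit could be computed, assuming it takes the usual correlation-based feature selection form k·r_cf / sqrt(k + k(k-1)·r_ff), the following hypothetical helper takes the already computed correlations as plain lists. The function name and argument layout are illustrative, not taken from the patent.

```python
import math

def cfs_merit(feature_class_corrs, feature_feature_corrs):
    """Merit of a subset from its feature-class and pairwise feature-feature correlations.

    feature_class_corrs: one correlation per feature in the subset (k values).
    feature_feature_corrs: one correlation per unordered feature pair (may be empty).
    """
    k = len(feature_class_corrs)
    if k == 0:
        return 0.0
    r_cf = sum(feature_class_corrs) / k
    r_ff = (sum(feature_feature_corrs) / len(feature_feature_corrs)
            if feature_feature_corrs else 0.0)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)
```

With two features that each correlate 0.6 with the class, the value is higher when the two features are uncorrelated with each other than when they are fully redundant, which matches the hypothesis quoted in the text.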

$$SU(X, Y) = 2 \times \frac{H(X) + H(Y) - H(X, Y)}{H(X) + H(Y)}$$ --- (2)

In Equation (2), H(X) and H(Y) are the entropies of features X and Y, and H(X, Y) is their joint entropy. Symmetrical uncertainty is used because it is a symmetric measure and can therefore measure feature-feature correlation, where there is no notion of a class attribute. For continuous class data, the correlation between attributes is the standard linear correlation, which is straightforward to compute when both attributes involved are continuous.
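Symmetrical uncertainty can be sketched for two already discretized features as below. The sketch assumes the standard definition 2·(H(X) + H(Y) − H(X, Y)) / (H(X) + H(Y)), with entropies estimated from value counts; the helper names are illustrative.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """Symmetric association measure in [0, 1] between two discrete features."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both features are constant
    h_xy = entropy(list(zip(x, y)))  # joint entropy of the value pairs
    return 2.0 * (h_x + h_y - h_xy) / (h_x + h_y)
```

Swapping the arguments leaves the value unchanged, which is why the same measure can serve for both feature-class and feature-feature correlation.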

$$r_{XY} = \frac{\sum xy}{n\,\sigma_x \sigma_y}$$ --- (3)

In Equation (3), X and Y are two continuous feature variables expressed in terms of deviations from their means (x and y), n is the number of instances, and σ_x and σ_y are the standard deviations. The other important element in computing the Merit of Equation (1) is generating the subsets. Generating feature subsets from a given feature set is an NP-hard problem whose optimal solution requires exhaustive search.

However, exhaustive search is feasible only for a small number of features. If the total number of features is n, there are 2^n subsets in all. The KDD 1999 intrusion detection dataset contains 41 features, so there are 2^41 ≈ 2.199 × 10^12 subsets, which is a very large number. Heuristic search techniques such as simulated annealing, hill climbing, best-first search, and genetic algorithms are therefore needed.
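The size of the search space can be checked with a few lines of arithmetic. The evaluation rate of one million subsets per second is a purely hypothetical figure, used only to give a sense of scale.

```python
n_features = 41
n_subsets = 2 ** n_features            # every feature is either in or out
formatted = f"{n_subsets:.3e}"         # scientific notation, as quoted in the text

# Hypothetical rate: one million subset evaluations per second.
days = n_subsets / 1_000_000 / 86_400
```

Here n_subsets is 2199023255552, and even at the assumed rate an exhaustive scan would take more than three weeks before any classifier is trained.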

The genetic algorithm is a stochastic search algorithm based on ideas from natural selection and genetics. Although it uses randomized search, it is not a purely random method. It converges toward the optimal solution by applying a fitness function to the search. It also seeks out diverse global optima by using genetic operators such as mutation and crossover.

The present invention therefore uses a genetic algorithm to generate feature subsets. The genetic algorithm operates on a population of solutions and searches that population for the best solution. To search with a genetic algorithm, a population of candidate solutions must first be initialized, or constructed.

Each candidate solution is represented as a chromosome, which consists of one or more genes, and each gene represents a feature. In the KDD 1999 intrusion detection dataset each instance consists of 41 features. Each chromosome therefore consists of 41 genes, and each gene is a binary bit indicating whether the corresponding feature is present in that chromosome. A gene value of 1 means that the corresponding feature is present in the chromosome, that is, in the feature vector.
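The bit-string encoding described above might look like this in code. The helper names and the 1-based feature numbering are illustrative choices, not part of the source.

```python
import random

N_FEATURES = 41  # number of features per instance in the KDD 1999 dataset

def random_chromosome(rng, n_features=N_FEATURES):
    """A random candidate solution: one binary gene per feature."""
    return [rng.randint(0, 1) for _ in range(n_features)]

def encode(selected, n_features=N_FEATURES):
    """Turn a list of 1-based feature numbers into a 0/1 chromosome."""
    chosen = set(selected)
    return [1 if i + 1 in chosen else 0 for i in range(n_features)]

def decode(chromosome):
    """Return the 1-based numbers of the features whose gene is 1."""
    return [i + 1 for i, bit in enumerate(chromosome) if bit == 1]
```

For example, encode([2, 3, 6, 9]) yields a 41-bit list with ones in positions 2, 3, 6, and 9, and decode recovers the same feature numbers.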

FIG. 2 illustrates a chromosome representing a feature vector, that is, a set of selected features. Each chromosome must be evaluated against an evaluation function, and the choice of evaluation function depends on the problem domain.

In the present invention, correlation-based feature selection serves as the evaluation function. It evaluates each chromosome (feature set) using the Merit described above. Because the genetic algorithm is based on "survival of the fittest," candidate chromosomes (feature sets) are chosen by roulette wheel selection. After selection, genetic operators such as crossover and mutation are applied to build the population of the best chromosomes for the next generation.
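Roulette wheel selection picks a candidate with probability proportional to its score. A minimal sketch follows, assuming non-negative Merit values; the function name and the fallback for an all-zero population are assumptions.

```python
import random

def roulette_wheel_select(population, merits, rng):
    """Pick one individual with probability proportional to its merit.

    Assumes all merits are non-negative.
    """
    total = sum(merits)
    if total <= 0:
        return rng.choice(population)  # no information: pick uniformly
    pick = rng.uniform(0, total)       # where the wheel stops
    running = 0.0
    for individual, merit in zip(population, merits):
        running += merit
        if pick < running:
            return individual
    return population[-1]              # guard against floating-point round-off
```

An individual with three times the merit of another occupies three times as much of the wheel and is therefore selected about three times as often.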

In general, a genetic algorithm can be expressed as follows.

Initialize the population.
Evaluate the initial population.
DO
    Perform selection.
    Alter the solutions to produce new ones (crossover and mutation operators).
    Evaluate the solutions in the population.
WHILE the termination condition is not satisfied
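The generic loop above can be sketched as a small runnable program. Everything specific here is an assumption made for illustration: one-point crossover, bit-flip mutation, an even population size, a non-negative fitness function, and a fixed generation cap as the termination condition.

```python
import random

def genetic_search(fitness, n_genes, pop_size=20, generations=30,
                   crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Generic bit-string genetic algorithm; fitness must return a value >= 0."""
    rng = random.Random(seed)
    # Initialize and evaluate the population.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    scores = [fitness(c) for c in population]
    best_score, best = max(zip(scores, population), key=lambda p: p[0])

    for _ in range(generations):
        # Selection: fitness-proportional (roulette wheel) sampling.
        weights = [s + 1e-9 for s in scores]
        parents = rng.choices(population, weights=weights, k=pop_size)

        # Crossover: swap the tails of each parent pair at a random cut point.
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_genes)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            children += [a, b]

        # Mutation: flip each bit with a small probability.
        for child in children:
            for i in range(n_genes):
                if rng.random() < mutation_rate:
                    child[i] ^= 1

        # Evaluate the new population and remember the best solution so far.
        population = children
        scores = [fitness(c) for c in population]
        gen_score, gen_best = max(zip(scores, population), key=lambda p: p[0])
        if gen_score > best_score:
            best_score, best = gen_score, gen_best[:]
    return best, best_score
```

Calling genetic_search(sum, 10) runs the loop on a toy problem whose fitness is simply the number of ones in the chromosome.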

The hybrid feature selection algorithm of the present invention is a careful combination of the correlation-based feature selection described above and a support vector machine. A support vector machine is used because it shows good performance in pattern recognition as well as in intrusion detection problems. The hybrid selection algorithm of the present invention is shown in FIG. 3.

FIG. 3 is a flowchart of the correlation-based hybrid feature selection algorithm of the present invention. The support vector machine is the classification algorithm used in the intrusion detection model. A genetic algorithm generates an initial population of feature subsets (see FIG. 2); that is, the 41 features are encoded for the genetic algorithm.

Each subset is then evaluated by the criterion of the correlation-based feature selection algorithm defined above (see Equations (1), (2), and (3)). The best subset is selected; for example, features 2, 3, 6, and 9 out of the 41 may form the best subset. The intrusion detection rate of this feature subset is obtained with the support vector machine and recorded as the best detection rate. This procedure is repeated until no better subset is selected in the next generation or the maximum number of generations of the genetic algorithm is reached.

As mentioned above, the genetic algorithm is used to generate feature subsets from the given feature set. The algorithm of the invention takes all features as input and outputs the optimal feature subset after evaluation by correlation-based feature selection and the support vector machine. Each chromosome represents a feature vector. A chromosome is 41 genes long, and each gene (bit) has a value of 1 or 0, indicating whether the corresponding feature is included in the feature vector or not.

As in all stochastic algorithms, the initial population of chromosomes is generated at random. The Merit of each chromosome (see Equation (1)) is computed by correlation-based feature selection. The chromosome with the highest Merit represents the best feature subset Sbest in the population. This subset is evaluated by the support vector machine classification algorithm, and the result is stored in θbest, which holds the evaluation metric. The intrusion detection rate is chosen as the metric here, although more complex measures, such as a combination of detection rate and false positive rate or a rule-based measure, could also be used.

Genetic operators such as selection, crossover, and mutation are then applied to create a new population of chromosomes. In each generation, the best chromosome, that is, the best feature subset, is compared with the previous best subset Sbest. If the new subset is better than the previous one, it becomes the best subset and is then evaluated by the support vector machine.

If the new detection rate is higher than the previous one, this value becomes θbest and the algorithm continues. If the new detection rate is not higher than the previous one, Sbest is taken as the optimal subset of features. The algorithm terminates when no better subset is found in the next generation or when the maximum number of generations is reached.
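The search loop described in the preceding paragraphs can be sketched as follows. This is one possible reading, not the WEKA-based implementation used in the experiments: the Merit function and the classifier evaluation are passed in as the hypothetical callables `merit` and `detection_rate`, the genetic operators (tournament selection, one-point crossover, bit-flip mutation) and all parameter values are assumptions, and the previous Sbest is kept when the detection rate does not improve.

```python
import random


def next_generation(population, merit, rng, p_mut=0.05):
    """Create a new population by selection, crossover, and mutation."""
    def pick():
        a, b = rng.sample(population, 2)  # binary tournament on Merit
        return a if merit(a) >= merit(b) else b

    children = []
    while len(children) < len(population):
        p1, p2 = pick(), pick()
        cut = rng.randrange(1, len(p1))  # one-point crossover
        child = p1[:cut] + p2[cut:]
        # flip each bit with probability p_mut
        children.append(tuple(bit ^ (rng.random() < p_mut) for bit in child))
    return children


def hybrid_select(n_bits, merit, detection_rate, pop_size=20, max_gen=50, seed=0):
    """Return (s_best, theta_best) following the loop described in the text."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n_bits))
                  for _ in range(pop_size)]
    s_best = max(population, key=merit)    # highest Merit in the initial population
    theta_best = detection_rate(s_best)    # evaluated by the classifier
    for _ in range(max_gen):
        population = next_generation(population, merit, rng)
        candidate = max(population, key=merit)
        if merit(candidate) <= merit(s_best):
            break                          # no better subset in this generation
        theta_new = detection_rate(candidate)
        if theta_new <= theta_best:
            break                          # detection rate did not improve
        s_best, theta_best = candidate, theta_new
    return s_best, theta_best
```

In this arrangement the inexpensive Merit computation filters candidates, so the costly classifier evaluation runs at most once per generation.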

The overall structure of the experiments and their results is shown in FIG. 1, as described above. To demonstrate the feasibility of the present invention, several experiments were performed on the KDD 1999 intrusion detection dataset. The procedure is divided into three parts: 1) feature selection, 2) training, and 3) testing. The experimental dataset, the experimental environment, and the experimental results are described below.

The labeled KDD Cup 1999 dataset was preprocessed to produce a two-class dataset: normal and attack. The dataset contains 494,021 instances in total, of which 97,278 (19.69%) are normal and 396,743 (80.31%) are attacks. The dataset contains 24 different types of intrusion, which are broadly divided into four groups: probe, DoS (denial of service), U2R, and R2L. So that the distribution of the dataset is preserved, 15 different datasets of 20,995 instances each were sampled from the original data by uniform random sampling. Each instance consists of 41 features, labeled x1, x2, x3, x4, and so on.
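The figures above can be checked, and the sampling step sketched, in a few lines. The instance counts come from the text; the function names, the seed, and the choice to draw each of the 15 samples independently (so that samples may overlap one another) are assumptions, since the description does not say how the samples relate to each other.

```python
import random

TOTAL, NORMAL, ATTACK = 494_021, 97_278, 396_743  # counts stated in the text
SAMPLE_SIZE, N_SAMPLES = 20_995, 15


def class_share(count, total=TOTAL):
    """Percentage of the dataset taken by one class, rounded to two decimals."""
    return round(100 * count / total, 2)


def draw_samples(total=TOTAL, size=SAMPLE_SIZE, n=N_SAMPLES, seed=0):
    """Draw n uniform random samples of instance indices, without replacement
    inside each sample."""
    rng = random.Random(seed)
    return [rng.sample(range(total), size) for _ in range(n)]


samples = draw_samples()
```

Because every instance is equally likely to be drawn, each sample keeps the class proportions of the full dataset in expectation.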

Only DoS-type attacks are used in the present invention, because the other attack types have very few instances and are therefore not suitable for these experiments.

All experiments were performed on a Linux (Fedora Core 3) machine with an Intel Pentium 4 2.0 GHz processor, 512 MB of RAM, and kernel version 2.6.9-1.667. The open-source WEKA library was used for the support vector machine and for the correlation-based feature selection algorithm. To implement the proposed algorithm, several classes of the WEKA library were modified, such as "weka.attributeSelection.GeneticSearch".

For feature selection, one dataset was selected at random from the 15 datasets, and the algorithm of the present invention was applied to it as described above. 10-fold cross-validation was applied to obtain a low generalization error and to determine the intrusion detection rate.
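The validation step in this paragraph, read here as 10-fold cross-validation, can be sketched as an index-splitting routine. The function name `k_fold_indices` and the seed are hypothetical, and the classifier that would be trained on each split is omitted.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation over n instances."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal folds
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# One split per fold for a sampled dataset of 20,995 instances.
splits = list(k_fold_indices(20_995))
```

Each instance is used for testing exactly once, and the detection rate would be averaged over the ten folds.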

The selected optimal subset achieves a detection rate of 99.56%. The indices of the selected features are x1, x6, x12, x14, x23, x24, x31, x32, x37, x40, and x41. The size of the feature vector is reduced from 43 to 12, while the classification rate remains above 99%, which is excellent performance.

For training and testing, 15 experiments were performed on different datasets, with the full feature set and with the selected features. Each dataset is divided into a training set and a testing set containing 15,740 and 5,255 instances, respectively. FIGS. 4 to 7 compare the different performance indicators.

FIG. 4 shows that, as expected, the model building time with the reduced feature set decreases dramatically, because the feature selection process removed about 70% of the total number of features. In addition, the testing time shown in FIG. 5 exceeds the model training time.

FIG. 4 plots model building time against dataset index, FIG. 5 plots model testing time against dataset index, FIG. 6 plots detection rate against dataset index, and FIG. 7 plots false positive rate against dataset index.

With the selected features the detection rate is lower than with the full feature set, but the reduction is very small, averaging 0.83% (see FIG. 6).

However, a significant performance gain is obtained in the reduction of the false positive rate, which averages 37.5% (see FIG. 7). All of the above experiments use a polynomial kernel with exponent 1 and c = 1, which are the default values for the support vector machine classifier in WEKA. No attempt was made to optimize the kernel or to tune the support vector machine parameters, because the main objective here is to investigate how feature selection reduces computational resources while keeping the detection rate and the false positive rate within a tolerable range.
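The two metrics compared in FIGS. 6 and 7 can be written out as a short sketch. The description does not define them, so the usual confusion-matrix definitions are assumed here, and the 37.5% figure is read as a relative reduction; the function names and the example rates are hypothetical and are not measured results.

```python
def detection_rate(tp, fn):
    """Share of attack instances that are flagged as attacks."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Share of normal instances that are wrongly flagged as attacks."""
    return fp / (fp + tn)


def relative_reduction(before, after):
    """Fractional drop of a rate when moving from full to selected features."""
    return (before - after) / before


# Hypothetical rates chosen only to show the arithmetic.
example = relative_reduction(before=0.08, after=0.05)
```

Under these definitions a relative reduction of 0.375 corresponds to the 37.5% average reported for the false positive rate.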

The trade-off between detection rate and false positive rate could be improved further by applying better kernel functions and by tuning the parameters of the classification algorithm. Nevertheless, feature selection enhances this trade-off.

Although the technical concept of the lightweight intrusion detection system through correlation-based hybrid feature selection of the present invention has been described with reference to the accompanying drawings, this description illustrates the preferred embodiment by way of example and does not limit the claims of the present invention. It is obvious that anyone with ordinary skill in the art can make various modifications and imitations without departing from the scope of the technical idea of the present invention.

As described above, the present invention provides a new approach to modeling a lightweight intrusion detection system through correlation-based hybrid feature selection. The present invention reduces the training time and the testing time by an order of magnitude at the cost of only a small reduction in detection rate. It also shows uniform detection rates and false positive rates across different datasets. Fast training and testing help to build a lightweight intrusion detection system and make the resulting model easier to maintain or modify.

Future work will examine the feasibility of implementing the technique in a real-time intrusion detection environment, as well as characterizing attack types such as DoS, probe, U2R, and R2L, in order to increase the capability and performance of intrusion detection systems.

Claims

Inspection data is preprocessed and the preprocessed data is divided into a training dataset and a testing dataset;

The training dataset is further divided into a feature selection dataset, which is processed through a correlation-based hybrid feature selection process that yields a set of selected features, a model building dataset used to build an intrusion detection model using the selected features, and a validation dataset;

Wherein said intrusion detection model is verified by a validation dataset and thereafter said intrusion detection model is tested by a testing dataset.

The method of claim 1,

A lightweight intrusion detection method through correlation-based hybrid feature selection on a computer, characterized in that the output of the correlation-based hybrid feature selection process is sent through an adder to a model building dataset with reduced features and a validation dataset with reduced features, respectively, and the model building dataset and the validation dataset are sent to a classification model through machine learning and validation.

The method of claim 1,

A lightweight intrusion detection method through correlation-based hybrid feature selection on a computer, characterized in that the testing dataset has its non-selected features cut off to yield a testing dataset with reduced features, which is then sent to the classification model and divided into the respective classes.

The method according to claim 1 or 2,

A lightweight intrusion detection method through correlation-based hybrid feature selection on a computer, characterized in that the hybrid feature selection algorithm is a combination of correlation-based feature selection and a support vector machine.

An initial population of feature subsets is generated using a genetic algorithm;

Each subset is evaluated with the correlation-based feature selection algorithm, the best subset is chosen, the intrusion detection rate of this feature subset is obtained with the support vector machine and set as θbest, and this operation is repeated until no better subset is selected in the next generation or the maximum number of generations of the genetic algorithm is reached;

As in all stochastic algorithms, the initial population of genes is generated randomly, the Merit of each gene is calculated by correlation-based feature selection, the gene with the highest Merit represents the best feature subset in the population, Sbest, and this subset is evaluated by the classification algorithm of the support vector machine, the resulting value being stored in θbest, which represents the evaluation metric;

Genetic operators such as selection, crossover, and mutation are applied to generate a new population of genes, the best gene (feature subset) in each generation is compared with the previous best subset Sbest, and if the new subset is better than the previous one it is assigned as the best subset, which is then evaluated by the support vector machine;

A lightweight intrusion detection method through correlation-based hybrid feature selection on a computer, characterized in that if the new detection rate is higher than the previous one, this value becomes θbest and the algorithm continues, if the new detection rate is not higher than the previous one, Sbest is taken as the optimal subset of features, and the algorithm terminates when no better subset is found in the next generation or when the maximum number of generations is reached.

The method of claim 5,

A lightweight intrusion detection method through correlation-based hybrid feature selection on a computer, characterized in that the 41 features are encoded using the genetic algorithm.

The method according to claim 5 or 6,

A lightweight intrusion detection method through correlation-based hybrid feature selection on a computer, characterized in that the genetic algorithm is used to generate subsets of features from a given feature set, takes all the features as input, and outputs the optimal subset of features after evaluation by the correlation-based feature selection and the support vector machine.

The method of claim 7, wherein

A lightweight intrusion detection method through correlation-based hybrid feature selection on a computer, characterized in that each gene represents a feature vector, a gene is 41 genetic factors (bits) long, and each bit has the value 1 or 0, indicating whether the corresponding feature is included in the feature vector or not.

The method of claim 5,

A lightweight intrusion detection method through correlation-based hybrid feature selection on a computer, characterized in that the intrusion detection rate is selected as the metric, although a more complex measure, such as a combination of the detection rate and the false positive rate or a rule-based measure, could be used.