KR20230087232A

KR20230087232A - Method and system for analyzing performance of machine learning algorithm by using lid-ds dataset

Info

Publication number: KR20230087232A
Application number: KR1020210175995A
Authority: KR
Inventors: 김진국; 박정찬; 신동일; 신동규; 박대경
Original assignee: 국방과학연구소; 세종대학교산학협력단
Priority date: 2021-12-09
Filing date: 2021-12-09
Publication date: 2023-06-16

Abstract

A method for analyzing the performance of machine learning algorithms according to the present invention is performed by an algorithm comparison system and may comprise: performing preprocessing on cyber attack data; dividing the preprocessed cyber attack data into training data and test data; and analyzing the performance of machine learning algorithms using the divided training data and test data.

Description

Method and system for analyzing machine learning algorithm performance using LID-DS data set

The present invention relates to a technique for analyzing the detection accuracy of cyber attacks, and more particularly, to a method and system for analyzing the performance of machine learning algorithms using the LID-DS data set.

Today, as information and communication technology develops rapidly, the importance of security in IT infrastructure is steadily increasing. At the same time, cyber attacks are becoming more sophisticated and more varied, as in advanced persistent threats (APTs).

Defending against increasingly sophisticated cyber attacks is critically important, yet IDS (Intrusion Detection System) technology is not advancing fast enough to fully detect rapidly mutating attacks. For this reason, data generated by intrusion detection systems is currently used, through HIDS data analysis, to defend against such attacks.

Intrusion detection systems fall into two types: network-based (NIDS) and host-based (HIDS). Unlike a network-based intrusion detection system, a host-based intrusion detection system faces the difficulty of having to monitor both the inside and the outside of the system as a whole.

As a result, research in this area is lacking, intrusion detection systems have insufficient defenses against new cyber attacks and insider attacks, and false alarms increase.

Host-based intrusion detection can be divided into two approaches: misuse detection and anomaly detection. Because misuse detection identifies cyber attacks from signatures, it is effective against known attacks but unsuitable for detecting new ones.

In contrast to misuse detection, anomaly detection treats every situation that falls outside the states defined as normal operation and behavior as anomalous. Anomaly detection is therefore suitable for detecting zero-day attacks, but it is difficult to apply to machine learning because it requires a large amount of data for distinguishing normal from abnormal behavior, and such data is often too scarce.

In signature-based detection, the IDS detects intrusions using a database of attack signatures, so its detection rate is relatively high. However, it cannot detect new cyber attacks because the database holds no data on them.

Recently, host-based intrusion detection research has turned to machine learning: anomaly-based detection looks for abnormal behavior in order to detect new attacks, and machine learning algorithms are used to detect the anomalies.

Korean Patent Registration No. 10-2292968 (Announcement date: 2021. 08. 25.)
Korean Patent Publication No. 10-2018-0085157 (Publication date: 2018. 07. 26.)

The present invention seeks to provide a method and system for analyzing the performance of machine learning algorithms using the LID-DS data set.

According to one aspect, the present invention may provide a method for analyzing the performance of machine learning algorithms, performed by an algorithm comparison system, the method comprising: performing preprocessing on cyber attack data; dividing the preprocessed cyber attack data into training data and test data; and analyzing the performance of machine learning algorithms using the divided training data and test data.

The cyber attack data may be the LID-DS (Leipzig Intrusion Detection-Data Set). The LID-DS data is classified according to a plurality of cyber attack methods, and the data for each cyber attack method is composed of a plurality of features. The preprocessing step may include deleting the argument feature and missing values from the LID-DS data, removing colons from the event_time feature, and applying LabelEncoder to event_direction and event_type.

The preprocessing step may include integrating the processes used in each cyber attack into one process, and labeling each process by applying LabelEncoder to the labels made up of the integrated processes.

The preprocessing step may include using min-max normalization to convert, for each feature of the LID-DS data, the minimum value to 0, the maximum value to 1, and all remaining values to values between 0 and 1.

The dividing step may include dividing the data into the training data and the test data to be used for the machine learning models at a preset ratio.

The analyzing step may include inputting the divided training data to each machine learning model, training each machine learning model on that training data, inputting the divided test data to each trained model, and deriving the intrusion detection accuracy on the test data using each trained model.

The analyzing step may include evaluating the performance of each trained machine learning model using precision, recall, F1 score, and error rate.

According to another aspect, the present invention may provide a computer-readable recording medium storing a computer program, wherein the computer program, when executed by a processor, performs: preprocessing cyber attack data; dividing the preprocessed cyber attack data into training data and test data; and analyzing the performance of machine learning algorithms using the divided training data and test data.

According to yet another aspect, the present invention may provide a computer program stored on a computer-readable recording medium, the computer program including instructions that, when executed by a processor, cause the processor to perform operations comprising: preprocessing cyber attack data; dividing the preprocessed cyber attack data into training data and test data; and analyzing the performance of machine learning algorithms using the divided training data and test data.

According to yet another aspect, the present invention may provide a machine learning algorithm performance analysis system comprising: a preprocessing unit that performs preprocessing on cyber attack data; a data division unit that divides the preprocessed cyber attack data into training data and test data; and a performance analysis unit that analyzes the performance of machine learning algorithms using the divided training data and test data.

The cyber attack data may be the LID-DS (Leipzig Intrusion Detection-Data Set). The LID-DS data is classified according to a plurality of cyber attack methods, and the data for each cyber attack method is composed of a plurality of features. The preprocessing unit may delete the argument feature and missing values from the LID-DS data, remove colons from the event_time feature, and apply LabelEncoder to event_direction and event_type.

The preprocessing unit may integrate the processes used in each cyber attack into one process, and label each process by applying LabelEncoder to the labels made up of the integrated processes.

The preprocessing unit may use min-max normalization to convert, for each feature of the LID-DS data, the minimum value to 0, the maximum value to 1, and all remaining values to values between 0 and 1.

The data division unit may divide the data into the training data and the test data to be used for the machine learning models at a preset ratio.

The performance analysis unit may input the divided training data to each machine learning model, train each machine learning model on that training data, input the divided test data to each trained model, and derive the intrusion detection accuracy on the test data using each trained model.

The performance analysis unit may evaluate the performance of each trained machine learning model using precision, recall, F1 score, and error rate.

According to an embodiment of the present invention, the performance of machine learning algorithms on the LID-DS data set can be evaluated effectively for each attack type.

FIG. 1 is a table for explaining a data set according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the attack simulation procedure of the LID-DS data set according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining the structure of the LID-DS data set performance comparison model according to an embodiment of the present invention.
FIG. 4 is a block diagram for explaining the configuration of an algorithm comparison system according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a method of comparing machine learning algorithms in the algorithm comparison system according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining the format of data stored in LID-DS according to an embodiment of the present invention.
FIG. 7 is a table for explaining data types and attack types according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only as examples, so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains can clearly understand its scope. The technical scope of the present invention should be defined by the claims.

In describing the present invention below, detailed descriptions of well-known functions or configurations are omitted where they could unnecessarily obscure the subject matter of the invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intentions or customs of users and operators. Their definitions should therefore be based on the technical idea described throughout this specification.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a table for explaining a data set according to an embodiment of the present invention.

In 2018, Leipzig University released the LID-DS data set for research on anomaly detection in host-based intrusion detection systems. LID-DS data associates sequences of consecutive system calls with attack types.

Unlike previously released data, the LID-DS data set covers a wider range of features of modern computer systems, along with cyber attack methods and scenarios, than other currently available data sets.

LID-DS data addresses the difficulty of applying machine learning to earlier data sets, which contained too little data. With it, machine learning methods can detect and block new anomalous behaviors more accurately, which in turn can reduce the false alarm rate, a known problem of intrusion detection systems.

The LID-DS data set contains a variety of data related to system calls, recorded for different software and different cyber attacks. As shown by way of example in the table of FIG. 1, the LID-DS data set is composed of cyber attack methods and several scenarios, and each scenario defines a process for generating and recording normal and abnormal data.

FIG. 2 is a diagram illustrating the attack simulation procedure of the LID-DS data set according to an embodiment of the present invention.

FIG. 2 shows the process of generating data using the cyber attack scenarios of FIG. 1. To record the resulting system call traces of the LID-DS data set, the attack target may be run inside Docker 10 container virtualization software, so that an initial state can be defined and the target can be returned to that state after each attack.

For recording, the LID-DS framework first starts the Docker container that hosts the attack target. Initialization tasks are then executed according to the scenario, and the simulation of normal behavior begins.

Next, the framework waits a short time before activating Sysdig, so that the startup effects of the target software are not recorded. When attack behavior is being recorded, the attack starts after a random delay. Once recording has run for the desired duration, the control script stops it. The simulation of normal behavior is then stopped, and the Docker container that was used is stopped and removed.

The ADFA data set contains only sequences of system call IDs and does not include modern cyber attack patterns, so it is not well suited to anomaly detection testing.

The embodiment uses LID-DS (Leipzig Intrusion Detection-Data Set) data, which includes the thread information, metadata, and buffer data missing from previously used data sets, and which addresses the problem that machine learning could not be applied for lack of data. A comparative study of machine learning algorithms on LID-DS data can indicate the direction in which host-based intrusion detection systems should develop.

FIG. 6 is a table showing the format of the LID-DS data files themselves. Unlike ADFA, LID-DS data may include system call arguments, return values, high-precision timestamps, the name of the corresponding process, and the contents of data buffers.

FIG. 3 is a diagram for explaining the structure of the LID-DS data set performance comparison model according to an embodiment of the present invention.

The algorithm comparison system may perform preprocessing on the LID-DS data set. The data in the LID-DS data set may be classified according to a plurality of cyber attack methods, and the data for each cyber attack method may be composed of a plurality of features.

FIG. 7 is a table for explaining data types and attack types. It shows that the LID-DS data set is classified according to 10 cyber attack methods and that the data for each method consists of 7 features. For use with the machine learning algorithms, the data may be organized in CSV format, excluding some data whose original size is large.

For all data, the algorithm comparison system may delete the argument feature and missing values, and remove the colons (:) from the event_time feature. LabelEncoder may be applied to event_direction and event_type.
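The preprocessing steps just described can be sketched as follows. This is a minimal, dependency-free illustration and not the patent's implementation: the column name `arguments`, the timestamp layout, the conversion of event_time to a number, and the sample rows are all assumptions. The helper `label_encode` mirrors the behavior of scikit-learn's `LabelEncoder`, which assigns integer codes to the sorted distinct values.

```python
ARGUMENT_COLUMN = "arguments"  # hypothetical name for the argument feature


def label_encode(values):
    """Map each distinct value to an integer code, in sorted order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values]


def preprocess(rows):
    cleaned = []
    for row in rows:
        # Delete the argument feature.
        row = {k: v for k, v in row.items() if k != ARGUMENT_COLUMN}
        # Delete rows that still contain a missing value.
        if any(v is None or v == "" for v in row.values()):
            continue
        # Remove colons from event_time; parsing to float is an assumption
        # made here so the feature can later be normalized.
        row["event_time"] = float(row["event_time"].replace(":", ""))
        cleaned.append(row)
    # Label-encode the two categorical features.
    for column in ("event_direction", "event_type"):
        codes = label_encode([r[column] for r in cleaned])
        for r, code in zip(cleaned, codes):
            r[column] = code
    return cleaned


# Made-up sample rows, for illustration only.
rows = [
    {"event_time": "12:30:01.5", "event_direction": ">", "event_type": "open", "arguments": "fd=3"},
    {"event_time": "12:30:02.0", "event_direction": "<", "event_type": "read", "arguments": "res=10"},
    {"event_time": "12:30:03.0", "event_direction": ">", "event_type": None, "arguments": ""},
]
processed = preprocess(rows)
```

The third sample row is dropped because its event_type is missing, and the remaining rows carry only numeric values.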

There are 16 process categories in total, and the number of processes differs for each attack method. Accordingly, the processes used in each cyber attack method may be integrated into one process. The resulting 10 process labels may then be encoded with LabelEncoder so that each process carries a label.

OneHotEncoder is not used here because event_type has 99 categories. One-hot encoding would risk the curse of dimensionality, so only LabelEncoder is used. For normalization, min-max normalization is used to address the large differences in scale among the features of the data: for each feature, the minimum value is mapped to 0, the maximum value to 1, and all other values to values between 0 and 1.
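Min-max normalization of a single feature can be sketched as below. The function name and the handling of a constant column (mapped to 0.0 to avoid division by zero) are assumptions made for this illustration.

```python
def min_max_normalize(column):
    """Rescale a feature column with x' = (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature has no spread; map every value to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


scaled = min_max_normalize([10, 20, 30])
```

After scaling, the smallest value of the feature is 0, the largest is 1, and every other value lies in between, so features with very different ranges become comparable.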

The algorithm comparison system may divide the preprocessed LID-DS data into training data and test data. The data for the 10 cyber attack methods may be merged into a single CSV file for training. The training data and test data for machine learning may be split with train_test_split at the commonly used ratio of 8:2.
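A shuffled split at a given ratio can be sketched as follows. This is a dependency-free stand-in for illustration only: the function name `split_train_test`, the seed, and the rounding of the test-set size are assumptions and may differ from how scikit-learn's `train_test_split` behaves.

```python
import random


def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them as test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(list(range(10)), test_ratio=0.2)
```

With `test_ratio=0.2` the ten sample rows are divided 8:2; passing `test_ratio=0.3` would give the 7:3 split.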

However, even when a model is trained on the training data and then applied to the test data, overfitting may be observed. Overfitting is the phenomenon in which a model fits its training data too closely, so that its prediction rate drops markedly. To guard against overfitting, the model may be evaluated by cross-validation, including on data split at other ratios.
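One common form of cross-validation is k-fold, sketched below. The source does not say which cross-validation scheme or how many folds were used, so the choice of k-fold, the function name, and `k=3` are assumptions for illustration.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering every sample once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(10, 3)
```

Each sample is used for testing exactly once, so a model that only memorizes its training fold shows up as a drop in the averaged score.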

As a result, the data set split at a particular ratio (for example, 7:3) may yield the highest model performance. This embodiment is described using an example in which the training data and test data are split at a ratio of 7:3.

The algorithm comparison system may analyze the performance of machine learning algorithms using the divided training data and test data. It may analyze six machine learning algorithms: Decision Tree, Naive Bayes, MLP, Logistic Regression, LSTM, and RNN. For example, an experiment may be performed to analyze their performance. The hyperparameters of the algorithms used in this experiment are shown in Table 1 below.

[Table 1]

As an example, cyber attack data from LID-DS that has been preprocessed as described above may be used. An experiment comparing intrusion detection accuracy on the LID-DS data set may be conducted using the data of the 10 Process categories labeled with LabelEncoder.

The performance of the trained models may be evaluated using precision, recall, and F1 score, because evaluating on accuracy alone is inadequate for this data. In addition, FAR (False Alarm Rate) and ERR (Error Rate) may be checked in order to examine the error rate, which is a key concern in security.
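The metrics named above can be computed from confusion-matrix counts as in the sketch below. The source does not give formulas for FAR and ERR, so the commonly used definitions FAR = FP / (FP + TN) and ERR = (FP + FN) / total are assumed here. The example is binary (1 = attack, 0 = normal) with made-up labels; a multi-class evaluation would typically average these values per class.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, FAR, and ERR for binary labels (1 = attack)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    far = fp / (fp + tn) if fp + tn else 0.0  # assumed definition of FAR
    err = (fp + fn) / len(pairs) if pairs else 0.0  # assumed definition of ERR
    return {"precision": precision, "recall": recall, "f1": f1, "far": far, "err": err}


# Made-up labels for illustration only.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
```

Reporting FAR alongside precision and recall makes false alarms on normal traffic visible even when overall accuracy looks high.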

As an example, the algorithms used in the experiment may be broadly divided into four that learn without considering the sequence of the data and two that learn with the sequence taken into account. This makes it possible to evaluate which machine learning algorithm best identifies the consecutive system calls associated with each attack type, which is the most distinctive aspect of the LID-DS data set.

According to the embodiment, because cyber attack data can be generated directly with LID-DS, a large amount of data is available, which resolves the earlier problem of being unable to carry out machine learning for lack of data.

FIG. 4 is a block diagram for explaining the configuration of an algorithm comparison system according to an embodiment, and FIG. 5 is a flowchart for explaining a method of comparing machine learning algorithms in the algorithm comparison system according to an embodiment.

The processor of the algorithm comparison system (100) may include a preprocessing unit (410), a data division unit (420), and a performance analysis unit (430). These components of the processor may be representations of the different functions that the processor performs according to control instructions provided by program code stored in the algorithm comparison system. The processor and its components may control the algorithm comparison system to perform steps 510 to 530 of the machine learning algorithm comparison method of FIG. 5. The processor and its components may be implemented to execute instructions according to the code of the operating system and the code of at least one program held in memory.

The processor may load into memory the program code stored in the program file for the machine learning algorithm comparison method. For example, when the program is executed in the algorithm comparison system, the processor may control the system, under the control of the operating system, to load the program code from the program file into memory.

The processor and the preprocessing unit (410), data division unit (420), and performance analysis unit (430) that it includes may each be a different functional representation of the processor, executing the corresponding part of the program code loaded into memory in order to carry out steps 510 to 530.

In step 510, the preprocessing unit (410) may perform preprocessing on the cyber attack data. The preprocessing unit (410) may delete the argument feature and missing values from the LID-DS data, remove colons from the event_time feature, and apply LabelEncoder to event_direction and event_type.

The preprocessing unit (410) may integrate the processes used in each cyber attack into one process, and label each process by applying LabelEncoder to the labels made up of the integrated processes.

The pre-processing unit 410 may use min-max normalization to convert, for each feature of the LID-DS data, the minimum value to 0, the maximum value to 1, and the remaining values to values between 0 and 1.
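
Min-max normalization rescales each feature value x to (x - min) / (max - min), where min and max are taken over that feature. A minimal sketch for one feature column follows; returning zeros for a constant column is an assumption, since the source does not say how that case is handled.

```python
def min_max_normalize(column):
    """Rescale a list of numbers so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: the formula would divide by zero (assumed handling).
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


scaled = min_max_normalize([10, 20, 40, 50])
# scaled == [0.0, 0.25, 0.75, 1.0]
```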

In step 520, the data division unit 420 may divide the preprocessed cyber attack data into training data and test data. The data division unit 420 may divide the data into the training data and the test data to be used for machine learning at a preset ratio.
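
The division in step 520 can be sketched as a shuffled split at a fixed ratio. The 0.8 ratio, the shuffling, and the fixed seed are assumptions chosen for illustration; the source only states that the ratio is preset.

```python
import random


def split_train_test(records, train_ratio=0.8, seed=0):
    """Shuffle the records and cut them into training and test portions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, test = split_train_test(range(10), train_ratio=0.8)
```

Every record ends up in exactly one of the two portions, so the test data is never seen during training.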

In step 530, the performance analysis unit 430 may analyze the performance of machine learning algorithms using the divided training data and test data. The performance analysis unit 430 may input the divided training data to each machine learning model, train each machine learning model using the divided training data, input the divided test data to each trained machine learning model, and derive the intrusion detection accuracy on the divided test data using each trained machine learning model.
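
The train-then-evaluate loop of step 530 can be sketched as below. The two classifiers are deliberately simple stand-ins written for this example; they are not the algorithms compared in the invention, and the toy feature vectors and labels are hypothetical.

```python
from collections import Counter


class MajorityClassifier:
    """Stand-in model: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestNeighborClassifier:
    """Stand-in model: predicts the label of the closest training sample."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        def nearest_label(x):
            distances = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in self.X_]
            return self.y_[distances.index(min(distances))]
        return [nearest_label(x) for x in X]


def compare_models(models, X_train, y_train, X_test, y_test):
    """Train every model on the same split and return its test accuracy."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        correct = sum(p == t for p, t in zip(predictions, y_test))
        results[name] = correct / len(y_test)
    return results


# Toy normalized features; label 1 = attack, 0 = normal (hypothetical data).
X_train = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]]
y_train = [0, 0, 0, 1, 1]
X_test = [[0.05, 0.05], [0.95, 0.9]]
y_test = [0, 1]
accuracy = compare_models(
    {"majority": MajorityClassifier(), "1-nn": NearestNeighborClassifier()},
    X_train, y_train, X_test, y_test,
)
```

Training and scoring every model on the identical split is what makes the resulting accuracies comparable.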

The performance analysis unit 430 may evaluate the performance of each trained machine learning model using precision, recall, F1 score, and error rate.
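
The four measures can be computed from the counts of true and false positives and negatives. The sketch below assumes a binary setting (attack versus normal) and assumes that the error rate is the fraction of misclassified samples, that is, 1 minus accuracy; the source does not define either point.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return precision, recall, F1 score and error rate for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    error_rate = (fp + fn) / len(pairs)  # assumed definition: 1 - accuracy
    return {"precision": precision, "recall": recall,
            "f1": f1, "error_rate": error_rate}


# Hypothetical labels: 3 true positives, 1 false negative, 1 false positive.
metrics = binary_metrics([1, 1, 1, 1, 0, 0, 0, 0],
                         [1, 1, 1, 0, 1, 0, 0, 0])
```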

The devices described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described as being used, but those of ordinary skill in the art will appreciate that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known to and usable by those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, those of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or if components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

410: pre-processing unit
420: data division unit
430: performance analysis unit

Claims

A machine learning algorithm performance analysis method performed by an algorithm comparison system, the method comprising:
performing preprocessing on cyber attack data;
dividing the preprocessed cyber attack data into training data and test data; and
analyzing the performance of machine learning algorithms using the divided training data and test data.

The method of claim 1,
wherein the cyber attack data is LID-DS (Leipzig Intrusion Detection-Data Set) data,
the LID-DS data is classified according to a plurality of cyber attack methods and includes a plurality of features for each cyber attack method, and
the performing of the preprocessing comprises deleting argument features and missing values from the LID-DS data, removing colons from the event_time feature, and applying LabelEncoder to event_direction and event_type.

The method of claim 2,
wherein the performing of the preprocessing comprises merging the processes used in each cyber attack into a single integrated process and, for the labels made up of the integrated processes, attaching a label to each process using LabelEncoder.

The method of claim 1,
wherein the performing of the preprocessing comprises converting, for each feature of the LID-DS data, the minimum value to 0, the maximum value to 1, and the remaining values to values between 0 and 1 using min-max normalization.

The method of claim 1,
wherein the dividing comprises dividing the data into the training data and the test data to be used in the machine learning models at a predetermined ratio.

The method of claim 1,
wherein the analyzing comprises inputting the divided training data to each machine learning model, training each machine learning model using the divided training data, inputting the divided test data to each trained machine learning model, and deriving the intrusion detection accuracy on the divided test data using each trained machine learning model.

The method of claim 6,
wherein the analyzing comprises evaluating the performance of each trained machine learning model using precision, recall, F1 score, and error rate.

A computer-readable recording medium storing a computer program,
the computer program comprising instructions that, when executed by a processor, cause the processor to perform operations comprising:
performing preprocessing on cyber attack data;
dividing the preprocessed cyber attack data into training data and test data; and
analyzing the performance of machine learning algorithms using the divided training data and test data.

A computer program stored on a computer-readable recording medium,
the computer program comprising instructions that, when executed by a processor, cause the processor to perform operations comprising:
performing preprocessing on cyber attack data;
dividing the preprocessed cyber attack data into training data and test data; and
analyzing the performance of machine learning algorithms using the divided training data and test data.

A machine learning algorithm performance analysis system, comprising:
a pre-processing unit that performs preprocessing on cyber attack data;
a data division unit that divides the preprocessed cyber attack data into training data and test data; and
a performance analysis unit that analyzes the performance of machine learning algorithms using the divided training data and test data.

The system of claim 10,
wherein the cyber attack data is LID-DS (Leipzig Intrusion Detection-Data Set) data,
the LID-DS data is classified according to a plurality of cyber attack methods and includes a plurality of features for each cyber attack method, and
the pre-processing unit deletes argument features and missing values from the LID-DS data, removes colons from the event_time feature, and applies LabelEncoder to event_direction and event_type.

The system of claim 11,
wherein the pre-processing unit merges the processes used in each cyber attack into a single integrated process and, for the labels made up of the integrated processes, attaches a label to each process using LabelEncoder.

The system of claim 10,
wherein the pre-processing unit converts, for each feature of the LID-DS data, the minimum value to 0, the maximum value to 1, and the remaining values to values between 0 and 1 using min-max normalization.

The system of claim 10,
wherein the data division unit divides the data into the training data and the test data to be used in the machine learning models at a predetermined ratio.

The system of claim 10,
wherein the performance analysis unit inputs the divided training data to each machine learning model, trains each machine learning model using the divided training data, inputs the divided test data to each trained machine learning model, and derives the intrusion detection accuracy on the divided test data using each trained machine learning model.

The system of claim 15,
wherein the performance analysis unit evaluates the performance of each trained machine learning model using precision, recall, F1 score, and error rate.