KR20220170687A

KR20220170687A - Method, apparatus, computer-readable storage medium and computer program for detecting attack data

Info

Publication number: KR20220170687A
Application number: KR1020210081856A
Authority: KR
Inventors: 김이형; 김상수; 구성모; 신동규; 민병준; 유지훈
Original assignee: 국방과학연구소
Priority date: 2021-06-23
Filing date: 2021-06-23
Publication date: 2022-12-30

Abstract

An attack data detection method performed by an attack data detection device according to an embodiment includes the steps of: performing data preprocessing corresponding to the type of data input from the outside; calculating a loss value between the preprocessed data and the value output by passing the preprocessed data through a pre-trained deep learning model; and determining whether the data is attack data based on a comparison between the calculated loss value and a threshold value. The threshold value may be a percentage value of a plurality of loss values calculated in the training process of the pre-trained deep learning model.

Description

Attack data detection method, device, computer readable recording medium and computer program

The present invention relates to a method and apparatus for detecting attack data, a computer-readable recording medium, and a computer program.

Recently, with the development of information and communication technologies, network environments have expanded very rapidly, and cyber threats to those environments have increased as well.

Accordingly, most companies operate a Network-based Intrusion Detection System (NIDS) in preparation for cyber threats on the network, with the goal of reporting such threats to administrators when security attacks occur.

Existing network intrusion detection systems have mainly used signature-based detection, a form of misuse detection. Specifically, in signature-based detection, security experts define patterns for frequently used attacks in advance, and attacks are detected by comparing incoming traffic against those patterns.

However, signature-based detection cannot detect new attacks outside the known ones, such as an APT (Advanced Persistent Threat) attack, and detecting a new attack requires generating a signature for it, which costs time and money.

Accordingly, anomaly detection has been proposed as a method for detecting new attacks. Specifically, anomaly detection models normal behavior and uses the resulting model to detect anomalous behavior. It can therefore also detect unknown (zero-day) attacks.

Machine learning techniques have recently been introduced to perform anomaly detection. Machine learning has the advantage that a model can be built from data and prediction results can be inferred through that model. To distinguish normal from abnormal with machine learning, a binary classification method must be used.

However, much real-world data is imbalanced, meaning that samples are distributed unevenly across classes, because the minority class contains significantly fewer samples than the majority class.

Meanwhile, because network intrusion data accounts for only about 1% of all data, training a machine learning model on such imbalanced data can prevent the model from properly learning to detect attack data for network intrusion detection.

For example, when a supervised-learning-based machine learning model is trained on a training data set containing imbalanced data, its performance in classifying attack data for network intrusion detection may degrade. This is because the model's decision boundary is learned with a bias toward the majority class, which lowers the detection rate for the minority class.

Therefore, there is a need for a technology that can detect new attacks (attack data) regardless of the imbalanced data that frequently occurs in real network environments.

Korean Registered Patent Publication No. 10-2027389 (registered September 25, 2019)

The problem to be solved by the present invention is to provide a method and apparatus for detecting attack data, a computer-readable recording medium, and a computer program.

In addition, the problems to be solved by the present invention may include detecting new attack data through an auto-encoder-based deep learning model pre-trained using only normal data, so as to prevent the model's decision boundary from being learned with a bias toward the majority class.

However, the problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention belongs.

An attack data detection method according to an embodiment of the present invention is performed by an attack data detection device and includes the steps of: performing data preprocessing corresponding to the type of data input from the outside; calculating a loss value between the preprocessed data and the value output by passing the preprocessed data through a pre-trained deep learning model; and determining whether the data is attack data based on a comparison between the calculated loss value and a threshold value. The threshold value may be a percentage value of a plurality of loss values calculated in the training process of the pre-trained deep learning model.

In addition, the pre-trained deep learning model may have been trained so that features of normal data are extracted through a process of inputting the normal data into an auto-encoder, compressing it to a dimension lower than that of the normal data, and then restoring it to the dimension of the normal data to output a vector value.

In addition, the pre-trained deep learning model may have been trained so that the loss value between the normal data and the value output by passing the normal data through the deep learning model becomes small.

In addition, in the step of detecting whether attack data is present, the data may be identified as attack data when the calculated loss value exceeds the threshold value, and as normal data when the calculated loss value is less than or equal to the threshold value.

In addition, the step of performing the preprocessing may include: when the data input from the outside is first type data, encoding the first type data as integers and then performing one-hot encoding to output a vector value; and when the data input from the outside is second type data, performing min-max normalization on the second type data.

An attack data detection device according to an embodiment of the present invention includes: an input/output unit that receives data from the outside; a memory; and a processor electrically connected to the memory. The processor performs data preprocessing corresponding to the type of the data input from the outside, calculates a loss value between the preprocessed data and the value output by passing the preprocessed data through a pre-trained deep learning model, and determines whether the data is attack data based on a comparison between the calculated loss value and a preset threshold value. The preset threshold value may be a percentage value of a plurality of loss values calculated in the training process of the pre-trained deep learning model.

In addition, the processor may identify the data as attack data when the calculated loss value exceeds the threshold value, and as normal data when the calculated loss value is less than or equal to the threshold value.

In addition, when the data input from the outside is first type data, the processor may perform one-hot encoding on the first type data to output a vector value, and when the data input from the outside is second type data, the processor may perform min-max normalization on the second type data.

Because the attack data detection device according to an embodiment of the present invention uses a deep learning model pre-trained on normal data, it can detect the presence or absence of an attack more accurately than a deep learning model trained on imbalanced data.

In addition, the attack data detection device according to an embodiment of the present invention sets, as the threshold value, a percentage of the plurality of loss values (reconstruction loss values) calculated while training the auto-encoder-based deep learning model, and determines whether data is attack data based on that threshold value. It can therefore detect new attack data that differs from normal data.
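The thresholding described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that the "percentage value" of the training losses is read as a percentile (nearest-rank method), and the function names, the `pct` parameter, and the sample loss values are all hypothetical.

```python
import math


def percentile_threshold(train_losses, pct):
    """Return the training loss at the given percentile (nearest-rank method).

    Assumption: the "percentage value" of the training losses is interpreted
    as a percentile; `pct` (for example 90) is a hypothetical tuning parameter.
    """
    if not train_losses:
        raise ValueError("train_losses must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(train_losses)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def is_attack(loss, threshold):
    """Attack if the loss exceeds the threshold; normal if it is at or below it."""
    return loss > threshold


# Illustrative reconstruction losses recorded on normal training data.
train_losses = [0.010, 0.012, 0.015, 0.011, 0.030,
                0.013, 0.014, 0.016, 0.012, 0.018]
threshold = percentile_threshold(train_losses, 90)
```

A lower percentile flags more inputs as attacks (more false alarms on normal traffic), while a higher one is more permissive, so the percentile would be chosen by the operator.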

However, the effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present disclosure belongs.

FIG. 1 is a block diagram of an attack data detection device according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining the operation of an auto-encoder.
FIG. 3 is a graph visualizing normal data and attack data identified through a pre-trained deep learning model according to an embodiment of the present invention.
FIG. 4 is a graph showing the loss value distributions of normal data and attack data output by an attack data detection device according to an embodiment of the present invention.
FIG. 5 is a graph comparing the performance of a learning model using an auto-encoder with that of a learning model trained by supervised learning.
FIG. 6 is a diagram showing an anomaly detection confusion matrix for an attack data detection device according to an embodiment of the present invention.
FIG. 7 is an exemplary flowchart of a procedure of an attack data detection method according to an embodiment of the present invention.

Advantages and features of the present invention, and methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only to make the disclosure of the present invention complete and to fully inform those of ordinary skill in the art to which the present invention belongs of the scope of the invention, and the present invention is defined only by the scope of the claims.

In describing the embodiments of the present invention, detailed descriptions of known functions or configurations are omitted where they could unnecessarily obscure the subject matter of the present invention. The terms used below are defined in consideration of their functions in the embodiments of the present invention and may vary according to the intention or custom of a user or operator. Their definitions should therefore be based on the contents of this specification as a whole.

FIG. 1 is a block diagram of an attack data detection device according to an embodiment of the present invention.

The attack data detection device 100 according to an embodiment of the present invention is a device that can determine whether data is attack data using a deep learning model pre-trained through an auto-encoder-based one-class learning method.

Here, the one-class learning method learns only samples (data) of a specific class and corresponds to semi-supervised learning.

This one-class learning method can be readily applied in a network intrusion detection environment, where most samples are normal. Specifically, the attack data detection device 100 according to an embodiment of the present invention trains an auto-encoder model using only normal samples, and then detects attack behavior (whether data is attack data) based on the loss value (or reconstruction error) between the input data and the data that has passed through the auto-encoder model.

Referring to FIG. 1, the attack data detection device 100 according to an embodiment of the present invention may include an input/output unit 101, a communication unit 102, a memory 110, and/or a processor 120.

The input/output unit 101 may, for example, transfer commands or data input from a user or another external device to the other component(s) of the attack data detection device 100 according to an embodiment, or output commands or data received from the other component(s) of the attack data detection device 100 according to an aspect to the user or the external device.

In one embodiment, the input/output unit 101 may receive data from the outside. For example, the input/output unit 101 may receive data from a user, and may include a keyboard, a mouse, a touch pad, and the like.

Here, the data may be normal data or attack data. Normal data may be at least one of text, an image, and a video containing information that does not harm the network, and attack data may be at least one of text, an image, and a video containing information that can harm the network. For example, the attack data may be data of the DoS, Probe, U2R, or R2L type.

The communication unit 102 may support establishing a wired or wireless communication channel between the attack data detection device 100 and an external device, and communicating through the established channel.

The memory 110 may store various data used by at least one component of the attack data detection device 100 (the processor 120, the input/output unit 101, and/or the communication unit 102), for example, software (such as a program) and input or output data for commands related to it. The memory 110 may include volatile memory or non-volatile memory.

The processor 120 (also referred to as a control unit, control device, or control circuit) may control at least one other connected component of the attack data detection device 100 (for example, a hardware component such as the input/output unit 101, the communication unit 102, and/or the memory 110, or a software component), and may perform various data processing and computation.

In addition, the processor 120 may load commands or data received from at least one of the other components into volatile memory for processing, and may store various data in non-volatile memory.

To this end, the processor 120 may be implemented as a dedicated processor for performing the corresponding operations (for example, an embedded processor), or as a general-purpose processor (for example, a CPU, an application processor, or a microcontroller unit (MCU)) that can perform those operations by executing one or more software programs stored in a memory device.

More specifically, the processor 120 may perform data preprocessing corresponding to the type of data input from the outside through the input/output unit 101, calculate a loss value between the preprocessed data and the value output by passing the preprocessed data through the pre-trained deep learning model, and determine whether the data received through the input/output unit 101 is attack data based on a comparison between the calculated loss value and the threshold value.

The method by which the processor 120 performs preprocessing corresponding to the type of data that the input/output unit 101 receives from the outside is described in more detail below.

In one embodiment, when the data that the input/output unit 101 receives from the outside is first type data, the processor 120 may preprocess it by encoding the first type data as integers and then performing one-hot encoding to output a vector value.

In another embodiment, when the data that the input/output unit 101 receives from the outside is second type data, the processor 120 may preprocess it by performing min-max normalization on the second type data to output a value between 0 and 1.

Hereinafter, the method by which the processor 120 preprocesses data will be described. This preprocessing method may also be applied to the training data set of the pre-trained deep learning model; for example, the pre-trained deep learning model may have been trained using a preprocessed training data set.

When the input data includes preset features (for example, the difficulty and num_outbound_cmds features), the processor 120 may remove those features.

Thereafter, the processor 120 may convert the features of the data received from the input/output unit 101 to values between 0 and 1 through a data normalization process.

In one embodiment, when the data received from the input/output unit 101 is first type data, the processor 120 may encode the first type data as integers and then perform one-hot encoding to output a vector value.

Here, the first type data may be nominal-type data. When nominal-type data is encoded as integers and then one-hot encoded, the dimension of the data may increase.

In another embodiment, when the data received from the input/output unit 101 is second type data, the processor 120 may perform min-max normalization as shown in Equation 1 below.
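For reference, the conventional form of min-max normalization is sketched below. The symbol names are assumptions chosen for illustration, not taken from the source: $x$ is a raw feature value, $x_{\min}$ and $x_{\max}$ are the minimum and maximum of that feature, and $x'$ is the normalized value in $[0, 1]$.

```latex
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```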

When min-max normalization is performed on the second type data, the attribute values can be brought to a common scale without distorting the differences in their ranges. The second type data may be numeric-type data.

In addition, when the processor 120 receives third type data, it may skip data preprocessing. For example, the third type data may be binary-type data.
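The type-dependent preprocessing above can be sketched as follows. This is a minimal illustration under stated assumptions: the schema layout, the function names, and the feature names `protocol_type`, `duration`, and `logged_in` are hypothetical; only `num_outbound_cmds` is named in the text as a feature to remove.

```python
def one_hot(value, categories):
    """Integer-encode a nominal value, then expand it to a one-hot vector."""
    index = categories.index(value)  # integer encoding step
    return [1.0 if i == index else 0.0 for i in range(len(categories))]


def min_max(value, lo, hi):
    """Scale a numeric value using the feature's minimum and maximum."""
    if hi == lo:
        return 0.0  # constant feature: assumed convention to avoid dividing by zero
    return (value - lo) / (hi - lo)


def preprocess(record, schema):
    """Build a flat feature vector from a record using a per-feature schema."""
    vector = []
    for name, spec in schema.items():
        kind = spec["type"]
        if kind == "drop":        # preset features are removed
            continue
        if kind == "nominal":     # first type: integer encoding + one-hot
            vector.extend(one_hot(record[name], spec["categories"]))
        elif kind == "numeric":   # second type: min-max normalization
            vector.append(min_max(record[name], spec["min"], spec["max"]))
        elif kind == "binary":    # third type: passed through unchanged
            vector.append(float(record[name]))
        else:
            raise ValueError(f"unknown feature type: {kind}")
    return vector


# Hypothetical schema and record, for illustration only.
schema = {
    "protocol_type": {"type": "nominal", "categories": ["tcp", "udp", "icmp"]},
    "duration": {"type": "numeric", "min": 0.0, "max": 100.0},
    "logged_in": {"type": "binary"},
    "num_outbound_cmds": {"type": "drop"},
}
record = {"protocol_type": "udp", "duration": 25.0,
          "logged_in": 1, "num_outbound_cmds": 0}
features = preprocess(record, schema)
```

The single nominal feature expands into three positions, which shows how one-hot encoding increases the dimension of the data. The minimum and maximum in the schema would come from the training data set, so that the same scaling is applied at training and detection time.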

After the data that the input/output unit 101 receives from the outside has been preprocessed, the processor 120 may calculate a loss value between the preprocessed data and the value output by passing the preprocessed data through the pre-trained deep learning model, and may determine whether the data received by the input/output unit 101 is attack data based on a comparison between the calculated loss value and the threshold value.

Here, the pre-trained deep learning model may be a deep learning model trained on the basis of an auto-encoder. The auto-encoder is described in detail below with reference to FIG. 2.

FIG. 2 is a diagram for explaining the operation of an auto-encoder.

Referring to FIG. 2, the auto-encoder 10 is an unsupervised neural network that learns to make its output value approximate its input value.

The auto-encoder may include an encoder 11 and a decoder 13, and may have a structure in which the encoder and the decoder are symmetric about the code layer 12.

In more detail, the auto-encoder 10 may learn the features of the input data by passing the input data through the encoder and the decoder.

The encoder 11 may receive data and compress it to a dimension lower than that of the received data. Because the encoder 11 compresses with the goal of preserving as much of the information needed for reconstruction as possible, it can encode only the important core information.

The value output by the encoder 11 after compressing the input data can be calculated through Equation 2 below.

Here, the terms of Equation 2 are the output value of the encoder (the code), the input data value, the two parameters of the encoder, and the activation function.
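A conventional form of the encoder mapping that matches these definitions is sketched below. All symbol names are assumptions chosen for illustration, not taken from the source: $x$ is the input data, $h$ is the encoder output (code), $W$ and $b$ are the encoder parameters (weight matrix and bias vector), and $f$ is the activation function.

```latex
h = f(Wx + b)
```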

Here, the activation function may be a nonlinear function or a linear function. When a linear function is used, the auto-encoder may operate similarly to PCA (principal component analysis), a linear feature extraction technique.

The code layer 12 is the bottleneck where the encoder 11 and the decoder 13 overlap, and may output the result of mapping the input data into a low-dimensional latent space.

The decoder 13 may receive the output vector from the code layer and perform a process of reconstructing the data from it.

The value output by the decoder 13 after reconstructing the data can be calculated through Equation 3 below.

Here, the terms of Equation 3 are the output value of the decoder 13, the output value (code) of the encoder 11, the two parameters of the decoder 13, and the activation function.
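A conventional form of the decoder mapping that matches these definitions is sketched below. All symbol names are assumptions chosen for illustration, not taken from the source: $h$ is the encoder output (code), $\hat{x}$ is the decoder output, $W'$ and $b'$ are the decoder parameters (weight matrix and bias vector), and $g$ is the activation function.

```latex
\hat{x} = g(W'h + b')
```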

In more detail, the decoder 13 may take the value output from the encoder 11 and output a value reconstructed from it.

Meanwhile, the loss function of the auto-encoder 10 may be the mean squared error (MSE) between the input value of the encoder 11 and the output value of the decoder 13, and this loss function may be referred to as a loss value or reconstruction error.

The loss function of the auto-encoder 10 can be calculated through Equation 4 below.
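The conventional mean squared error between an input and its reconstruction is sketched below. The symbol names are assumptions chosen for illustration, not taken from the source: $x_i$ and $\hat{x}_i$ are the $i$-th components of the encoder input and the decoder output, $n$ is the input dimension, and $L$ is the loss.

```latex
L(x, \hat{x}) = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \hat{x}_i \right)^2
```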

Mean absolute error (MAE), cross-entropy error, and the like may also be used as the loss function of the auto-encoder 10, and the auto-encoder 10 may be trained to minimize the loss function (or loss value, or reconstruction error).
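The two simplest of these reconstruction losses can be sketched as follows. This is a minimal illustration using the standard definitions of MSE and MAE; the function names and sample vectors are hypothetical.

```python
def mse(x, x_hat):
    """Mean squared error between an input vector and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def mae(x, x_hat):
    """Mean absolute error between an input vector and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)


# Illustrative input and reconstruction.
x = [0.0, 1.0, 0.5, 0.25]
x_hat = [0.0, 0.5, 0.5, 0.75]
mse_loss = mse(x, x_hat)
mae_loss = mae(x, x_hat)
```

MSE penalizes large per-feature deviations more heavily than MAE, which can matter when a few features of an attack record are reconstructed very poorly.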

Meanwhile, a stacked auto-encoder is a model in which the auto-encoder 10 has a plurality of hidden layers. Because it has deeper layers than a single auto-encoder 10, it can learn data more complex than the data learned by the auto-encoder 10.

이때, 본 발명의 실시예에 따른 공격 데이터 탐지 장치(10)는 오토 인코더(10) 또는 적층 오토 인코더 기반으로 기 학습된 딥러닝 학습 모델을 이용하여 공격 데이터 유무를 확인할 수 있다.At this time, the attack data detection device 10 according to an embodiment of the present invention may check the presence or absence of attack data by using the auto-encoder 10 or a pre-learned deep learning model based on the stacked auto-encoder.

다시, 도 1을 참조하여 본 발명의 실시예에 따른 공격 데이터 탐지 장치(10)에서 공격 데이터 유무를 탐지하는데 이용하는 기 학습된 딥러닝 모델은, 정상 데이터(또는 정상 클래스(Normal-Class))와 공격 데이터(비정상 클래스 데이터)를 포함하는 전체 데이터 중, 정상 데이터만을 사용하여 정상 데이터에 대한 손실 함수가 최소화 되도록 학습될 수 있다. Again, referring to FIG. 1, the pre-learned deep learning model used to detect the presence or absence of attack data in the attack data detection device 10 according to an embodiment of the present invention is normal data (or normal-class) and Among all data including attack data (abnormal class data), only normal data may be used to learn a loss function for normal data to be minimized.

보다 상세히, 본 발명의 실시예에 따른 공격 데이터 탐지 장치(100)는 정상 데이터(또는 정상 클래스)를 오토 인코더에 입력하여 정상 데이터보다 낮은 차원으로 압축한 후, 정상 데이터의 차원으로 복원하여 벡터값을 출력하는 과정을 통해 정상 데이터의 특징이 추출되도록 기 학습되어 있을 수 있다.In more detail, the attack data detection apparatus 100 according to an embodiment of the present invention inputs normal data (or normal class) to an auto-encoder, compresses it to a dimension lower than normal data, and then restores it to the dimension of normal data to obtain a vector value. It may be pre-learned to extract features of normal data through the process of outputting .
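As a non-limiting illustration, the compress-then-restore flow through several hidden layers may be sketched as follows. The layer widths (8, 4, 2, 4, 8), the `tanh` activation, and the random untrained weights are assumptions of this sketch; it shows only how the dimensions shrink to the code and grow back, not a trained model.

```python
import math
import random


def dense(inputs, weights, bias):
    """One fully connected layer with a nonlinear (tanh) activation."""
    return [
        math.tanh(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, bias)
    ]


def init_layer(n_in, n_out, rng):
    """Create small random weights and zero biases for an untrained layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """Return the activations of every layer, starting with the input itself."""
    activations = [x]
    for weights, bias in layers:
        activations.append(dense(activations[-1], weights, bias))
    return activations


rng = random.Random(0)
sizes = [8, 4, 2, 4, 8]  # hypothetical widths: input, hidden, code, hidden, output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
sample = [0.1 * i for i in range(8)]
activations = forward(sample, layers)
code, reconstruction = activations[2], activations[-1]
```

In this sketch the 8-dimensional input is compressed to a 2-dimensional code and then restored to 8 dimensions; training would adjust the weights so that the restored vector approaches the input for normal data.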

Hereinafter, the normal data on which the pre-trained deep learning model is trained will be described.

The pre-trained deep learning model may have been trained using only the normal data, out of the normal data (Normal) and the abnormal data (or attack data, Attack) listed in Table 1 below.

Here, the abnormal data (or attack data, Attack) may include the DoS, Probe, U2R, and R2L types.

The pre-trained deep learning model may have been trained so that the loss value (or loss function, or reconstruction loss) between the normal data and the value output by passing the normal data through the deep learning model becomes small.

Therefore, because the pre-trained deep learning model has learned only the features of normal data (or the normal class) on an autoencoder basis, samples that were not used for training (for example, attack data) yield a high loss function (or loss value, or reconstruction loss) when passed through the model, so the loss function can be used to detect new types of attack data.

For example, if new data on which the pre-trained deep learning model was not trained is similar to the normal data used for its training, the loss value between the new data and the value output after the new data passes through the pre-trained deep learning model may be less than or equal to the threshold. In addition, if the new data is attack data containing abnormal behavior that is not at all similar to the normal data used for training, the loss value between the new data and the value output after the new data passes through the pre-trained deep learning model may exceed the threshold.

Therefore, the attack data detection device 100 according to an embodiment of the present invention may set, as the threshold, a percentage value of the plurality of loss values calculated in the training process of the pre-trained deep learning model, and may detect attack data based on the set threshold.

For example, the threshold is determined from a percentile value of the many loss values (or loss functions, or reconstruction losses) calculated on the training data set used for the pre-trained deep learning model, and the attack data detection device 100 according to an embodiment of the present invention can distinguish normal data from attack data based on the threshold.
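As a non-limiting illustration, a threshold may be derived from a percentile of the training loss values as in the following sketch. The text does not state which percentile is used, so the 90th percentile, the linear interpolation between ranks, and the loss values below are all assumptions of this sketch.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0 to 100) with linear interpolation between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical reconstruction losses collected on normal training samples.
train_losses = [0.010, 0.012, 0.015, 0.011, 0.013, 0.020, 0.014, 0.016, 0.018, 0.090]
threshold = percentile(train_losses, 90)  # about 0.027 for these values
```

A higher percentile would flag fewer normal samples as attacks at the cost of missing attacks whose loss is only slightly elevated, so the percentile acts as a tunable sensitivity parameter.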

Hereinafter, the threshold of the attack data detection device 100 according to an embodiment of the present invention will be described in detail with reference to FIG. 4.

For example, after the data that the input/output unit 101 receives from the outside is preprocessed, the processor 120 may pass the preprocessed data through the pre-trained deep learning model and, if the loss value between the output value and the preprocessed data exceeds the preset threshold, identify the data received from the outside as attack data. Likewise, if that loss value is less than or equal to the preset threshold, the processor 120 may identify the data received from the outside as normal data.

Here, the normal data may be at least one of text, an image, and a video containing information that does not harm the network, and the attack data may be at least one of text, an image, and a video containing information that can harm the network. For example, the attack data may be data of the DoS, Probe, U2R, or R2L type.

FIG. 3 is a graph visualizing normal data and attack data identified through the pre-trained deep learning model according to an embodiment of the present invention.

Referring to FIG. 3, normal data and attack data can be distinguished through the output values of the pre-trained deep learning model, but they are not linearly separable. Accordingly, normal data and attack data can be separated only by using a learning model capable of modeling nonlinear relationships from the normal data.

Therefore, the attack data detection device 100 according to an embodiment of the present invention performed training using a deep learning model capable of nonlinear separation and then detected the presence or absence of attack data using the pre-trained deep learning model. A nonlinear function was also used as the activation function of the autoencoder in the pre-trained deep learning model.

FIG. 4 is a graph showing the loss value distributions of normal data and attack data output through the attack data detection device according to an embodiment of the present invention.

In the loss value distribution graph of FIG. 4, the x-axis is the loss value and the y-axis is the density of the loss value.

Referring to FIG. 4, the distribution graph data 400 on the left is the loss value distribution of normal data (Normal), and the distribution graph data 410 on the right is the loss value distribution of attack data (Attack).

The loss value distribution of normal data and that of attack data are located apart from each other, so the boundary between the two distributions can be used as a threshold for classifying normal data and attack data.

For example, the boundary between the loss value distribution of normal data and the loss value distribution of attack data calculated in the training process of the deep learning model may be set as the threshold.

Therefore, the attack data detection device 100 according to an embodiment of the present invention may detect input data as normal data if the loss value between the input data and the value output by passing the input data through the pre-trained deep learning model is less than or equal to the threshold, and as attack data if the loss value exceeds the threshold.

Hereinafter, the attack data detection performance of the attack data detection device 100 according to an embodiment of the present invention will be described with reference to FIGS. 5 and 6.

First, to evaluate the performance of the attack data detection device 100 according to an embodiment of the present invention, performance was compared using a confusion matrix, and the performance of each model was compared and analyzed in terms of accuracy, precision, recall, and F1 score.

The confusion matrix of the attack data detection device 100 according to an embodiment of the present invention can be represented as shown in Table 2 below.

Among the performance measures analyzed using the confusion matrix, accuracy is the proportion of correctly predicted samples among all samples; the higher the accuracy, the better the deep learning model may be. Accuracy can be calculated through Equation 5 below.

[Equation 5]

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Here, $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives in the confusion matrix.

In more detail, accuracy refers to the proportion of normal data (or normal traffic) and attack data (or attack traffic) that is correctly predicted over the entire training data set. A deep learning model trained on a data set containing imbalanced data may have low accuracy.

Precision refers to the proportion of samples that actually belong to the positive class among the samples output as belonging to the positive class, and can be calculated through Equation 6 below.

[Equation 6]

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall refers to the proportion of samples output as belonging to the positive class among the samples that actually belong to the positive class, and can be calculated through Equation 7 below.

[Equation 7]

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

Because precision and recall are complementary evaluation metrics, forcibly raising one is likely to lower the other, so the F1 score can be checked as well. The F1 score is used for accurate evaluation on imbalanced data (or imbalanced classes) and is the harmonic mean of precision and recall.

The F1 score can be calculated through Equation 8 below.

[Equation 8]

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
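As a non-limiting illustration, the four metrics described above may be computed from confusion-matrix counts as in the following sketch. Treating attack data as the positive class, the function name `classification_metrics`, and the counts used below are assumptions of this sketch; the counts are not results of the embodiment.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 80 attacks caught, 20 false alarms, 10 attacks missed,
# and 90 normal samples correctly passed.
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
```

For these hypothetical counts the accuracy is 170/200 = 0.85, the precision is 80/100 = 0.8, the recall is 80/90, and the F1 score is 160/190.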

FIG. 5 is a graph comparing the performance of a learning model using an autoencoder with that of learning models trained by supervised learning.

Table 3 below shows the numerical values for the graph of FIG. 5.

Referring to FIG. 5 and Table 3, the learning models trained by supervised learning (Random Forest, Logistic Regression, Deep Neural Network, SVM, K-Neighbors) were trained on a training data set containing both normal data and attack data, whereas the autoencoder-based learning model (Stacked Auto Encoder) was trained on a training data set containing only normal data.

Comparing the two, the autoencoder-based learning model shows higher values than the supervised learning models on each performance metric (Accuracy, Precision, Recall, F1 Score), with a difference of up to about 10%.

Table 4 below shows the performance (Precision, Recall, F1 Score, Support) of the attack data detection device 100 according to an embodiment of the present invention and the detailed classification results for each data class (Normal, Attack).

Referring to Table 4, precision and recall are affected by the classification decision threshold and are in a trade-off relationship, so performance may degrade when one metric is excessively high. Nevertheless, the attack data detection device 100 according to an embodiment of the present invention shows high values for both precision and recall.

FIG. 6 is a diagram showing the anomaly detection confusion matrix for the attack data detection device according to an embodiment of the present invention.

Referring to FIG. 6, analyzing the training results of the attack data detection device 100 according to an embodiment of the present invention with a confusion matrix shows high values for both normal data (Normal) and attack data (Attack).

FIG. 7 is an exemplary flowchart of the procedure of an attack data detection method according to an embodiment of the present invention. The attack data detection method of FIG. 7 can be performed by the attack data detection device 100 shown in FIG. 1. The attack data detection method shown in FIG. 7 is merely exemplary.

Referring to FIG. 7, the processor 120 may perform data preprocessing corresponding to the type of data that the input/output unit 101 receives from the outside (step S10).
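As a non-limiting illustration of step S10, the claims describe integer encoding followed by one-hot encoding for first-type data and min-max normalization for second-type data, which may be sketched as follows. Treating first-type data as categorical values and second-type data as numeric values, and the field names `protocols` and `durations`, are assumptions of this sketch.

```python
def integer_encode(values):
    """Map each distinct category to an integer index (sorted for determinism)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[v] for v in values], categories


def one_hot(indices, num_categories):
    """Expand each integer index into a one-hot vector."""
    return [
        [1.0 if i == idx else 0.0 for i in range(num_categories)]
        for idx in indices
    ]


def min_max(values):
    """Scale numeric values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical categorical and numeric fields of four records.
protocols = ["tcp", "udp", "icmp", "tcp"]
durations = [0.0, 5.0, 10.0, 2.5]

indices, categories = integer_encode(protocols)  # categories: icmp, tcp, udp
encoded = one_hot(indices, len(categories))
scaled = min_max(durations)
```

In practice the category list and the minimum and maximum values would be fixed from the training data and reused at detection time, so that a given input is always mapped to the same vector.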

Thereafter, the processor 120 may pass the preprocessed data through the pre-trained deep learning model and calculate the loss value (or loss function, or reconstruction loss) between the output value and the preprocessed data (step S20).

Thereafter, the processor 120 may check whether the data received by the input/output unit 101 is attack data based on a comparison between the calculated loss value and the threshold (step S30).

For example, after the data that the input/output unit 101 receives from the outside is preprocessed, the processor 120 may pass the preprocessed data through the pre-trained deep learning model and, if the loss value between the output value and the preprocessed data exceeds the preset threshold, identify the data received from the outside as attack data.

In addition, if the loss value between the output value and the preprocessed data is less than or equal to the preset threshold, the processor 120 may identify the data received from the outside as normal data.
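As a non-limiting illustration, steps S20 and S30 may be sketched as follows. The function names, the use of the mean squared error as the loss value, the threshold of 0.05, and `toy_model` are assumptions of this sketch; `toy_model` is a stand-in that merely clamps each feature to the range [0, 1] and is not a trained autoencoder.

```python
def reconstruction_loss(x, x_hat):
    """Mean squared error between preprocessed data and the model output (S20)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def detect(sample, model, threshold):
    """Label a sample by comparing its loss value with the threshold (S30).

    A loss above the threshold yields "attack"; a loss less than or equal
    to the threshold yields "normal".
    """
    loss = reconstruction_loss(sample, model(sample))
    label = "attack" if loss > threshold else "normal"
    return label, loss


def toy_model(x):
    """Stand-in model: reproduces in-range features and clamps the rest."""
    return [min(max(v, 0.0), 1.0) for v in x]


threshold = 0.05  # hypothetical value
normal_label, normal_loss = detect([0.2, 0.8, 0.5], toy_model, threshold)
attack_label, attack_loss = detect([3.0, -2.0, 0.5], toy_model, threshold)
```

The in-range sample is reproduced exactly and is labeled normal, whereas the out-of-range sample is reconstructed poorly (loss of 8/3) and is labeled attack.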

As described above, because the attack data detection device according to an embodiment of the present invention uses a deep learning model pre-trained on normal data, it can detect the presence or absence of an attack more accurately than a deep learning model trained on imbalanced data.

Combinations of the blocks of the block diagrams and the steps of the flowcharts attached to the present invention may be performed by computer program instructions. Because these computer program instructions can be loaded onto an encoding processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, the instructions executed through the encoding processor of the computer or other programmable data processing equipment create means for performing the functions described in each block of the block diagram or each step of the flowchart. These computer program instructions can also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, so the instructions stored in the computer-usable or computer-readable memory can also produce an article of manufacture containing instruction means that perform the function described in each block of the block diagram or each step of the flowchart. The computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on the computer or other programmable data processing equipment to create a computer-executed process, and the instructions that run on the computer or other programmable data processing equipment can thereby provide steps for executing the functions described in each block of the block diagram and each step of the flowchart.

In addition, each block or each step may represent a module, segment, or portion of code that includes one or more executable instructions for executing the specified logical function(s). It should also be noted that, in some alternative embodiments, the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially concurrently, or may sometimes be performed in reverse order depending on the corresponding function.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Therefore, the embodiments disclosed in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of the present invention.

100: attack data detection device
101: input/output unit
102: communication unit
110: memory
120: processor

Claims

An attack data detection method performed by an attack data detection device, the method comprising:
performing data preprocessing corresponding to the type of data input from the outside;
passing the preprocessed data through a pre-trained deep learning model and calculating a loss value between the output value and the preprocessed data; and
checking whether the data is attack data based on a comparison between the calculated loss value and a threshold value,
wherein the threshold value is a percentage value of a plurality of loss values calculated in the training process of the pre-trained deep learning model.

The attack data detection method of claim 1,
wherein the pre-trained deep learning model is pre-trained to extract features of normal data through a process of inputting the normal data to an autoencoder, compressing it to a dimension lower than that of the normal data, and then restoring it to the dimension of the normal data to output a vector value.

The attack data detection method of claim 2,
wherein the pre-trained deep learning model is pre-trained so that the loss value between the normal data and the value output by passing the normal data through the deep learning model becomes small.

The attack data detection method of claim 1,
wherein checking whether the data is attack data comprises:
identifying the data as attack data when the calculated loss value exceeds the threshold value; and
identifying the data as normal data when the calculated loss value is less than the threshold value.

The attack data detection method of claim 1,
wherein performing the data preprocessing comprises:
when the data input from the outside is first-type data, encoding the first-type data into an integer type and then performing one-hot encoding to output a vector value; and
when the data input from the outside is second-type data, performing min-max normalization on the second-type data.

An attack data detection device comprising:
an input/output unit that receives data from the outside;
a memory; and
a processor electrically connected to the memory,
wherein the processor performs data preprocessing corresponding to the type of the data input from the outside, passes the preprocessed data through a pre-trained deep learning model, calculates a loss value between the output value and the preprocessed data, and checks whether the data is attack data based on a comparison between the calculated loss value and a preset threshold value, and
wherein the preset threshold value is a percentage value of a plurality of loss values calculated in the training process of the pre-trained deep learning model.

The attack data detection device of claim 6,
wherein the pre-trained deep learning model is pre-trained to extract features of normal data through a process of inputting the normal data to an autoencoder, compressing it to a dimension lower than that of the normal data, and then restoring it to the dimension of the normal data to output a vector value.

The attack data detection device of claim 7,
wherein the pre-trained deep learning model is pre-trained so that the loss value between the normal data and the value output by passing the normal data through the deep learning model becomes small.

The attack data detection device of claim 6,
wherein the processor:
identifies the data as attack data when the calculated loss value exceeds the threshold value; and
identifies the data as normal data when the calculated loss value is less than the threshold value.

The attack data detection device of claim 6,
wherein the processor:
when the data input from the outside is first-type data, encodes the first-type data into an integer type and then performs one-hot encoding to output a vector value; and
when the data input from the outside is second-type data, performs min-max normalization on the second-type data.

A computer-readable recording medium storing a computer program,
wherein the computer program, when executed by a processor, includes instructions for causing the processor to perform a method comprising:
performing data preprocessing corresponding to the type of data input from the outside;
passing the preprocessed data through a pre-trained deep learning model and calculating a loss value between the output value and the preprocessed data; and
checking whether the data is attack data based on a comparison between the calculated loss value and a threshold value,
wherein the threshold value is a percentage value of a plurality of loss values calculated in the training process of the pre-trained deep learning model.

A computer program stored on a computer-readable recording medium,
wherein the computer program, when executed by a processor, includes instructions for causing the processor to perform a method comprising:
performing data preprocessing corresponding to the type of data input from the outside;
passing the preprocessed data through a pre-trained deep learning model and calculating a loss value between the output value and the preprocessed data; and
checking whether the data is attack data based on a comparison between the calculated loss value and a threshold value,
wherein the threshold value is a percentage value of a plurality of loss values calculated in the training process of the pre-trained deep learning model.