KR102053781B1 - Apparatus and method for extracting signiture

Info

Publication number: KR102053781B1
Application number: KR1020180038346A
Authority: KR
Inventors: 조학수
Original assignee: 주식회사 윈스
Priority date: 2018-04-02
Filing date: 2018-04-02
Publication date: 2020-01-22
Also published as: KR20190115369A

Abstract

The present invention relates to an apparatus and method for extracting, from a plurality of packets, a signature used for abnormal traffic detection. To this end, the signature extraction method of the present invention comprises: selecting two payloads from a payload list including a plurality of payloads; deriving the longest common substring (LCS: Longest Common Subsequence) of the two payloads; calculating the appearance frequency of the longest common substring (LCS) in the payload list; comparing the calculated appearance frequency with a predetermined minimum frequency value; and, if the comparison shows that the appearance frequency of the longest common substring (LCS) exceeds the minimum frequency value, determining the longest common substring (LCS) to be a signature.

Description

Signature Extraction Apparatus and Method {APPARATUS AND METHOD FOR EXTRACTING SIGNITURE}

The present invention relates to a signature extraction apparatus and method and, more particularly, to an apparatus and method for extracting, from a plurality of packets, a signature used for abnormal traffic detection.

With the spread of computers and the Internet, the number of Internet service providers (ISPs) is increasing. An ISP is a company that provides individuals and businesses with Internet access, website construction, and similar services, and ISPs operate extensive information and communication networks. As the number of ISPs grows, however, the associated side effects are also increasing sharply. In particular, worms, viruses, backdoors, ransomware, and the like are distributed in large volumes over the Internet through peer-to-peer (P2P) application services and e-mail, and their attack techniques are becoming more sophisticated and diverse. Because these attack techniques threaten the safety and reliability of information and communication networks, there is a growing need for a way to detect the signs of an attack in advance and respond to them.

Signature-based detection is one of the abnormal traffic detection techniques used in Internet response systems. A signature characterizes traffic: among the messages exchanged over the network to run a specific application protocol, it is a bit pattern found only in that protocol. Signature-based abnormal traffic detection finds, among the packets sent to a server, those that contain a signature and judges the detected packets to be abnormal. This method has a high detection rate, but it is limited to attack patterns that are already known. As a result, a long time passes between analyzing malicious traffic and applying the result to the detection module.

Accordingly, a technique is needed that can quickly extract signatures representing abnormal traffic from a high-speed, high-capacity network trunk.

Korean Laid-Open Patent No. 2013-0096033

An object of the present invention is to provide a signature extraction apparatus and method that can quickly extract, from a high-speed, high-capacity trunk, a signature used for abnormal traffic detection.

To solve the above problem, the signature extraction method of the present invention comprises: selecting two payloads from a payload list including a plurality of payloads; deriving the longest common substring (LCS: Longest Common Subsequence) of the two payloads; calculating the appearance frequency of the longest common substring (LCS) in the payload list; comparing the calculated appearance frequency with a predetermined minimum frequency value; and, if the comparison shows that the appearance frequency of the longest common substring (LCS) exceeds the minimum frequency value, determining the longest common substring (LCS) to be a signature.
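
The derive, calculate, compare, and determine steps can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names are hypothetical, payloads are modeled as plain strings, the LCS is taken to be the longest contiguous run shared by the two payloads (found with the standard-library `difflib.SequenceMatcher`), and the appearance frequency is taken as the fraction of payloads that contain the LCS, as the description specifies further on. The caller chooses the two payloads.

```python
from difflib import SequenceMatcher


def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous run of characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def signature_from_pair(a, b, payload_list, min_frequency):
    """Apply the derive / calculate / compare / determine steps to one pair.

    Hypothetical helper: returns the LCS if it qualifies as a signature,
    otherwise None.
    """
    lcs = longest_common_substring(a, b)
    if not lcs:
        return None  # the two payloads share nothing
    containing = sum(1 for payload in payload_list if lcs in payload)
    frequency = containing / len(payload_list)
    # The LCS becomes a signature only if it exceeds the minimum frequency.
    return lcs if frequency > min_frequency else None
```

With a payload list such as `["XMJABCD", "MABCDUEXX", "MZDDABCD", "DFHYNCE"]`, the first two payloads share "ABCD", which appears in three of the four payloads.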

The selecting of the two payloads may be a step of randomly selecting two payloads from among the plurality of payloads.

The selecting, deriving, calculating, comparing, and determining steps may be performed repeatedly on the payload list, in this order, for a predetermined number of iterations.
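
The repeated cycle might look like the following sketch, which assumes random pair selection and string payloads. The function name, the `seed` parameter, and the use of a set to collect results are illustrative assumptions, not details taken from the patent.

```python
import random
from difflib import SequenceMatcher


def extract_signatures(payload_list, min_frequency, iterations, seed=None):
    """Run the select / derive / calculate / compare / determine cycle
    a fixed number of times and collect every qualifying LCS."""
    rng = random.Random(seed)  # seed is only there to make runs repeatable
    signatures = set()
    for _ in range(iterations):
        a, b = rng.sample(payload_list, 2)  # two distinct payloads at random
        match = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
            0, len(a), 0, len(b))
        lcs = a[match.a:match.a + match.size]
        if not lcs:
            continue
        frequency = sum(lcs in p for p in payload_list) / len(payload_list)
        if frequency > min_frequency:
            signatures.add(lcs)
    return signatures
```

Because only `iterations` pairs are examined, the work grows with the chosen iteration count rather than with the number of possible payload pairs.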

The number of payloads used in deriving the longest common substring (LCS) may be less than or equal to the total number of payloads in the payload list.

The signature extraction method according to an embodiment of the present invention may further comprise adding a longest common substring (LCS) whose appearance frequency exceeds the minimum frequency value to a detection pattern list.

The signature extraction method according to an embodiment of the present invention may further comprise comparing the length of the longest common substring (LCS) with a preset minimum substring length, and the calculating of the appearance frequency of the longest common substring (LCS) may be performed when the length of the longest common substring (LCS) exceeds the preset minimum substring length.

The signature extraction method according to an embodiment of the present invention may further comprise, after deriving the longest common substring (LCS) of the two payloads, comparing the longest common substring (LCS) with the detection patterns in a pre-stored detection pattern list. The calculating of the appearance frequency of the longest common substring (LCS) in the payload list may be performed when the longest common substring (LCS) is not contained in any detection pattern in the list, or when no detection pattern in the list is contained in the longest common substring (LCS).
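
One way to read this duplicate check is sketched below. The function name is hypothetical, and treating a candidate as redundant when either containment holds is an assumption about how the two conditions combine; the text joins them with "or", so other readings are possible.

```python
def overlaps_known_pattern(candidate: str, detection_patterns) -> bool:
    """Return True if the candidate LCS is contained in a stored detection
    pattern, or a stored detection pattern is contained in the candidate.

    Assumed reading: the frequency calculation is skipped when this
    returns True.
    """
    return any(candidate in pattern or pattern in candidate
               for pattern in detection_patterns)
```

Under this reading, a candidate "BCD" would be skipped when "ABCD" is already stored, while "UEXX" would go on to the frequency calculation.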

The appearance frequency of the longest common substring (LCS) may be calculated by counting the payloads in the payload list that contain the longest common substring (LCS), and may be based on the value obtained by dividing that count by the total number of payloads in the payload list.

To solve the above problem, the signature extraction apparatus of the present invention comprises: an LCS derivation unit that selects two payloads from a payload list including a plurality of payloads and derives the longest common substring (LCS) of the two payloads; an appearance frequency calculation unit that calculates the appearance frequency of the longest common substring (LCS) in the payload list; and a determination unit that compares the appearance frequency calculated by the appearance frequency calculation unit with a minimum frequency value and, if the appearance frequency of the longest common substring (LCS) exceeds the minimum frequency value, determines the longest common substring (LCS) to be a signature.

The LCS derivation unit may randomly select the two payloads from among the plurality of payloads.

The LCS derivation unit, the appearance frequency calculation unit, and the determination unit may repeat their operations, in this order, for a predetermined number of iterations.

The number of payloads used to derive the longest common substring (LCS) through the appearance frequency calculation unit may be less than or equal to the total number of payloads in the payload list.

The signature extraction apparatus according to an embodiment of the present invention may further comprise a processing unit that adds a longest common substring (LCS) whose appearance frequency exceeds the minimum frequency value to a detection pattern list.

The LCS derivation unit may further compare the length of the longest common substring (LCS) with a preset minimum substring length, and the appearance frequency calculation unit may calculate the appearance frequency of the longest common substring (LCS) when its length exceeds the preset minimum substring length.

The signature extraction apparatus according to an embodiment of the present invention may further comprise a duplicate check unit that, after the longest common substring (LCS) of the two payloads is derived, compares the longest common substring (LCS) with the detection patterns in a pre-stored detection pattern list. The appearance frequency calculation unit may calculate the appearance frequency of the longest common substring (LCS) in the payload list when the longest common substring (LCS) is not contained in any detection pattern in the list, or when no detection pattern in the list is contained in the longest common substring (LCS).

The appearance frequency calculation unit may count the payloads in the payload list that contain the longest common substring (LCS) and calculate the appearance frequency of the longest common substring (LCS) based on the value obtained by dividing that count by the total number of payloads in the payload list.

The signature extraction apparatus and method according to an embodiment of the present invention can quickly detect new types of attack patterns (for example, signatures) by analyzing a plurality of packets with a small amount of computation. Because the computational load is small, the approach is more practical and can be applied to real network environments in which large numbers of packets are transmitted and received.

FIG. 1 is a conceptual diagram of an abnormal traffic detection system according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram of a signature extraction apparatus according to an embodiment of the present invention.
FIGS. 3A and 3B are conceptual diagrams illustrating the computational complexity of a typical signature extraction method.
FIG. 4 is a block diagram of a signature extraction apparatus according to a first embodiment of the present invention.
FIGS. 5 and 6 are conceptual diagrams illustrating how a signature is extracted by the signature extraction apparatus according to the first embodiment of the present invention.
FIG. 7 is a block diagram of a signature extraction apparatus according to a second embodiment of the present invention.
FIG. 8 is a flowchart of a signature extraction method according to the first embodiment of the present invention.
FIG. 9 is a flowchart of a signature extraction method according to the second embodiment of the present invention.

The present invention is described in detail below with reference to the accompanying drawings. Repeated descriptions, and detailed descriptions of well-known functions and configurations that could unnecessarily obscure the gist of the invention, are omitted. The embodiments are provided to explain the invention more completely to those of ordinary skill in the art. Accordingly, the shapes and sizes of elements in the drawings may be exaggerated for clarity.

FIG. 1 is a conceptual diagram of an abnormal traffic detection system 1000 according to an embodiment of the present invention. The abnormal traffic detection system 1000 includes a plurality of clients 1a, 1b, and 1c, a server 2, a security device 10, and a signature extraction apparatus 100. The signature extraction apparatus 100 may be mounted inside the security device 10, installed in it as software, or connected directly to the security device 10 as a separate device. FIG. 1 shows three clients, but this is only an example, and the number of clients may vary with the actual environment.

The security device 10 receives the packets transmitted from the clients 1a, 1b, and 1c and analyzes them to detect abnormal traffic. Using a detection pattern list stored in a storage unit (not shown), the security device 10 can analyze the packets (or the payloads contained in the packets) transmitted from the clients 1a, 1b, and 1c and detect abnormal traffic.

For example, assume that the first client 1a and the third client 1c are normal clients and the second client 1b is an attacker. By analyzing the received packets, the security device 10 can determine whether any packet corresponds to one of the attack patterns (hereinafter, signatures) stored in the detection pattern list. If no packet corresponds to a stored signature, the security device 10 forwards the received packets to the server 2. If a packet does correspond to a stored signature, the security device 10 blocks that packet and forwards only the remaining packets (for example, normal packets) to the server 2. The security device 10 can thus forward the packets sent by the first and third clients 1a and 1c to the server 2 while blocking the packets sent by the second client 1b.
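
The forwarding decision can be sketched as follows. This is an illustration only: the function name is hypothetical, payloads are modeled as strings, and "corresponds to a signature" is assumed to mean that the payload contains the signature as a substring.

```python
def filter_payloads(payloads, detection_patterns):
    """Split payloads into those forwarded to the server and those blocked.

    Assumption: a payload matches when it contains any stored signature.
    """
    forwarded, blocked = [], []
    for payload in payloads:
        if any(signature in payload for signature in detection_patterns):
            blocked.append(payload)
        else:
            forwarded.append(payload)
    return forwarded, blocked
```

A payload that carries a pattern missing from the list is forwarded, which is the limitation the following paragraphs discuss.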

As described above, signature-based abnormal traffic detection works by applying signatures of known attacks to packets, which gives it a high detection rate. However, attack methods and patterns keep diversifying, and because signature-based detection applies only to attacks that are already known, it has difficulty detecting this growing variety. For example, if the packets sent by the second client 1b carry a new attack pattern that is not yet known (for example, one that is not in the detection pattern list), the security device 10 will fail to block them and will forward them to the server 2.

Moreover, the most fundamental reason malicious traffic is hard to analyze is not the malicious traffic itself but the sheer volume of data on the network it travels through. Under these conditions, the presence of malicious traffic and its detection pattern must be identified quickly from the high-speed, high-capacity network trunk. Methods for extracting detection patterns of malicious traffic that rely on general computer science algorithms, however, take so long that they are not mature enough for practical use and are hard to apply in current environments.

The signature extraction apparatus 100 is intended to compensate for these shortcomings of signature-based abnormal traffic detection. It extracts signatures by analyzing packets received through the network (or from a high-speed, high-capacity network trunk). For example, the signature extraction apparatus 100 can detect a new attack pattern and extract a signature for it. A signature extracted by the signature extraction apparatus 100 may be stored in a storage unit of the signature extraction apparatus 100 or in a storage unit of the security device 10.

Thanks to the signature extraction apparatus 100, the security device 10 can detect a new attack pattern when one appears, which increases the detection rate for abnormal traffic. In addition, the detection pattern list can be updated each time a new attack pattern appears, further improving the performance of the security device 10. The signature extraction apparatus 100 also reduces the computational complexity of signature extraction, so it can be applied even in real network environments with very large data volumes.

FIG. 2 is a conceptual diagram of the signature extraction apparatus 100 according to an embodiment of the present invention. The example in FIG. 2 assumes that the signature extraction apparatus 100 analyzes packets transmitted from a single client 1. This is only an example; the signature extraction apparatus 100 can also analyze packets from a plurality of clients simultaneously, in parallel, or sequentially.

The signature extraction apparatus 100 receives a plurality of packets p transmitted by the client 1. It extracts some of the packets p, derives the longest common substring (LCS: Longest Common Subsequence) from the payloads contained in those packets, and uses the derived longest common substring (LCS) to extract a signature. As shown in dashed block A, a packet generally consists of a header, which contains the source address, destination address, and so on, and a payload, which contains the actual data (for example, a string). The signature extraction apparatus 100 extracts a signature using the payloads (or strings) of some of the packets p.

When extracting the longest common substring (LCS), the signature extraction apparatus 100 according to an embodiment of the present invention uses only some of the packets p rather than all of them. This reduces the time needed to extract a detection pattern for abnormal traffic and thereby makes the approach more practical.

Algorithms for extracting specific patterns from strings already exist, but applying them as they are to signature extraction has proved difficult.

First, the simplest algorithm for extracting a specific pattern from a string is the Earlybird algorithm. Earlybird hashes substrings and increments a counter each time a given substring is found; after all packets have been analyzed, it extracts the hash table entries with the highest counts. For the payloads in a high-speed, high-capacity network trunk, however, there is no clear criterion for which substrings to hash, and the number of substrings to hash is enormous, so the amount of computation grows.
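
The counting idea described here can be sketched in simplified form. This is not the published Earlybird implementation: the function name is hypothetical, a Python `Counter` keyed by the substring itself stands in for the hash table, and the fixed window length is an assumption made to keep the example small.

```python
from collections import Counter


def count_windows(payloads, window):
    """Count every fixed-length substring across all payloads.

    The Counter plays the role of the hash table; each occurrence of a
    substring increments its entry.
    """
    counts = Counter()
    for payload in payloads:
        for start in range(len(payload) - window + 1):
            counts[payload[start:start + window]] += 1
    return counts
```

Even three short payloads produce more than a dozen table entries at a single window length, which hints at the cost the text attributes to hashing substrings of trunk-scale traffic.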

The AutoSig algorithm, another algorithm for extracting specific patterns from strings, extracts every common substring that could be a signature and structures the extracted substrings to generate signatures. Because it extracts every candidate common substring, far too many strings enter the computation. For example, extracting the substrings of length 4 from a 20-character string yields 17 substrings (20 - 4 + 1), all of which are included in the computation. The number of extracted substrings is large and their range is wide, so the method suffers in both memory usage and processing time.

Finally, the PolyGraph algorithm and the LASER (LCS-based Application Signature ExtRaction) algorithm extract the longest common substring (LCS) after sampling. They compute the longest common substring (LCS) of two packets and then compute the longest common substring (LCS) of that result and a further computed value, which improves accuracy. Both algorithms repeatedly compare pairs of packets in a network packet trunk. If the number of packets to compare, the packet size, and similar parameters are not bounded, however, they require exponential time to solve the problem.

To address this, the LASER algorithm imposes a packet size constraint on its input data. As a result, LASER cannot be applied to packets that exceed a certain size (or it extracts only part of such packets), and when traffic that performs different functions has the same packet size, it may fail to generate a signature or may extract an incorrect one.

Furthermore, the algorithms described above must compute the longest common substring (LCS) over all packets, so their computational complexity is high, the computation takes a long time, and the system load is heavy.

FIGS. 3A and 3B are conceptual diagrams illustrating the computational complexity of a typical signature extraction method. As described above, a typical signature extraction method uses an exhaustive search that repeatedly compares pairs of packets in the network packet trunk until all packets have been covered.

Referring to FIG. 3A, packets p₁ to p₈ collected through a network, or from a network packet trunk, are shown. Each packet (p₁ to p₈) consists of a header and a payload (indicated by dashed block B in FIG. 3). For example, assume that the payload of the first packet (p₁) is "XMJABCD", that of the second packet (p₂) is "MJQWJAZD", that of the third packet (p₃) is "MABCDUEXX", that of the fourth packet (p₄) is "DFHYNCE", that of the fifth packet (p₅) is "MZDDABCD", that of the sixth packet (p₆) is "MYHRDDVUEXX", that of the seventh packet (p₇) is "VBVKRFOF", and that of the eighth packet (p₈) is "ABCDDVWD". In this example the payload strings are 7 to 11 characters long to keep the explanation simple. This is only an example, and the payload of each packet may contain any number of characters.

FIG. 3B illustrates the complexity of a typical exhaustive search. As shown in FIG. 3B, the exhaustive search compares the payloads of all packets, two packets at a time. In this example, the common substrings of the first packet (p₁) are "MJ", "JA", and "ABCD"; those of the second packet (p₂) are "MJ", "JA", and "ZD"; those of the third packet (p₃) are "ABCD" and "UEXX"; those of the fifth packet (p₅) are "JA", "DD", "ZD", and "ABCD"; those of the sixth packet (p₆) are "DD", "DDV", and "UEXX"; and those of the eighth packet (p₈) are "DD", "DDV", and "ABCD". The fourth packet (p₄) and the seventh packet (p₇) have no common substrings.

Under the general exhaustive search, as shown in FIGS. 3A and 3B, every common substring is derived by comparing all packets with one another, and the results are then analyzed to obtain the longest common substrings, "ABCD" and "UEXX".

Even though the example of FIGS. 3A and 3B assumes only eight packets, extracting the longest common substring (LCS) requires a considerable amount of computation. Specifically, with n packets the exhaustive search requires n² − n operations (56 in this example). When the number of packets is large, as in a real network environment (for example, hundreds to thousands), extracting the LCS requires far more computation (for example, millions to tens of millions of operations).
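The cost of the exhaustive search can be illustrated with a short Python sketch. It compares every ordered pair of the eight example payloads, counts the comparisons, and keeps the longest results. The helper names are hypothetical, and `difflib.SequenceMatcher` from the Python standard library is used only as a convenient stand-in for whatever comparison routine a real system would use.

```python
from difflib import SequenceMatcher
from itertools import permutations

# Example payloads p1..p8 from FIG. 3A.
PAYLOADS = [
    "XMJABCD", "MJQWJAZD", "MABCDUEXX", "DFHYNCE",
    "MZDDABCD", "MYHRDDVUEXX", "VBVKRFOF", "ABCDDVWD",
]


def longest_match(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def exhaustive_search(payloads):
    """Compare every ordered pair; return (comparison count, longest results)."""
    comparisons = 0
    found = set()
    for a, b in permutations(payloads, 2):  # n * (n - 1) ordered pairs
        comparisons += 1
        found.add(longest_match(a, b))
    longest = max(len(s) for s in found)
    return comparisons, sorted(s for s in found if len(s) == longest)


if __name__ == "__main__":
    print(exhaustive_search(PAYLOADS))
```

For eight payloads, n² − n evaluates to 56 comparisons, and the longest results are "ABCD" and "UEXX". The comparison count grows quadratically with the number of payloads.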

Besides the algorithms described above, pattern extraction can also use a suffix tree algorithm, which computes the longest common substring (LCS) of n strings. A suffix tree is a compressed tree whose keys are the non-empty suffixes of a string S and whose values are positions in the text. The labels along each path from the root to a terminal node spell out a suffix of the text, and every suffix has such a path. Because the tree represents the entire text when it is built, the complexity of the suffix tree algorithm is given as O(KN³) for a fairly large constant K. Here N is the total size of the payloads (packets) in the entire network capture file, so it is very difficult to apply the algorithm to a large capture file and find signatures quickly.

As described above, generally known techniques perform an exhaustive search over all packets when detecting a pattern in a plurality of packets, so their computational complexity is high. A network carrying live traffic handles an enormous amount of data, so applying these techniques to a network environment as they are would yield low throughput and overload the system. Moreover, most packets flowing through a network are sent by normal clients, and only a small portion are sent by malicious attacker clients. Applying the prior art unchanged in this situation would only overload the system with pattern extraction on normal packets, even when no malicious attack is taking place.

To solve the problems described above, the signature extraction apparatus 100 according to an embodiment of the present invention does not inspect all packets. Instead, it selects only some of the packets and extracts the longest common substring (LCS) from them.

In addition, when deriving the longest common substring (LCS), the signature extraction apparatus 100 according to an embodiment of the present invention uses the entire payload (for example, a string) of one packet and the entire payload of another packet, rather than only part of each payload. This increases the reliability of the LCS extraction. For example, an attack pattern may appear near the beginning of the payload in some packets and near the end in others, so using only part of the payload could cause the attack pattern to be missed. Accordingly, the signature extraction apparatus 100 performs the LCS extraction without filtering out any part of the payload.

In addition, the signature extraction apparatus 100 according to an embodiment of the present invention may use the appearance frequency of the extracted longest common substring (LCS) to determine whether it is a signature representing an attack pattern.

FIG. 4 is a block diagram of the signature extraction apparatus 100 according to a first embodiment of the present invention. The signature extraction apparatus 100 according to the first embodiment includes an LCS derivation unit 110, an appearance frequency calculation unit 130, a determination unit 140, and a processing unit 150. These units are divided by function to aid understanding of the present invention, and the signature extraction apparatus 100 may be implemented as a single control unit or as software. The signature extraction apparatus 100 may also be implemented on a processing device such as a single-core or multi-core CPU or MPU.

The LCS derivation unit 110 receives a plurality of packets and generates a payload list. The packets may have been transmitted from at least one client. For example, the LCS derivation unit 110 may generate the payload list by loading a trunk of packets into the system and extracting from each packet the payload (for example, a string) without the header.

The LCS derivation unit 110 selects two payloads from the payload list, and this selection may be made at random. As described above, the signature extraction apparatus 100 according to an embodiment aims to provide an attack pattern detection rate similar to that of general pattern extraction techniques with less computation. If the LCS derivation unit 110 selected packets only in a predetermined order, and that order were exposed to the outside, the attack pattern detection rate could be affected. The LCS derivation unit 110 therefore selects the two payloads at random.

In addition, if a selected payload is too short (for example, if the payload, or string, is shorter than a preset minimum string length), the LCS derivation unit 110 may select another payload instead.
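The random selection with a minimum payload length might be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and parameters are hypothetical, and filtering out short payloads before sampling is a simplification of "select another payload when one is too short".

```python
import random
from typing import Optional, Sequence, Tuple


def pick_two_payloads(
    payloads: Sequence[str],
    min_payload_len: int,
    rng: random.Random,
) -> Optional[Tuple[str, str]]:
    """Randomly pick two distinct list entries that meet the minimum length.

    Returns None when fewer than two payloads are long enough.
    """
    # Assumption: discarding short payloads up front is equivalent to
    # re-drawing whenever a short payload is selected.
    eligible = [p for p in payloads if len(p) >= min_payload_len]
    if len(eligible) < 2:
        return None
    first, second = rng.sample(eligible, 2)  # sampling without replacement
    return first, second


if __name__ == "__main__":
    payloads = ["XMJABCD", "MJQWJAZD", "MABCDUEXX", "DFHYNCE"]
    print(pick_two_payloads(payloads, 5, random.Random()))
```

Passing an explicit `random.Random` instance keeps the selection reproducible in tests. A deployment that worries about the selection order being predicted might prefer `random.SystemRandom`, which is also part of the standard library.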

The LCS derivation unit 110 then derives the longest common substring (LCS) of the two randomly selected payloads. The LCS may be derived by analyzing the two payloads to obtain at least one common substring and then finding the longest of those substrings.
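As a minimal sketch of this step, the Python function below derives the longest common contiguous substring of two payloads with a standard dynamic-programming table. The function name and the choice of dynamic programming are illustrative assumptions, since the description does not specify how the LCS is computed.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b.

    prev[j] holds the length of the common suffix of a[:i-1] and b[:j];
    the largest value seen marks the end of the longest common substring.
    """
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match in a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


if __name__ == "__main__":
    print(longest_common_substring("XMJABCD", "MABCDUEXX"))  # ABCD
```

The cost is proportional to the product of the two payload lengths and does not depend on how many payloads the list contains.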

In addition, the LCS derivation unit 110 may further compare the length of the longest common substring (LCS) with a preset minimum substring length. This keeps the appearance frequency calculation performed by the appearance frequency calculation unit 130, described below, to a minimum. For example, deriving the LCS in the LCS derivation unit 110 has a computational complexity of O(1) with respect to the number of payloads, because only two payloads are compared, whereas calculating the appearance frequency of the LCS has a complexity of O(N), where N is the total number of payloads in the payload list. It is therefore desirable to perform the appearance frequency calculation as rarely as possible.

Furthermore, if the longest common substring (LCS) is too short, it is difficult to tell whether it is an attack pattern. Using an overly short LCS (for example, shorter than 5 bytes) for signature-based abnormal traffic detection may also cause false positives, so such an LCS is preferably excluded. The LCS derivation unit 110 therefore compares the length of the LCS with the preset minimum substring length and discards the LCS if it is shorter than that minimum. The LCS derivation unit 110 may then select two other payloads and repeat the process described above.

The appearance frequency calculation unit 130 calculates the appearance frequency of the longest common substring (LCS) in the payload list. For example, it may count the payloads in the payload list that contain the LCS and divide that count by the total number of payloads in the list. This calculation can be expressed as Equation 1 below.

[Equation 1]

$$f_{LCS} = \frac{I_{LCS}}{N}$$

In Equation 1, f_LCS is the appearance frequency of the longest common substring (LCS), N is the total number of payloads in the payload list, and I_LCS is the number of payloads in the payload list that contain the LCS.
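The appearance frequency, defined as the number of payloads containing the LCS divided by the total number of payloads, can be sketched in Python as below. The function name is a hypothetical choice, and returning 0.0 for an empty list is an assumption made only to avoid division by zero.

```python
from typing import Sequence


def appearance_frequency(payloads: Sequence[str], lcs: str) -> float:
    """Fraction of payloads that contain lcs (I_LCS divided by N)."""
    if not payloads:
        return 0.0  # assumption: an empty list yields frequency 0
    containing = sum(1 for payload in payloads if lcs in payload)
    return containing / len(payloads)


if __name__ == "__main__":
    payloads = [
        "XMJABCD", "MJQWJAZD", "MABCDUEXX", "DFHYNCE",
        "MZDDABCD", "MYHRDDVUEXX", "VBVKRFOF", "ABCDDVWD",
    ]
    print(appearance_frequency(payloads, "ABCD"))  # 0.5
```

Each call scans the whole payload list once, which is why this step is the one worth avoiding for candidates that are too short to be useful.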

The determination unit 140 determines whether the longest common substring (LCS) derived by the LCS derivation unit 110 should be adopted as a signature. To do so, it uses the appearance frequency calculated by the appearance frequency calculation unit 130. For example, the determination unit 140 compares the appearance frequency of the LCS with a minimum frequency value and adopts the LCS as a signature if the appearance frequency exceeds the minimum frequency value. Otherwise, the determination unit 140 discards the LCS derived by the LCS derivation unit 110.

The processing unit 150 adds each longest common substring (LCS) that the determination unit 140 has adopted as a signature (for example, one whose appearance frequency exceeds the minimum frequency value) to a detection pattern list. The processing unit 150 may also store the detection pattern list in the storage unit 20, thereby updating the detection pattern list (for example, a signature list) held there. The security device 10 can then perform signature-based abnormal traffic detection using the detection pattern list that the processing unit 150 keeps up to date.

In addition, the signature extraction apparatus 100 according to an embodiment of the present invention may extract signatures by repeating the process described above on the payload list a preset number of times. For example, the LCS derivation unit 110 may repeatedly select payloads from the payload list for the predetermined number of repetitions, with the subsequent steps performed after each selection.
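Putting the steps together, the repeated procedure might look like the following Python sketch. It is a simplified illustration rather than the apparatus itself: all function and parameter names are hypothetical, short payloads are filtered before sampling, and `difflib.SequenceMatcher` from the standard library stands in for the LCS derivation.

```python
import random
from difflib import SequenceMatcher
from typing import List, Optional, Sequence


def lcs_of(a: str, b: str) -> str:
    """One longest contiguous substring shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def extract_signatures(
    payloads: Sequence[str],
    iterations: int,
    min_payload_len: int,
    min_lcs_len: int,
    min_freq: float,
    seed: Optional[int] = None,
) -> List[str]:
    """Repeat: pick two payloads, derive their LCS, keep it if frequent."""
    rng = random.Random(seed)
    eligible = [p for p in payloads if len(p) >= min_payload_len]
    signatures: List[str] = []
    if len(eligible) < 2:
        return signatures
    for _ in range(iterations):
        a, b = rng.sample(eligible, 2)
        candidate = lcs_of(a, b)
        if len(candidate) < min_lcs_len:
            continue  # too short: skip the O(N) frequency scan
        freq = sum(candidate in p for p in payloads) / len(payloads)
        if freq > min_freq and candidate not in signatures:
            signatures.append(candidate)
    return signatures


if __name__ == "__main__":
    example = [
        "XMJABCD", "MJQWJAZD", "MABCDUEXX", "DFHYNCE",
        "MZDDABCD", "MYHRDDVUEXX", "VBVKRFOF", "ABCDDVWD",
    ]
    print(extract_signatures(example, 200, 5, 3, 0.3, seed=1))
```

With the eight example payloads, only "ABCD" (frequency 0.5) can pass a minimum frequency of 0.3, because the other candidates of length 3 or more, "UEXX" and "DDV", each appear in only two of the eight payloads.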

Repetition, however, helps only up to a point. Up to a certain level, more repetitions detect more attack patterns (for example, signatures). Beyond that level (for example, beyond the number of repetitions described below), no further attack patterns can be detected. In other words, the number of discoverable signatures grows with the amount of repetition at first, but once the process has been repeated enough, further repetitions only detect attack patterns that duplicate those already found, which is inefficient.

Accordingly, the processing unit 150 may set the number of repetitions according to the number of payloads in the payload list and control the preceding steps to be repeated that many times. For example, the number of repetitions may be set so that the number of payloads used by the LCS derivation unit 110 to derive longest common substrings (LCS) is less than or equal to the total number of payloads in the payload list.
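One possible reading of this limit is sketched below in Python. It assumes that each repetition consumes two payloads, so the repetition count is capped at half the list size. Both the function name and this interpretation are assumptions, not something the description states explicitly.

```python
def max_repetitions(total_payloads: int, payloads_per_repetition: int = 2) -> int:
    """Largest repetition count whose payload usage does not exceed the list size."""
    if total_payloads < 0 or payloads_per_repetition <= 0:
        raise ValueError("counts must be non-negative and per-repetition use positive")
    return total_payloads // payloads_per_repetition


if __name__ == "__main__":
    print(max_repetitions(8))  # 4
```

Under this reading, the eight-payload example would be repeated at most four times.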

The description above assumes that the signature extraction apparatus 100 extracts the payload portion of each packet and performs the steps described above on the payloads. This is only an example. The signature extraction apparatus 100 may also extract signatures by performing the same steps directly on the packets that contain the payloads, without a separate payload extraction step.

The algorithm performed by the signature extraction apparatus 100 according to an embodiment of the present invention can be analyzed mathematically as follows. The analysis rests on the following four assumptions.

Assumption 1) The network capture file (for example, a pcap file) contains N packets.

Assumption 2) The goal of the algorithm is to find the I strings S_0, S_1, ..., S_{I-1} that each occur in at least a fraction δ of the N packets in the network capture file.

Assumption 3) Without loss of generality, the lengths of the strings are assumed to satisfy |S_0| < |S_1| < ... < |S_{I-1}|.

Assumption 4) For ease of calculation, any two target strings S_i and S_j (i ≠ j) are assumed to occur independently of each other in a given packet. That is, writing P(S_i) for the probability that a packet contains S_i,

$$P(S_i \cap S_j) = P(S_i)\,P(S_j).$$

Furthermore, for a string S_i (0 ≤ i ≤ I−1) to be found by the algorithm according to an embodiment of the present invention (that is, to be adopted as the LCS string), the following condition must be satisfied.

Packets containing the string S_i must be selected twice in a row, and the two selected packets must not both contain a string S_j with j > i. If they did, the longer common string S_j would be selected as the LCS string instead of S_i. This can be expressed as Equation 2 below.

In addition, by Assumption 4 above, Equation 2 can be rearranged into Equations 3 and 4 below.

Taking an optimistic approach and assuming that each string S_j longer than S_i has an appearance frequency of δ, Equation 4 has a lower bound given by Equation 5 below.

Taking a conservative approach and assuming that each string S_j longer than S_i has an appearance frequency of mδ, Equation 4 has an upper bound given by Equation 6 below.

Here, the case where the appearance frequency of a string S_j is much smaller than δ can be ignored, because the probability that S_j is adopted as the LCS string is given by the square of its appearance frequency. If the number of inputs used in the LCS search is increased to a, that probability is given by the a-th power of the appearance frequency.

Experiments under Assumptions 1 to 4 and the conditions above showed that the value of m (that is, the degree to which longer strings affect the discovery of shorter strings) had a relatively minor effect, while the number of iterations the algorithm needed varied with the appearance frequency of the string. For example, experiments were run for strings contained in about 10%, about 5%, and about 1% of the packets, and they confirmed that the higher the appearance frequency (or discovery frequency) of a string, the fewer iterations the algorithm needs. They also confirmed that even when the target appears in only a small fraction of the packets (for example, 1%), the load on the system is lower than when the longest common substring (LCS) is derived by exhaustive search.
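The relation between appearance frequency and iteration count can be illustrated with a small Python sketch. It uses only the statement that the per-iteration adoption probability is the appearance frequency raised to the number of inputs (two by default), and it deliberately ignores the effect of longer strings. Treating iterations as independent trials, so that the expected number of iterations is the reciprocal of that probability, is an added assumption, as are the function names.

```python
def adoption_probability(frequency: float, inputs: int = 2) -> float:
    """Probability that one iteration picks only payloads containing the string."""
    if not 0.0 < frequency <= 1.0:
        raise ValueError("frequency must be in (0, 1]")
    return frequency ** inputs


def expected_iterations(frequency: float, inputs: int = 2) -> float:
    """Mean number of independent iterations until the first adoption."""
    return 1.0 / adoption_probability(frequency, inputs)


if __name__ == "__main__":
    for freq in (0.10, 0.05, 0.01):
        print(f"frequency {freq:.2f}: about {expected_iterations(freq):,.0f} iterations")
```

Under these assumptions, strings present in 10%, 5%, and 1% of the packets need on the order of 100, 400, and 10,000 iterations respectively, which matches the observation that more frequent strings are found in fewer iterations.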

FIGS. 5 and 6 are conceptual diagrams illustrating a method of extracting a signature with the signature extraction apparatus 100 according to the first embodiment of the present invention. The method is described further below with reference to FIGS. 5 and 6. FIGS. 5 and 6 assume eight packets to aid understanding, but this is only an example and the number of packets may vary.

Referring to FIG. 5, packets p₁ to p₈, collected over a network or from a network packet trunk, are shown. Each packet (p₁ to p₈) consists of a header and a payload (indicated by the dashed block B in FIG. 5). For example, assume that the payload of the first packet (p₁) is "XMJABCD", that of the second packet (p₂) is "MJQWJAZD", that of the third packet (p₃) is "MABCDUEXX", that of the fourth packet (p₄) is "DFHYNCE", that of the fifth packet (p₅) is "MZDDABCD", that of the sixth packet (p₆) is "MYHRDDVUEXX", that of the seventh packet (p₇) is "VBVKRFOF", and that of the eighth packet (p₈) is "ABCDDVWD". In this example, the payload strings are 7 to 11 characters long for ease of explanation. This is only an example, and the payload of each packet may contain any number of characters.

The LCS derivation unit 110 generates a payload list containing the payloads of all the packets (p₁ to p₈). Then, as shown in FIG. 6, the LCS derivation unit 110 selects any two payloads from the list (or any two of the packets). The example of FIG. 6 assumes that the LCS derivation unit 110 has selected the first packet (p₁) and the third packet (p₃).

The LCS derivation unit 110 then checks the payload lengths of the selected packets (p₁, p₃). If a payload is shorter than the preset minimum string length, the LCS derivation unit 110 selects another packet. Otherwise, it proceeds to the next step. For example, if the minimum string length is set to 5, the payloads of the first packet (p₁) and the third packet (p₃) both exceed it, so the following steps can be performed. Conversely, if the minimum string length is set to 8, the payload of the first packet (p₁) is shorter than the minimum, so the LCS derivation unit 110 will select new packets. The description below assumes that the minimum string length is set to 5 and that the next step is performed.

The LCS derivation unit 110 derives the longest common substring (LCS) of the two payloads (or two packets) by analyzing the payloads of the two selected packets (p₁, p₃). In this example, the payload of the first packet (p₁) is "XMJABCD" and that of the third packet (p₃) is "MABCDUEXX", so the LCS derived by the LCS derivation unit 110 is "ABCD".

The LCS derivation unit 110 then compares the length of the derived longest common substring (LCS) with the preset minimum substring length. As described above, this comparison keeps the appearance frequency calculation described below to a minimum. For example, if the minimum substring length is set to 3, the LCS "ABCD" (length 4) is not shorter than the minimum, so the following steps are performed. Otherwise, the LCS derivation unit 110 discards the extracted LCS and repeats the process described above. The description below assumes that the minimum substring length is less than the length of the LCS.

The appearance frequency calculation unit 130 then calculates the appearance frequency, in the payload list, of the longest common substring (LCS) derived by the LCS derivation unit 110. To do so, it first determines the total number of payloads and the number of payloads that contain the LCS.

In the example shown in FIG. 6, the payload of the first packet (p₁) is "XMJABCD", that of the second packet (p₂) is "MJQWJAZD", that of the third packet (p₃) is "MABCDUEXX", that of the fourth packet (p₄) is "DFHYNCE", that of the fifth packet (p₅) is "MZDDABCD", that of the sixth packet (p₆) is "MYHRDDVUEXX", that of the seventh packet (p₇) is "VBVKRFOF", and that of the eighth packet (p₈) is "ABCDDVWD". The longest common substring (LCS) extracted by the LCS derivation unit 110, "ABCD", is therefore contained in the first packet (p₁), the third packet (p₃), the fifth packet (p₅), and the eighth packet (p₈).

The appearance frequency calculation unit 130 then calculates the appearance frequency by dividing the number of payloads that contain the longest common substring (LCS) by the total number of payloads. In this example, four of the eight payloads contain the LCS, so the appearance frequency is 0.5.

The determination unit 140 compares the appearance frequency with the preset minimum frequency value. If the appearance frequency exceeds the minimum frequency value, the longest common substring (LCS) described above, "ABCD", is adopted as a signature. Otherwise, it is discarded. For example, if the minimum frequency value is set to 0.7, the appearance frequency is below the minimum and "ABCD" is discarded. If the minimum frequency value is set to 0.3, "ABCD" may be determined to be an attack pattern.
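The comparison in this worked example can be sketched as a one-line Python predicate. The function name is hypothetical. The strict inequality reflects the wording that the appearance frequency must exceed the minimum frequency value, so a frequency exactly equal to the minimum is not adopted.

```python
def is_signature(appearance_freq: float, min_freq: float) -> bool:
    """Adopt the LCS only when its frequency strictly exceeds the minimum."""
    return appearance_freq > min_freq


if __name__ == "__main__":
    freq_abcd = 4 / 8  # "ABCD" appears in 4 of the 8 example payloads
    print(is_signature(freq_abcd, 0.7))  # False: discarded
    print(is_signature(freq_abcd, 0.3))  # True: adopted as a signature
```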

The minimum frequency value may also be set to different values depending on the state of the network. This prevents situations in which no common substring is found because there are too few packets or, conversely, in which many unreliable common substrings are found because there are too many packets. In a test that applied the algorithm according to an embodiment of the present invention to a sample capture file of 6 MB containing 8,201 packets, no common substring was found when the minimum frequency value was set to 0.1. This can be resolved by setting the minimum frequency value lower, or by configuring it to be lowered when the number of packets is small (below a certain number).

When the determination unit 140 determines the longest common substring (LCS) to be a signature, the processing unit 150 may add the determined longest common substring (LCS) to the detection pattern list and store the list in the storage unit.

In addition, as described above, the signature extraction apparatus 100 according to an embodiment of the present invention may repeat the above-described process a preset number of times. For example, when the number of repetitions is set to 3, the process starting from the arbitrary selection of two payloads by the LCS derivation unit 110, together with the subsequent processes, may be repeated three more times. Since the subsequent processes are substantially the same as those already described, redundant description is omitted.

In addition, in an experiment, the algorithm according to an embodiment of the present invention was run for one minute on a sample capture file with a file size of 131 MB and 93076 packets, and about 2,000 iterations were possible. Based on mathematical analysis, most strings with an appearance frequency (or occurrence frequency) of 0.03 or more could be found. In the algorithm performed by the signature extraction apparatus according to an embodiment of the present invention, the size of the packet file affects the threshold search time. This means that a larger file reduces the number of times the above-described algorithm (the process of finding the LCS string of two arbitrary packets) can be executed within a given time (one minute in this example).

In addition, the probability of detecting a string is closely tied to how frequently the string occurs. For example, if a string occurs in the entire set of packets with a frequency of about 0.1, the string can be found by performing the above-described processes about 120 to 150 times. If the frequency of a string is about 0.01, however, the string can be found by performing the above-described processes roughly 10200 to 11000 times.
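The text does not reproduce the mathematical analysis behind these iteration counts. One simple model that gives figures of the same order is sketched below; it is an assumption, not the patent's analysis. It supposes that a string occurring with frequency f is discovered in one iteration only when both randomly selected payloads contain it, which happens with probability of about f squared when the two draws are treated as independent. The function names are hypothetical.

```python
def expected_iterations(frequency: float) -> float:
    """Expected number of random pair selections until both payloads contain
    the string, assuming independent draws (probability frequency**2 per try)."""
    return 1.0 / (frequency * frequency)


def discovery_probability(frequency: float, iterations: int) -> float:
    """Probability of at least one successful pair within `iterations` tries."""
    miss = 1.0 - frequency * frequency
    return 1.0 - miss ** iterations


# Frequencies quoted in the text.
iters_for_010 = expected_iterations(0.1)    # on the order of one hundred
iters_for_001 = expected_iterations(0.01)   # on the order of ten thousand

# About 2,000 iterations per minute were reported for the 131 MB sample file.
p_003 = discovery_probability(0.03, 2000)
```

Under this model the expected counts are about 100 and about 10,000 iterations, the same order of magnitude as the 120 to 150 and 10200 to 11000 iterations quoted above, and a string with frequency 0.03 is found within 2,000 iterations with a probability of roughly 0.8.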

However, because program execution is controlled by the program run time rather than by the number of times the above-described processes are performed, the number of executions (that is, the number of repetitions) of the above-described processes is affected by the size of the packet file. Even for a considerably large file (for example, several hundred MB to several GB), if a particular string occurs very frequently, that string can be found quickly with high probability.
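A loop bounded by run time instead of by an iteration count can be sketched as below. The helper `run_for` and its parameters are hypothetical; `step` stands in for one pass of the select, derive, and count procedure, whose cost grows with the size of the packet file.

```python
import time
from typing import Callable


def run_for(budget_seconds: float, step: Callable[[], None]) -> int:
    """Call `step` repeatedly until the time budget is used up and
    return how many iterations fit into the budget."""
    deadline = time.monotonic() + budget_seconds
    iterations = 0
    while time.monotonic() < deadline:
        step()
        iterations += 1
    return iterations


# With a fixed budget, a slower step (for example, a larger payload list)
# simply yields fewer iterations.
count = run_for(0.05, lambda: None)
```

`time.monotonic()` is used so that wall-clock adjustments cannot shorten or extend the budget.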

FIG. 7 is a block diagram of a signature extraction apparatus 200 according to a second embodiment of the present invention. The signature extraction apparatus 200 according to the second embodiment includes an LCS derivation unit 210, a duplicate checker 220, an appearance frequency calculator 230, a determination unit 240, and a processing unit 250. Here, the LCS derivation unit 210, the duplicate checker 220, the appearance frequency calculator 230, the determination unit 240, and the processing unit 250 are divided by function to aid understanding of the present invention, and the signature extraction apparatus 200 may be implemented as a single controller or in software. The signature extraction apparatus 200 may also be implemented through a processing device, such as a CPU or an MPU, having a single core or multiple cores. The signature extraction apparatus 200 according to the second embodiment is substantially the same as the first embodiment, except that it further includes the duplicate checker 220, which performs a duplicate check function. The description below therefore focuses on the duplicate checker 220.

The duplicate checker 220 operates after the LCS derivation unit 210 has derived the longest common substring (LCS) of the two payloads. It checks whether the longest common substring (LCS) is a duplicate by comparing it with the detection patterns included in the detection pattern list. Specifically, the duplicate checker 220 checks whether the longest common substring (LCS) is a duplicate according to the similarity between the longest common substring (LCS) and the detection patterns included in the detection pattern list. To this end, the duplicate checker 220 may perform the duplicate check by determining whether the longest common substring (LCS) derived by the LCS derivation unit 210 is contained in a detection pattern (for example, a signature) in the detection pattern list. The duplicate checker 220 may also perform the duplicate check by determining whether a detection pattern (for example, a signature) in the detection pattern list is contained in the longest common substring (LCS) derived by the LCS derivation unit 210.

In general, the longest common substring (LCS) may be required to have at least a certain size (for example, 8 bytes or more), which means that inclusion relationships between the longest common substring (LCS) and other patterns do not occur frequently. As another embodiment, only one of the two duplicate checks performed by the duplicate checker 220 may be selectively performed, but performing both duplicate checks may be better in terms of overall efficiency.

As described above, the appearance frequency calculation performed by the appearance frequency calculator 230 has a complexity of O(N), which is higher than that of the other processes. Accordingly, to minimize the calculation performed by the appearance frequency calculator 230, the second embodiment checks not only the length of the longest common substring (LCS), as done by the LCS derivation unit 210, but also whether the longest common substring (LCS) is a duplicate. When the duplicate checker 220 determines that the longest common substring (LCS) is a duplicate, it discards the longest common substring (LCS) derived by the LCS derivation unit 210. In this case, the processes of the appearance frequency calculator 230 and the determination unit 240 are skipped, and the repetition count check by the processing unit 250 is performed. Conversely, if the longest common substring (LCS) is determined not to be a duplicate, the process of the appearance frequency calculator 230 is performed.
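The two-way containment test performed before the O(N) frequency calculation can be sketched as follows. The function name `is_duplicate` and the sample patterns are hypothetical, and plain byte-string containment is assumed as the similarity criterion.

```python
def is_duplicate(lcs: bytes, detection_patterns: list[bytes]) -> bool:
    """Return True when the candidate is contained in a stored pattern,
    or when a stored pattern is contained in the candidate."""
    for pattern in detection_patterns:
        if lcs in pattern or pattern in lcs:
            return True
    return False


# Hypothetical detection pattern list with one stored signature.
patterns = [b"ATTACK-ABCD-1234"]

shorter = is_duplicate(b"ABCD-1234", patterns)               # inside a stored pattern
longer = is_duplicate(b"XX-ATTACK-ABCD-1234-YY", patterns)   # contains a stored pattern
unrelated = is_duplicate(b"GET /index.html", patterns)       # no containment either way
```

A candidate flagged as a duplicate is dropped immediately, so the scan over all N payloads is only paid for candidates that are new.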

As such, the signature extraction apparatus according to the first and second embodiments of the present invention can increase throughput by reducing computational complexity while maintaining signature extraction accuracy. It therefore has the advantage of being applicable even in real network environments in which large numbers of packets are transmitted and received.

In other words, the signature extraction apparatus according to the embodiments of the present invention randomly selects two payloads and obtains their longest common substring (LCS). By placing a limit on the time or the number of iterations spent on extracting the longest common substring (LCS), this approach can control the time resources needed to obtain an approximate solution. By applying the length and appearance frequency criteria, the approximate solution achieves the same extraction effect as the exact solution while using fewer resources. Accordingly, the number of random searches for a given time can be reduced, which can maximize the throughput of signature extraction.

FIG. 8 is a flowchart of a signature extraction method according to the first embodiment of the present invention. The signature extraction method according to the first embodiment is now described with reference to FIG. 8. Matters that overlap with the description above are omitted.

Step S110 is a step in which the LCS derivation unit generates a payload list. For example, step S110 may be performed by loading a trunk of multiple packets into the system and extracting the payload (for example, a character string), excluding the header, from each packet to build the payload list.

Step S120 is a step in which the LCS derivation unit selects two payloads from the payload list, which includes a plurality of payloads. In step S120, the LCS derivation unit may select two payloads at random from among the plurality of payloads. As described with reference to FIG. 4, the signature extraction method according to an embodiment of the present invention does not operate on all payloads; it selects some of the payloads and derives their longest common substring (LCS). If the selection order in step S120 were predefined, this could affect the detection rate of attack patterns. Step S120 is therefore performed by selecting two payloads at random.

Step S130 is a step of comparing the lengths of the payloads selected in step S120 with a preset minimum string length. If step S130 determines that the length of each payload exceeds the preset minimum string length, step S140 is performed. Otherwise, step S120 is performed again to select another payload in place of the payload that is shorter than the minimum string length, and step S130 may then be performed again.
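Steps S120 and S130 can be sketched together as follows. As a simplification, and as an assumption rather than the flow in FIG. 8, this sketch filters out payloads that are too short before drawing, instead of drawing first and redrawing a replacement. The function name `select_two_payloads` and its parameters are hypothetical.

```python
import random
from typing import Optional


def select_two_payloads(
    payloads: list[bytes],
    min_length: int,
    rng: Optional[random.Random] = None,
) -> Optional[tuple[bytes, bytes]]:
    """Randomly pick two different list entries whose lengths exceed `min_length`.
    Returns None when fewer than two payloads qualify."""
    rng = rng or random.Random()
    eligible = [p for p in payloads if len(p) > min_length]
    if len(eligible) < 2:
        return None
    first, second = rng.sample(eligible, 2)  # sampling without replacement
    return first, second


# Hypothetical payload list; the 3-byte payload is too short to be selected.
payloads = [b"GET /index.html", b"abc", b"POST /login.php", b"HEAD /robots.txt"]
pair = select_two_payloads(payloads, min_length=8, rng=random.Random(1))
```

Passing a seeded `random.Random` makes a run reproducible; omitting it gives the unpredictable selection order that the text asks for.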

Step S140 is a step in which the LCS derivation unit derives the longest common substring (LCS) of the two payloads.
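Step S140 can be illustrated with a standard dynamic-programming routine. The text glosses LCS both as "longest common substring" and as "Longest Common Subsequence"; this sketch assumes the contiguous substring is meant, since the later frequency step counts payloads that contain the LCS. The function name is hypothetical, and the patent does not state which algorithm it uses.

```python
def longest_common_substring(a: bytes, b: bytes) -> bytes:
    """Longest contiguous run of bytes shared by `a` and `b`,
    using O(len(a) * len(b)) time and O(len(b)) memory."""
    best_len = 0
    best_end = 0  # exclusive end index of the best match inside `a`
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


# Hypothetical payloads sharing only the run b"ABCD".
lcs = longest_common_substring(b"xxABCDyy", b"zzABCDww")
```

The running time depends on the two payload lengths but not on the number of payloads N in the list.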

Step S150 is a step in which the LCS derivation unit compares the length of the longest common substring (LCS) with a preset minimum substring length. As described above, the computational complexity of deriving the longest common substring (LCS) in step S140 is O(1), whereas the computational complexity of the process in step S160 is O(N). Here, N denotes the total number of payloads included in the payload list. That is, step S160, which has a relatively high computational complexity among all the processes, should preferably be performed as rarely as possible. In addition, if the longest common substring (LCS) is too short, it is difficult to detect an attack pattern properly with it, so step S150 determines whether the length of the longest common substring (LCS) satisfies a preset condition. If step S150 determines that the length of the longest common substring (LCS) exceeds the minimum substring length, step S160 is performed. Otherwise, step S190 is performed.

Step S160 is performed when step S150 determines that the length of the longest common substring (LCS) exceeds the minimum substring length. In this step, the appearance frequency calculator calculates the appearance frequency of the longest common substring (LCS) in the payload list. For example, step S160 may be performed by counting the payloads in the payload list that contain the longest common substring (LCS) and dividing that count by the total number of payloads included in the payload list. Since step S160 has been described in detail above with reference to Equation 1, redundant description is omitted.

Step S170 is a step in which the determination unit compares the appearance frequency of the longest common substring (LCS) calculated in step S160 with the minimum frequency value. If step S170 determines that the appearance frequency exceeds the minimum frequency value, the longest common substring (LCS) is determined to be a signature in step S180. Step S185 may then be performed, in which the processing unit adds the longest common substring (LCS) whose appearance frequency exceeds the minimum frequency value to the detection pattern list.

If the appearance frequency of the longest common substring (LCS) calculated in step S160 is less than or equal to the minimum frequency value, step S190 is performed.

Step S190 is a step of checking whether steps S120 to S185 have been performed a predetermined number of times. Specifically, step S190 may be performed by comparing the number of times the above steps have been carried out with a predetermined number of repetitions. As described above, the number of repetitions is preset so that the number of payloads used in steps S120 to S140 is less than or equal to the total number of payloads included in the payload list. If the check shows that the above steps have been carried out the predetermined number of times, the signature extraction method ends. Otherwise, the above processes are performed again starting from step S120.

In addition, when step S190 finds that the predetermined number of repetitions has been reached, a process of storing the detection pattern list updated in step S185 in the storage unit may further be performed. Alternatively, the detection pattern list may be stored in the storage unit each time step S185 is performed.
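Steps S120 to S190 of the first embodiment can be put together in a compact sketch. This is an illustration under several assumptions, not the patented implementation: all names and parameters are hypothetical, the standard-library `difflib.SequenceMatcher` stands in for the LCS derivation, short payloads are filtered once up front instead of being redrawn, and exact repeats are skipped when appending to the list purely as a convenience.

```python
import random
from difflib import SequenceMatcher
from typing import Optional


def lcs_of(a: bytes, b: bytes) -> bytes:
    """Longest contiguous block shared by `a` and `b` (stand-in for S140)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def extract_signatures(
    payloads: list[bytes],
    min_payload_len: int,
    min_lcs_len: int,
    min_frequency: float,
    iterations: int,
    seed: Optional[int] = None,
) -> list[bytes]:
    rng = random.Random(seed)
    eligible = [p for p in payloads if len(p) > min_payload_len]  # S130 (pre-filter)
    patterns: list[bytes] = []
    if len(eligible) < 2:
        return patterns
    for _ in range(iterations):                                   # S190
        a, b = rng.sample(eligible, 2)                            # S120
        lcs = lcs_of(a, b)                                        # S140
        if len(lcs) <= min_lcs_len:                               # S150
            continue
        hits = sum(1 for p in payloads if lcs in p)               # S160, O(N)
        frequency = hits / len(payloads)
        if frequency > min_frequency and lcs not in patterns:     # S170, S180
            patterns.append(lcs)                                  # S185
    return patterns


# Hypothetical capture: three of four payloads share b"EVILPAYLOAD".
payloads = [
    b"aaaaEVILPAYLOADbbbb",
    b"ccccEVILPAYLOADdddd",
    b"eeeeEVILPAYLOADffff",
    b"gggghhhhiiiijjjj",
]
signatures = extract_signatures(
    payloads, min_payload_len=8, min_lcs_len=8,
    min_frequency=0.5, iterations=50, seed=0,
)
```

The length check in S150 runs before the scan over all payloads, so short or empty candidates never reach the O(N) frequency step.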

FIG. 9 is a flowchart of a signature extraction method according to the second embodiment of the present invention. The signature extraction method according to the second embodiment is substantially the same as that of the first embodiment, except that it additionally performs a duplicate check. The description below therefore focuses on step S255, in which the duplicate check is performed.

Step S255 is performed before step S260. In this step, the duplicate checker determines whether the longest common substring (LCS) is a duplicate by comparing the longest common substring (LCS) derived in step S240 with the detection patterns included in the detection pattern list stored in the storage unit. For example, step S255 may include checking whether the derived longest common substring (LCS) is contained in at least one of the detection patterns in the detection pattern list, and checking whether at least one of the detection patterns in the detection pattern list is contained in the longest common substring (LCS) derived in step S240.

The exemplary methods according to the present invention may be implemented in various forms, such as program instructions executed by a processor, software modules, microcode, computer program products recorded on a recording medium readable by a computer (including any device having an information processing function), applications, logic circuits, application-specific integrated circuits, or firmware. Examples of the computer-readable recording medium include, but are not limited to, ROM, RAM, CD, DVD, magnetic tape, hard disk, floppy disk, and optical data storage devices. The computer-readable recording medium may also be distributed over computer systems connected via a network, so that computer-readable code is stored and executed in a distributed manner.

The above description merely illustrates the present invention, and those of ordinary skill in the art to which the present invention pertains may make various modifications without departing from the technical spirit of the present invention.

100: signature extraction apparatus 110: LCS derivation unit
130: appearance frequency calculator 140: determination unit
150: processing unit 200: signature extraction apparatus
220: duplicate checker

Claims

A signature extraction method comprising:
selecting two payloads from a payload list comprising a plurality of payloads;
deriving a longest common substring (LCS) of the two payloads;
calculating a frequency of appearance of the longest common substring (LCS) in the payload list;
comparing the frequency of appearance calculated in the calculating step with a predetermined minimum frequency value; and
determining the longest common substring (LCS) as a signature if, as a result of the comparison, the frequency of appearance of the longest common substring (LCS) exceeds the minimum frequency value,
wherein calculating the frequency of appearance of the longest common substring (LCS) comprises counting the payloads in the payload list that contain the longest common substring (LCS), and the frequency of appearance is calculated based on a value obtained by dividing the number of payloads that contain the longest common substring (LCS) by the total number of payloads included in the payload list.

The method of claim 1,
wherein selecting the two payloads comprises randomly selecting two payloads from among the plurality of payloads.

The method of claim 1,
wherein the selecting, the deriving, the calculating, the comparing, and the determining are performed repeatedly, in this order, for a predetermined number of iterations.

The method of claim 3,
wherein the number of payloads used in deriving the longest common substring (LCS) is less than or equal to the total number of payloads included in the payload list.

The method of claim 1,
further comprising adding, to a detection pattern list, the longest common substring (LCS) whose frequency of appearance exceeds the minimum frequency value.

The method of claim 1,
further comprising comparing the length of the longest common substring (LCS) with a preset minimum substring length,
wherein calculating the frequency of appearance of the longest common substring (LCS) is performed when the length of the longest common substring (LCS) exceeds the preset minimum substring length.

The method of claim 1,
further comprising, after deriving the longest common substring (LCS) of the two payloads, comparing the longest common substring (LCS) with detection patterns included in a pre-stored detection pattern list,
wherein calculating the frequency of appearance of the longest common substring (LCS) in the payload list is performed when the longest common substring (LCS) is not contained in any of the detection patterns included in the detection pattern list and none of the detection patterns included in the detection pattern list is contained in the longest common substring (LCS).

delete

A signature extraction apparatus comprising:
an LCS derivation unit configured to select two payloads from a payload list including a plurality of payloads and to derive a longest common substring (LCS) of the two payloads;
an appearance frequency calculator configured to count the payloads in the payload list that contain the longest common substring (LCS) and to calculate an appearance frequency of the longest common substring (LCS) based on a value obtained by dividing the number of payloads that contain the longest common substring (LCS) by the total number of payloads included in the payload list; and
a determination unit configured to compare the appearance frequency calculated by the appearance frequency calculator with a minimum frequency value and, when the appearance frequency of the longest common substring (LCS) exceeds the minimum frequency value, to determine the longest common substring (LCS) as a signature.

As a signature extraction device,
A signature extraction device comprising a control unit for performing the method according to any one of claims 1 to 7.