KR101003502B1

KR101003502B1 - Signature String clustering Method Based on the Resemblance and Containment in the Sequence

Info

Publication number: KR101003502B1
Application number: KR1020070132808A
Authority: KR
Inventors: 박상길; 이성원; 문화신; 오진태; 장종수
Original assignee: 한국전자통신연구원
Priority date: 2007-12-17
Filing date: 2007-12-17
Publication date: 2010-12-30
Also published as: KR20090065317A

Abstract

The present invention relates to a method of generating signature strings based on the similarity and inclusion of character strings. So that the signature strings used in an intrusion detection system (IDS) can be generated automatically online, the invention clusters the strings of incoming data according to their similarity and inclusion and generates signature strings from the clusters, which makes signature string generation fast and efficient.

Signature, string, similarity, inclusion, clustering

Description

Signature String Clustering Method Based on the Resemblance and Containment in the Sequence

The present invention relates to entry control through signature clustering based on the similarity and inclusion of signatures used by intrusion detection systems in the field of information security.

The present invention is derived from research carried out as part of the IT Growth Engine Technology Development program of the Ministry of Information and Communication and the Institute for Information Technology Advancement [Project management number: 2006-S-042-02, Project title: Development of real-time attack signature generation and management technology for responding to zero-day attacks among network threats].

In conventional intrusion detection, signature clustering has been applied within automatic signature generation by means of hierarchical clustering. In this technique, the content of the signature to be generated is first tokenized, and rules corresponding to the subsequence and the conjunction of the tokens are then applied.

Given S flows, the technique used in the conventional Polygraph system first builds S clusters, each containing a single flow. Because this technique runs offline, it cannot process new signatures arriving online in real time. The clusters are then merged iteratively by hierarchical clustering; the clustering method used is greedy.

The present invention proposes a technique for running such a signature clustering algorithm online, in real time.

Neither intrusion detection systems such as the open-source SNORT, whose signatures are widely used in general intrusion detection, nor commercial systems take the inclusion and similarity of the applied signatures into account. As a result, in severe cases more than 10,000 signatures may be generated for a single incident. The invention aims to provide a method of controlling the number of rules to which signatures are applied by determining the association and inclusion among these many signatures. Controlling the number of rules reduces the overhead of rule management in software and reduces the TCAM and SRAM capacity required by security systems delivered as hardware appliances.

A signature string generation method based on the similarity and inclusion of character strings according to the present invention comprises: tokenizing an incoming packet into content units of a preset arbitrary size in order to generate a signature from the packet; generating signature content as the set of the tokenized individual contents and generating abbreviated data for the signature content; comparing the abbreviated data with existing abbreviated data and, according to whether the abbreviated data is included in the existing abbreviated data or is similar to it by at least a predetermined criterion, classifying the signature content into a tree by adding it to an existing tree or inserting it into a new tree; and separately storing a record of the signature strings generated from the trees of signature content, and resetting the trees at a preset period.

In addition, a signature string generation method based on the similarity and inclusion of character strings according to the present invention comprises: placing signature content, tokenized from incoming data packets, in the tree that will form a signature string, according to the inclusion between the abbreviated data of the signature contents or their similarity at or above a predetermined criterion; extracting, for each tree, the sequence common to the one or more signature contents it contains; generating a signature string in multi-content form from the extracted sequence; and generating abbreviated data for the signature string, comparing it with the abbreviated data of existing signature strings, and, if it is not identical to any of them, registering the signature string as data for identifying harmful packets.

To cluster S documents, the conventional method initially creates S clusters and then, in a blocking state in which no new documents can be processed, analyzes the associations among the S clusters and iteratively merges the clusters that match, so that each cluster finally yields a single signature.

By contrast, each time a signature candidate arises, the present invention generates its abbreviated data and compares it with the existing abbreviated data, using similarity and inclusion as the criteria. If the similarity and inclusion criteria are satisfied, the candidate is clustered into the same tree; otherwise a new tree is created and the data is inserted there, which has the effect of assigning it to a logically different cluster.

Whereas the conventional method runs offline and cannot handle data that arrives after clustering, the present invention clusters data by similarity at the moment it is inserted. As a result, each tree holds only homogeneous data.

With the present invention, an automatic signature generation system operates online as a real-time processing system and clusters the generated signatures online.

Specific details of other embodiments are included in the detailed description and the drawings.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure is complete and so that a person of ordinary skill in the art to which the invention belongs is fully informed of its scope; the invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

For reference, the signature string generation method illustrated here does not limit the scope of the invention and is merely one embodiment. Unlike conventional methods, the method aims to reduce the overhead incurred in data processing and thereby allow signatures to be generated automatically online, and it may be linked with a signature-based harmful packet inspection system (IDS).

FIG. 1 is a flowchart of the signature string generation method based on the similarity and inclusion of character strings according to the present invention. The invention is outlined below with reference to FIG. 1.

The content of a signature is mostly ASCII code values or character strings. Rather than storing all of these values, the invention uses an object of abbreviated bits that can stand in for them. With this approach, similarity and inclusion can be determined, as with the Jaccard index (Jaccard similarity coefficient), from the abbreviated information alone, without holding the full character strings.

When the automatic signature generation system is connected to the network for which signatures are to be generated and data flows in (S100), each incoming packet is split into tokens (contents) of a preset arbitrary size, such as a character, a word, or a line (S110).
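The tokenization step (S110) can be sketched as follows. This is only an illustration: the function name `tokenize`, the `unit` and `size` parameters, and the extra fixed-width mode are assumptions, since the text only says that the token size is preset (character, word, line, and so on).

```python
import re


def tokenize(payload: bytes, unit: str = "word", size: int = 4) -> list[bytes]:
    """Split a packet payload into content tokens of a preset granularity.

    `unit` and `size` are hypothetical parameters chosen for this sketch.
    """
    if unit == "char":
        return [payload[i:i + 1] for i in range(len(payload))]
    if unit == "word":
        # Whitespace-delimited words; empty fragments are dropped.
        return [t for t in re.split(rb"\s+", payload) if t]
    if unit == "line":
        return [t for t in payload.splitlines() if t]
    if unit == "fixed":
        # Fixed-width chunks; the last chunk may be shorter.
        return [payload[i:i + size] for i in range(0, len(payload), size)]
    raise ValueError(f"unknown unit: {unit}")


if __name__ == "__main__":
    print(tokenize(b"GET /a HTTP/1.1\r\nHost: x\r\n", "word"))
```

Which granularity works best depends on the traffic; the text leaves that choice open.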

The separated tokens make up the signature content, which is a candidate for a signature string to be generated later (S120). A Rabin fingerprint is applied to the signature content to generate abbreviated data (S130). It is then checked whether the generated abbreviated data is included in previously generated abbreviated data (S140). If so, the hit count stored in the tree node for the signature content of the existing abbreviated data is incremented (S150).
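One way to picture the abbreviated data (S130) is as a set of per-token fingerprints, which is enough to evaluate set-based similarity and inclusion later. The sketch below is an assumption, not the patented encoding: it uses a Karp-Rabin style polynomial hash modulo a prime as a stand-in for a true Rabin fingerprint (which works over GF(2) polynomials), and `MOD`, `BASE`, `fingerprint`, and `abbreviate` are hypothetical names and constants.

```python
MOD = (1 << 61) - 1  # large prime modulus (assumed value)
BASE = 257           # polynomial base (assumed value)


def fingerprint(token: bytes) -> int:
    """Karp-Rabin style polynomial hash of one token."""
    h = 0
    for b in token:
        # b + 1 keeps leading zero bytes from hashing to the same value.
        h = (h * BASE + b + 1) % MOD
    return h


def abbreviate(tokens: list[bytes]) -> frozenset[int]:
    """Abbreviated data for signature content: the set of token fingerprints."""
    return frozenset(fingerprint(t) for t in tokens)


if __name__ == "__main__":
    print(sorted(abbreviate([b"GET", b"/cgi-bin", b"cmd.exe"])))
```

Storing fixed-width integers instead of the token strings is what keeps the comparison step cheap; the original tokens are still needed later to build the signature string itself.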

If the abbreviated data is not included but is found to be similar (S160), the signature content is added to the existing tree as a new node, and the hit count held at the last node of the existing tree is incremented (S170).

If the newly generated signature content is neither included in nor similar to the existing abbreviated data, a new tree is created (S180).
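The three-way decision of steps S140 to S180 can be sketched as below, assuming that abbreviated data is represented as a set of fingerprints. The function names, the action labels, and the threshold defaults are hypothetical; the text only speaks of a "predetermined criterion".

```python
def inclusion(a: frozenset, b: frozenset) -> float:
    """Fraction of a that also appears in b."""
    return len(a & b) / len(a) if a else 0.0


def similarity(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of a and b."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(new, existing, sim_threshold=0.5, incl_threshold=1.0):
    """Return (action, index of the matching existing entry or None).

    "hit"      -> included in an existing entry: only bump its hit count (S150)
    "add_node" -> similar enough: add to that entry's tree (S170)
    "new_tree" -> neither: start a new tree (S180)
    """
    best_idx, best_sim = None, 0.0
    for i, old in enumerate(existing):
        if inclusion(new, old) >= incl_threshold:
            return ("hit", i)
        s = similarity(new, old)
        if s >= sim_threshold and s > best_sim:
            best_idx, best_sim = i, s
    if best_idx is not None:
        return ("add_node", best_idx)
    return ("new_tree", None)


if __name__ == "__main__":
    known = [frozenset({1, 2, 3, 4}), frozenset({10, 11})]
    print(classify(frozenset({1, 2}), known))
    print(classify(frozenset({1, 2, 3, 5}), known))
    print(classify(frozenset({7, 8}), known))
```

Because the decision is made per arriving candidate, no blocking merge pass over all clusters is needed, which is the point of the online approach.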

The tree generation process is described in more detail below with reference to FIGS. 2 and 3.

As described above, each document must first be split into tokens in units such as characters, words, or lines. A tokenizer is configured for this purpose so that each document is treated as a sequence of tokens in a regular pattern. To apply this to signatures, each token is mapped to a single content, as in (210 to 217) of FIG. 2, and the whole signature, a multi-content, is mapped to the document, as in (200) of FIG. 2.

In this way, as shown in FIG. 2, a single signature (200) can be formed as a multi-content, that is, a set of subsequences that are the individual contents (210 to 217). The invention builds clusters using the similarity, inclusion, and distance values between contents.

In the prior art, clustering was first performed for every content and bottom-up hierarchical clustering was then carried out. Because this requires every cluster to exist beforehand, the clustering was done offline, and the results were then evaluated, for example by false positive rate, to decide whether they were usable as signatures.

In the present invention, when new signature content (301) arrives as shown in FIG. 3, a Rabin fingerprint (RF) (302) is applied to it to generate abbreviated data (303) matching the incoming signature. When new abbreviated data is generated in this way, the signature content is inserted as new nodes into the tree (304) linked to that abbreviated data, and the hit count for the signature content is recorded at the last node of the tree. The abbreviated data is added to the abbreviated data list, and for each abbreviated data entry the address of the linked tree is mapped in linked-list form.

That is, to generate signatures, the patterns of packet data that are signature candidates are stored in trees. A signature string made up of multiple contents is then generated from the combination of subsequences that each occur X or more times.

The hit count at the last node means that the content on the path from the root node to that node has been inserted that many times in total. With similar patterns grouped together within a specific port, this count serves as a reference count for extracting the characteristics common within a cluster, and it is used to generate the signature string.
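A prefix tree with a hit counter on the last node of each inserted path is one plausible reading of this structure. The sketch below assumes that layout; `Node`, `SignatureTree`, and their methods are hypothetical names, and the figures may show a different node arrangement.

```python
class Node:
    def __init__(self):
        self.children = {}  # token -> child Node
        self.hits = 0       # number of inserted paths that end at this node


class SignatureTree:
    """Prefix tree over token sequences with a hit count at each path end."""

    def __init__(self):
        self.root = Node()

    def insert(self, tokens):
        """Insert a token sequence and return the updated hit count."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())
        node.hits += 1
        return node.hits

    def hit_count(self, tokens):
        """Hit count recorded for exactly this token sequence (0 if absent)."""
        node = self.root
        for t in tokens:
            node = node.children.get(t)
            if node is None:
                return 0
        return node.hits


if __name__ == "__main__":
    tree = SignatureTree()
    tree.insert([b"GET", b"/cgi-bin", b"cmd.exe"])
    tree.insert([b"GET", b"/cgi-bin", b"cmd.exe"])
    print(tree.hit_count([b"GET", b"/cgi-bin", b"cmd.exe"]))
```

Sequences that share a prefix share nodes, so the counter at a path end reflects how often that full sequence was seen.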

When a second signature (311) subsequently arrives, its abbreviated data (313) is obtained through the same process (302), and its similarity, distance, and inclusion are calculated against the abbreviated data (312) that holds the information on the existing signature.

If the inclusion (303) is close to 1, the newly added signature (311) can be said to be included in, or identical to, the existing signature (301). If the similarity (301) is at or above the set threshold, the signature content is inserted into the tree (313) pointed to by the matching existing abbreviated data (312), and the hit count of the content is recorded at the last node.

Otherwise, if the inclusion (303) and similarity (301) of the abbreviated data (313) of the second signature (311) are below the threshold, the current abbreviated data (313) is newly inserted into the abbreviated data list and a new tree (315) is created. Clustering is thus carried out by judging similarity up front and building trees accordingly.

Signatures are inserted into the matching tree according to the similarity and inclusion described above. When a sustained attack is repeated on a typical network, a large number of signatures of similar types arise within about three minutes. In the automatic signature generation system of the present invention, there is an abbreviated data list for each port, and for each abbreviated data entry the address of the tree is mapped in linked-list form.

That is, for a pattern (multi-subsequence) extracted from the payload of traffic arriving on each port and judged to be harmful, abbreviated data of a specific size is generated using, for example, a Karp-Rabin fingerprint, and this abbreviated data is used as the factor for identifying identical patterns and patterns included in it. The characteristics of each entry stored in a tree, which is one cluster, can be presented through the abbreviated data list.

The following equations are the Jaccard index formulas used in the invention to calculate the similarity, distance, and inclusion between character strings.

The similarity of document A and document B, each containing strings, takes a value between 0 and 1 and is given by Equation 1. Based on Equation 1, the logical distance between document A and document B is obtained from Equation 2.

The probability that document A is included in document B is calculated by Equation 3. Inclusion takes a value between 0 and 1.

R(A,B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|    (Equation 1)

D(A,B) = 1 - R(A,B)    (Equation 2)

C(A,B) = |S(A) ∩ S(B)| / |S(A)|    (Equation 3)
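The three quantities can be worked through on a small example. The sketch below assumes that S(X) is the set of tokens (or token fingerprints) of document X, which the text does not spell out, and it uses exact fractions so the arithmetic is easy to follow. The function names are hypothetical.

```python
from fractions import Fraction


def resemblance(a: set, b: set) -> Fraction:
    """R(A,B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|"""
    union = a | b
    return Fraction(len(a & b), len(union)) if union else Fraction(0)


def distance(a: set, b: set) -> Fraction:
    """D(A,B) = 1 - R(A,B)"""
    return 1 - resemblance(a, b)


def inclusion(a: set, b: set) -> Fraction:
    """C(A,B) = |S(A) ∩ S(B)| / |S(A)|"""
    return Fraction(len(a & b), len(a)) if a else Fraction(0)


if __name__ == "__main__":
    A = {"GET", "/cgi-bin", "cmd.exe"}
    B = {"/cgi-bin", "cmd.exe", "HTTP/1.1", "Host:"}
    # The intersection has 2 elements and the union has 5.
    print(resemblance(A, B), distance(A, B), inclusion(A, B), inclusion(B, A))
```

Here R(A,B) = 2/5, D(A,B) = 3/5, C(A,B) = 2/3, and C(B,A) = 1/2. Inclusion is not symmetric, which is why it can tell "A is part of B" apart from "A and B overlap".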

FIG. 4 is a flowchart of the method of generating a signature string from each tree produced by the process of FIG. 1. The method is outlined as follows.

First, a technique for finding the LCS (Longest Common Subsequence) shared within each tree is applied (S200), and the common sequence is extracted (S210).

Based on the extracted sequence, a signature string in multi-content form is generated (S220), and abbreviated data for the generated signature string is generated (S230).

It is determined whether the generated abbreviated data is identical to the abbreviated data of a previously generated signature string (S240). If it is not, the generated signature string is registered as a signature string to be compared against when judging harmful packets, and its abbreviated data is stored (S250). If it is identical, the duplicate-generation count of the existing signature string with the same abbreviated data is incremented.
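Steps S240 and S250 amount to a keyed registry with a duplicate counter. The sketch below is an assumption about one way to hold that record: `SignatureRegistry` is a hypothetical name, and SHA-256 from the standard library stands in for the abbreviated data of a signature string, which the text derives by fingerprinting.

```python
import hashlib


class SignatureRegistry:
    """Record of generated signature strings, keyed by a digest."""

    def __init__(self):
        self.signatures = {}  # digest -> signature string (tuple of contents)
        self.duplicates = {}  # digest -> number of repeated generations

    @staticmethod
    def digest(signature) -> str:
        h = hashlib.sha256()
        for content in signature:
            # Length prefix keeps (b"ab", b"c") distinct from (b"a", b"bc").
            h.update(len(content).to_bytes(4, "big"))
            h.update(content)
        return h.hexdigest()

    def register(self, signature) -> bool:
        """Return True if newly registered (S250), False if a duplicate."""
        key = self.digest(signature)
        if key in self.signatures:
            self.duplicates[key] += 1
            return False
        self.signatures[key] = tuple(signature)
        self.duplicates[key] = 0
        return True


if __name__ == "__main__":
    reg = SignatureRegistry()
    print(reg.register([b"/cgi-bin", b"cmd.exe"]))
    print(reg.register([b"/cgi-bin", b"cmd.exe"]))
```

Only the digest needs to be compared on each new signature string, so the check stays cheap even when many strings have been registered.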

That is, LCS (Longest Common Subsequence) is applied to each tree to extract the common sequence (subsequence) values present in the tree, and these values are generated as a signature string in multi-content form.
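Extracting the common sequence from the contents of one tree can be sketched with the textbook dynamic-programming LCS, folded pairwise over all sequences. The pairwise fold is a simplifying assumption of this sketch: it yields a common subsequence of all inputs but not necessarily the longest one, and the text does not say how the LCS is computed across more than two entries. The function names are hypothetical.

```python
from functools import reduce


def lcs(a: list, b: list) -> list:
    """Longest common subsequence of two token sequences (O(len(a)*len(b)))."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table forward to recover one LCS.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def common_sequence(sequences: list) -> list:
    """Fold lcs() over all sequences stored in one tree (heuristic)."""
    return reduce(lcs, sequences) if sequences else []


if __name__ == "__main__":
    cluster = [
        ["GET", "/cgi", "cmd.exe", "HTTP"],
        ["POST", "/cgi", "x", "cmd.exe", "HTTP"],
        ["/cgi", "HTTP"],
    ]
    print(common_sequence(cluster))
```

The tokens that survive the fold are the contents shared, in order, by every member of the cluster, which is what a multi-content signature string needs.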

Abbreviated data is generated for each signature string produced in this way, and a signature generation record is kept so that a duplicate signature is not generated repeatedly within a given period.

The reason for separately generating and managing abbreviated data for the generated signature strings is as follows.

Several pieces of abbreviated data carrying signature characteristics may exist for a single message. When multiple subsequences representing harmful patterns extracted from the payload enter the clustering engine, they are compared with the abbreviated data list of each cluster to determine whether the current pattern (signature candidate) is identical to, or included in, content already inserted in a tree; if so, it is not inserted into the tree. Only when it is similar but not included is it inserted into the tree serving as the same cluster.

Abbreviated data is a value generated by a kind of hashing of the incoming signature candidate (pattern). It can be used to determine similarity or inclusion, but generating the final signature string requires the signature candidates themselves. Abbreviated data is therefore used only up to the assignment of a cluster, and the data actually handled inside a cluster is the message that arrived as the signature candidate.

When tree-form signatures are used, the trees are reset at fixed intervals (for example, 15 or 30 minutes), depending on the volume of traffic arriving from the network, to overcome the limit on tree capacity. Even when the trees, which are the store of signature content, are reset, a record of the signatures generated from them is kept in separate memory in the form of abbreviated data.

The trees are reset because network traffic shows different characteristics over time, and the trees that manage signature candidates must reflect the characteristics of the current network. Periodic resets of the trees make it possible to generate signatures based on the characteristics of recently received traffic. The abbreviated data of the generated signature strings, however, is not reset: if a signature produced by clustering matches existing signature abbreviated data, the signature is not generated again and its duplicate-generation count is recorded instead.
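The reset policy can be sketched as a store that clears its candidate trees on a timer while leaving the record of generated signatures untouched. `ClusterStore`, its attributes, and the injectable `clock` are hypothetical; the 900-second default simply mirrors the 15-minute example in the text.

```python
import time


class ClusterStore:
    """Holds candidate trees (reset periodically) and the signature record (kept)."""

    def __init__(self, reset_interval: float = 900.0, clock=time.monotonic):
        self.reset_interval = reset_interval  # seconds; 900 s = 15 minutes
        self._clock = clock                   # injectable for testing
        self._last_reset = clock()
        self.trees = []                       # candidate trees, cleared on reset
        self.signature_counts = {}            # signature key -> duplicate count

    def maybe_reset(self) -> bool:
        """Clear the trees if the interval has elapsed; return True if reset."""
        now = self._clock()
        if now - self._last_reset >= self.reset_interval:
            self.trees.clear()
            self._last_reset = now
            return True
        return False


if __name__ == "__main__":
    store = ClusterStore(reset_interval=900.0)
    store.trees.append({"GET": {}})
    print(store.maybe_reset())
```

Calling `maybe_reset()` on each arriving candidate avoids a separate timer thread; a scheduler-driven reset would work equally well.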

The duplicate-generation count is recorded because, while the system is running, only the elements that make up the trees and clusters are reset. If the count shows that a particular signature has been generated frequently across the system, this means that packets of the same form have occurred continuously on the network. This is similar in meaning to the GUI of an IDS reporting how many times an attack with a particular pattern has occurred.

After the above steps, the content of the signature has been generated, but the associated packet data (5-tuple) values are not known. To produce this data, once a signature has been generated through clustering, a packet information collection flag is enabled; when incoming signature content satisfies the association and inclusion criteria with respect to the generated signature string, the 5-tuple is extracted from the newly arrived signature information and combined with the signature string to be provided as signature information. Providing the 5-tuple makes it possible to dump the payload for a signature generated from packet content, for example for real-time verification, and so increases the reliability of the signature.

That is, the system has a function that verifies, through session reassembly of the packets used when generating a signature, whether the part judged to be an actual attack is present in the packets. The basic data for this function is the session data of the packets. When the 5-tuple of a packet whose payload matches (or is similar to) the signature being generated is passed on, the payload data matching that 5-tuple (the payload data belonging to the session) can be dumped and compared. Dumping the session identified by the received 5-tuple yields values almost identical to the payload referenced when the signature was generated, so it can be determined whether the signature was generated correctly.
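Collecting the 5-tuples of packets that match a generated signature string can be sketched as below. The matching rule is an assumption made for illustration: a payload matches when every content of the signature appears in it in order. The text itself relies on its association and inclusion criteria, and `FiveTuple`, `matches`, and `collect_five_tuples` are hypothetical names.

```python
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def matches(signature: list[bytes], payload: bytes) -> bool:
    """True if every content of the signature occurs in the payload, in order."""
    pos = 0
    for content in signature:
        idx = payload.find(content, pos)
        if idx < 0:
            return False
        pos = idx + len(content)
    return True


def collect_five_tuples(signature, packets):
    """Return the 5-tuples of all (five_tuple, payload) pairs that match."""
    return [ft for ft, payload in packets if matches(signature, payload)]


if __name__ == "__main__":
    sig = [b"/cgi-bin", b"cmd.exe"]
    pkts = [
        (FiveTuple("10.0.0.1", "10.0.0.9", 40000, 80, "tcp"),
         b"GET /cgi-bin/..%5c../cmd.exe HTTP/1.0"),
        (FiveTuple("10.0.0.2", "10.0.0.9", 40001, 80, "tcp"),
         b"GET /index.html HTTP/1.0"),
    ]
    print(collect_five_tuples(sig, pkts))
```

The collected 5-tuples identify the sessions whose payloads can then be dumped and compared against the signature for verification.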

The invention can also be embodied as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include media implemented in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium can also be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, a person of ordinary skill in the art to which the invention belongs will understand that the invention may be practiced in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

FIG. 1 is a flowchart of the tree generation part of the signature generation method using the similarity and inclusion of characters according to the present invention.

FIGS. 2 and 3 illustrate how a document is tokenized and abbreviated data is generated during signature generation.

FIG. 4 is a flowchart of the part of the signature generation method using the similarity and inclusion of characters according to the present invention in which a signature string is generated from the generated trees.

Claims

A signature string generation method based on the similarity and inclusion of character strings, comprising: tokenizing an incoming packet into content units of a preset arbitrary size in order to generate a signature from the packet;

generating signature content as the set of the tokenized individual contents, and generating abbreviated data for the signature content;

comparing the abbreviated data with existing abbreviated data and, according to whether the abbreviated data is included in the existing abbreviated data or is similar to the existing abbreviated data by at least a predetermined criterion, classifying the signature content into a tree by adding it to an existing tree or inserting it into a new tree; and

separately storing a record of the signature strings generated from the trees of signature content, and resetting the trees at a preset period.

The method of claim 1, wherein the classifying comprises:

if the abbreviated data is not included in the existing abbreviated data but is similar to it by at least the predetermined criterion, adding the signature content as a new node to the tree that includes the existing abbreviated data; and

if the abbreviated data is neither included nor similar, adding the abbreviated data to an abbreviated data list, generating a new tree, and inserting the signature content into the new tree.

The method of claim 2,

wherein, when the signature content is added as a new node to the tree of the existing abbreviated data, the number of hits recorded in the last node of that tree is increased.

The method of claim 2,

wherein, when the abbreviated data is included in the existing abbreviated data, the number of hits recorded in the last node of the tree that includes the existing abbreviated data is increased without adding the signature content to that tree.
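The three outcomes described in the classification claims above (included, similar, neither) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Tree` class, the per-tree `hits` counter (standing in for the hit count kept in the last node), the first-match scan order, and the `threshold` value of 0.5 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Tree:
    """Hypothetical cluster: reference abbreviated data plus member contents."""
    abbreviated: frozenset
    contents: list = field(default_factory=list)
    hits: int = 0  # assumption: one counter per tree instead of per last node


def classify(content, abbreviated, trees, threshold=0.5):
    """Place one signature content; returns 'contained', 'similar', or 'new'."""
    abbreviated = frozenset(abbreviated)
    for tree in trees:  # assumption: the first matching tree wins
        common = len(abbreviated & tree.abbreviated)
        union = len(abbreviated | tree.abbreviated)
        if abbreviated and common == len(abbreviated):
            # Included: only the hit count grows, no node is added.
            tree.hits += 1
            return "contained"
        if union and common / union >= threshold:
            # Similar enough: add the content as a new node and count the hit.
            tree.contents.append(content)
            tree.hits += 1
            return "similar"
    # Neither included nor similar: start a new tree for this content.
    trees.append(Tree(abbreviated, [content]))
    return "new"
```

In this sketch the list `trees` plays the role of the abbreviated data list, since each `Tree` carries the abbreviated data it was created from.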

The method of claim 1,

wherein the inclusion, which indicates whether the abbreviated data is included in the existing abbreviated data, is calculated by the following equation, which is based on the Jaccard index:

C(A, B) = |S(A) ∩ S(B)| / |S(A)|

(Here, C indicates the degree to which signature content A is included in signature content B; it takes a value between 0 and 1, where 1 means that A is included.)

The method of claim 1,

wherein the similarity, which indicates whether the abbreviated data is similar to the existing abbreviated data by at least the predetermined criterion, is calculated by the following equation, which is based on the Jaccard index:

R(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|

(Here, R indicates the degree to which signature contents A and B are similar; it takes a value between 0 and 1, and the closer the value is to 1, the higher the similarity.)
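The two measures can be compared on a small worked example. The sketch below assumes that S(·) is a set of abbreviated data items; readable strings are used in place of fingerprints purely for illustration, and the function names are hypothetical.

```python
def containment(a, b):
    """C(A, B) = |S(A) ∩ S(B)| / |S(A)|: share of A's items also found in B."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


def resemblance(a, b):
    """R(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|: the Jaccard index of A and B."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


s_a = {"GET ", "/cmd", ".exe"}
s_b = {"GET ", "/cmd", ".exe", "?/c+", "dir "}

print(containment(s_a, s_b))  # 1.0 (every item of A appears in B)
print(containment(s_b, s_a))  # 0.6 (only 3 of B's 5 items appear in A)
print(resemblance(s_a, s_b))  # 0.6 (3 shared items out of 5 distinct items)
```

Containment is asymmetric (C(A, B) and C(B, A) differ here), whereas resemblance gives the same value in either order.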

The method of claim 1,

wherein the abbreviated data is generated by applying a Rabin fingerprint (RF) to the signature content.
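One way to picture the generation of abbreviated data is the sketch below. It does not implement a true Rabin fingerprint (which reduces the input modulo an irreducible polynomial over GF(2)); as a plainly named stand-in it uses a Rabin-Karp style polynomial hash modulo a prime. The sliding-window tokenization, the window size of 4 bytes, and the constants `BASE` and `MOD` are assumptions made for illustration.

```python
BASE = 257
MOD = (1 << 61) - 1  # a Mersenne prime; an arbitrary choice for this sketch


def fingerprint(token: bytes) -> int:
    """Polynomial hash of one token (stand-in for a Rabin fingerprint)."""
    value = 0
    for byte in token:
        value = (value * BASE + byte) % MOD
    return value


def tokenize(payload: bytes, size: int = 4) -> list:
    """Split a payload into overlapping content units of `size` bytes."""
    if len(payload) < size:
        return [payload] if payload else []
    return [payload[i:i + size] for i in range(len(payload) - size + 1)]


def abbreviated_data(payload: bytes, size: int = 4) -> frozenset:
    """Set of token fingerprints used when comparing signature contents."""
    return frozenset(fingerprint(token) for token in tokenize(payload, size))
```

With overlapping windows, the abbreviated data of a payload prefix such as `b"GET /index"` is a subset of the abbreviated data of `b"GET /index.html"`, which is the situation the inclusion measure is meant to detect.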

The method of claim 2,

wherein the abbreviated data list maps, for each abbreviated data entry it contains, the address of the corresponding tree in the form of a linked list.

A signature string generation method based on the similarity and inclusion of strings, comprising: placing signature content, tokenized from an incoming data packet, in a tree that will form a signature string, according to inclusion between the abbreviated data of signature contents or similarity at or above a predetermined criterion;

extracting a common permutation from the at least one signature content included in each tree;

generating a multi-content signature string based on the extracted permutation; and

generating abbreviated data of the signature string, comparing it with the abbreviated data of an existing signature string, and, if the comparison shows that it is not the same as the abbreviated data of the existing signature string, registering the signature string as harmful packet identification data.

The method of claim 9,

wherein the common permutation of the signature content is extracted by determining a longest common subsequence (LCS) for each of the trees.
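The LCS step can be illustrated with the standard dynamic-programming algorithm for two strings. Folding it pairwise over all contents of a tree, as `common_sequence` does below, is a simplifying assumption: it always yields a subsequence common to every content, but not necessarily the longest one. Both function names are hypothetical.

```python
from functools import reduce


def lcs(a: str, b: str) -> str:
    """Longest common subsequence of two strings via dynamic programming."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back from the bottom-right corner to recover one LCS.
    out = []
    i, j = rows, cols
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def common_sequence(contents):
    """Fold pairwise LCS over all contents of one tree (a heuristic)."""
    return reduce(lcs, contents) if contents else ""


print(common_sequence(["GET /a.php?id=1", "GET /b.php?id=2", "GET /c.php?id=3"]))
# GET /.php?id=
```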

The method of claim 9,

wherein, if the comparison shows that the abbreviated data of the signature string is not the same as the abbreviated data of the existing signature string, the abbreviated data of the signature string is separately stored as a record of the signature for the tree that is to be reset.

The method of claim 11,

wherein, if the comparison shows that the abbreviated data of the signature string is the same as the abbreviated data of the existing signature string, a duplicate generation count for the existing signature string is increased in order to prevent duplication of the signature string.
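The registration and duplicate-counting behavior described in the last claims can be sketched as a small registry keyed by the abbreviated data of each signature string. The `SignatureRegistry` class and its method names are hypothetical, and a SHA-1 digest is used as a stand-in for the abbreviated data rather than a Rabin fingerprint.

```python
import hashlib


class SignatureRegistry:
    """Keeps one entry per distinct signature string plus a duplicate counter."""

    def __init__(self):
        self.signatures = {}  # abbreviated data -> signature string
        self.duplicates = {}  # abbreviated data -> times generated again

    def register(self, signature: str) -> bool:
        """Return True if newly registered, False if it was a duplicate."""
        # Assumption: a SHA-1 digest stands in for the abbreviated data.
        key = hashlib.sha1(signature.encode("utf-8")).hexdigest()
        if key in self.signatures:
            # Same abbreviated data already known: only count the repeat.
            self.duplicates[key] += 1
            return False
        self.signatures[key] = signature
        self.duplicates[key] = 0
        return True
```

Because the registry is kept separately from the trees, it would survive a periodic tree reset, which matches the separate storage of signature records described in the claims.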