KR100877911B1 - Method for detection of p2p-based botnets using a translation model of network traffic

Info

Publication number: KR100877911B1
Application number: KR1020080010034A
Authority: KR
Inventors: 노봉남; 김동국; 김용민; 노상균; 문길종
Original assignee: 전남대학교산학협력단
Priority date: 2008-01-31
Filing date: 2008-01-31
Publication date: 2009-01-12

Abstract

A method for detecting P2P (peer-to-peer) based botnets using a network traffic state transition model is provided. The method detects botnets quickly and accurately by generating a detection model and matching it against suspicious traffic on a P2P network. Network connection flows are collected and clustered into traffic flow groups. Information on the state transitions of the traffic flow groups is aggregated, and a multi-frequency matrix is formed from it and used for training. Through this learning process, a probability matrix is generated for every botnet attack and for normal traffic, yielding a multi-state transition probability model. The probability model is matched against state transition information collected from network traffic in real time, and the presence of a malicious botnet is determined by comparing the matching result with a preset threshold.

Description

Method for detection of P2P-based botnets using a translation model of network traffic

The present invention relates to a method for detecting P2P-based botnets using a network traffic transition model. More specifically, in order to formalize the staged behavioral characteristics that appear from the initial formation and construction of a botnet, the invention extracts and analyzes behavior-based information of P2P botnets at the network level, identifies and models the attack characteristics of the botnet, and provides a detection process based on matching suspicious traffic, so that damage to P2P-based network systems caused by botnets can be countered quickly and accurately.

Since the Morris worm, regarded as the first Internet worm, appeared in 1988, Internet attacks by worms have continued without pause. Unlike earlier attacks, which were malicious acts without a specific purpose or were concerned only with rapid propagation, many recently discovered worms seek to collect and leak personal information for financial gain. They are also evolving to hand control of the infected system to the attacker, so that it can be abused as a stepping stone for further, more damaging and organized attacks. Out of this trend a new form of attack called the malicious bot has emerged from existing worms; the damage it causes has already risen sharply, and research into active countermeasures is under way.

Systems infected by malicious bots form their own network, a botnet, which has developed into a threat capable of spreading large-scale attacks of many kinds, such as spam mail and distributed denial-of-service attacks, very quickly and effectively. To maintain and manage the infected systems, an attacker needs a separate botnet C&C (Command and Control) server, and IRC (Internet Relay Chat) servers have mainly been used for this purpose; in that case the individual bots act as IRC clients.

Because bot source code has leaked, more than 9,000 bot variants currently exist, and even antivirus vendors cannot detect all of them. More than 1,500 botnet C&C servers are already known worldwide. The 2007 World Economic Forum (WEF) warned that botnets, a "global pandemic", are the greatest threat to the future of the Internet, and stated that of the 600 million PCs connected to the Internet, about 100 to 150 million are already being used in botnets.

To prevent such damage, a cooperation framework for responding to malicious bots has been built since 2005 with domestic ISP/IDC operators and overseas CERT teams, and DNS sinkholes are operated to block connections from bot-infected PCs to the botnet C&C servers that control them. A DNS sinkhole is a policy that exploits the step in which a bot attempting to reach a botnet C&C server queries a domain name to obtain the server's changing IP address: by blocking DNS queries for that domain, it prevents the infected PC from connecting to the C&C server in the first place. In this case, the control model based on a botnet C&C server becomes a fatal weakness of the botnet.

However, as more bots were blocked by such measures, attackers came to need botnets that do not require a server, which led to the emergence of P2P (peer-to-peer) based botnets. Building a P2P botnet makes spreading and controlling the bots somewhat harder, but it offers the major advantage that the network can be built without relying on a centralized server, so its popularity among bot developers is expected to grow.

In general, network-level virus detection has relied on quantitative analysis of incoming traffic, signature analysis of the data area of individual packets, and DPI (Deep Packet Inspection). For general intrusion detection, attempts have also been made to extract and analyze network traffic information in units of flows or of information aggregated over several packets. In a botnet, however, two kinds of traffic coexist: large volumes of distributed traffic like that of a worm, and independent, specialized traffic arising from communication with the attacker or with individual peers. Worm propagation is only one stage of botnet growth, and without a view of the overall flow it can be mistaken for ordinary malware propagation. Moreover, because traffic between the peers of a P2P-based botnet may be encrypted, detection must not depend on the data area.

Therefore, detecting P2P-based botnets more quickly and accurately requires analysis based on a staged definition and classification of botnet traffic as a whole, together with a detection method that does not depend on the characteristics of the data area.

The present invention has been made to meet this need.

Accordingly, an object of the present invention is to provide a botnet detection method that generates a detection model according to the traffic characteristics of P2P-based botnets and uses that model to detect botnets more quickly and accurately by matching suspicious traffic on a P2P network.

To achieve the above object, the present invention performs the steps of: collecting and clustering traffic connection flows on a network to generate traffic flow groups, aggregating information on the state transitions of the traffic flow groups, and forming a multi-frequency matrix from the aggregated information to carry out a learning process; generating, through the learning process, a probability matrix for every botnet attack and a probability matrix for every kind of normal traffic to obtain a multi-state transition probability model; and performing model matching between the probability model and state transition information collected in real time from network traffic, and comparing the matching result with a preset threshold to detect whether a malicious botnet is present.

In addition, the multi-state transition probability model distinguishes a group of probability models based on the traffic state transitions of already discovered botnets from a group based on the traffic state transitions of undiscovered botnets. When model matching detects a botnet, the detection is judged to be a known attack and treated as misuse detection if the detected botnet belongs to the model group for already discovered botnets. Otherwise the detected botnet is assigned to the model group for undiscovered botnets, judged to be an unknown attack, and treated as anomaly detection.

Here, the multi-state transition probability model is generated by a Markov chain process, and the model matching is performed by a log-likelihood ratio test, which computes the ratio to the normal likelihood in order to normalize the likelihood.
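The Markov chain model and the log-likelihood ratio test described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the state names, the additive smoothing constant `alpha`, the toy training sequences, and the threshold value are all assumptions introduced for the example.

```python
import math

def train_transition_matrix(sequences, states, alpha=1.0):
    """Estimate first-order Markov transition probabilities from state sequences.

    alpha is an additive-smoothing constant (an assumption, not from the source)
    that keeps unseen transitions from getting zero probability.
    """
    # Frequency matrix: how often state a was followed by state b.
    counts = {a: {b: 0 for b in states} for a in states}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    # Row-normalize the smoothed counts into a probability matrix.
    probs = {}
    for a in states:
        total = sum(counts[a].values()) + alpha * len(states)
        probs[a] = {b: (counts[a][b] + alpha) / total for b in states}
    return probs

def log_likelihood(seq, probs):
    """Log-probability of the observed transitions under one model."""
    return sum(math.log(probs[a][b]) for a, b in zip(seq, seq[1:]))

def log_likelihood_ratio(seq, attack_probs, normal_probs):
    """Attack log-likelihood relative to the normal log-likelihood."""
    return log_likelihood(seq, attack_probs) - log_likelihood(seq, normal_probs)

# Hypothetical flow-group states and toy training data.
states = ["S1", "S2", "S3"]
attack_model = train_transition_matrix([["S1", "S2", "S3", "S1", "S2", "S3"]], states)
normal_model = train_transition_matrix([["S1", "S1", "S2", "S2", "S1", "S1"]], states)

THRESHOLD = 0.0  # hypothetical preset threshold
observed = ["S1", "S2", "S3", "S1"]
score = log_likelihood_ratio(observed, attack_model, normal_model)
is_botnet = score > THRESHOLD
```

Dividing by the normal likelihood (subtracting in log space) makes scores comparable across sequences of different lengths and across models, which is why a single threshold can be applied to the result.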

As described above, the present invention allows botnets on a P2P network to be detected more quickly and accurately, so that damage caused by botnets can be responded to promptly, thereby providing a safe and stable P2P network environment.

An embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a block diagram of a system for performing the botnet detection process according to the present invention.

Referring to the drawing, the system consists of a learning module, a multi-state transition probability model, and a detection module. This configuration can be implemented as a program on a computer connected to the P2P network.

The learning module collects and clusters traffic connection flows on the network to generate traffic flow groups, aggregates information on the state transitions of the traffic flow groups, and forms a multi-frequency matrix from the aggregated information to carry out the learning process.

The multi-state transition probability model is formed by generating, through the learning process, a probability matrix for every botnet attack and a probability matrix for every kind of normal traffic. In particular, the probability models can be divided into a model group based on the traffic state transitions of already discovered botnets and a model group based on the traffic state transitions of undiscovered botnets.

The detection module performs model matching between the multi-state transition probability model and state transition information collected in real time from network traffic, and compares the matching result with a preset threshold to detect whether a malicious botnet is present.

The process by which the system detects a botnet is described in detail below with reference to the accompanying drawings.

1. The process for reducing network traffic in stages is described first.

In a P2P-based botnet, a single peer bot tries to connect to as many external peers as possible to form a network. It therefore generates a large volume of staged traffic, with one peer discovery and information exchange sequence for each connected peer. In other words, traffic with the same flow pattern occurs repeatedly at different times.

FIG. 2 shows the increase in packets, in 5-minute units, generated by the known SpamThru bot as it communicates with individual peers. Although the connections start at different times, the shapes of the traffic variation are similar to one another.

This traffic needs to be classified appropriately, and similar traffic needs to be grouped by the staged behavioral characteristics that appear from the initial formation and construction of the botnet. In this work, to reduce the massive amount of TCP and UDP traffic generated by a P2P-based botnet to clusters of staged connection flows, a pattern is generated for each individual traffic flow and clustering is performed based on a similarity comparison algorithm.

A. Network traffic connection flow segmentation

Today the volume of traffic from P2P applications already far exceeds that of HTTP and FTP. In particular, the development of P2P file-sharing programs has greatly changed the direction and types of network traffic flows. Analyzing such traffic requires segmenting it into characteristic unit traffic flows.

1) TCP connection flow segmentation

TCP is a connection-oriented protocol that provides session-managed transmission units for reliable delivery. Because the three-way handshake that establishes a session and the procedure that terminates it are clearly defined, connection flow segmentation is relatively easy.

For more accurate traffic analysis, however, protocol flooding attacks must be distinguished from simple packet retransmissions. FIG. 3(a) shows the same SYN packet being sent repeatedly during the initial connection establishment procedure. Because this can be part of a SYN flood attack, it carries considerable significance, and even a single such packet should be treated as an independent traffic flow. If, however, the receiver subsequently responds with a SYN,ACK packet as in FIG. 3(b), the earlier duplicate SYN packets lose that significance, so the segmentation technique cancels them out, which improves the accuracy of traffic segmentation.

2) UDP connection flow segmentation

Unlike TCP, UDP does not follow a reliable transport protocol. The criteria for segmenting its traffic into connection flows are therefore ambiguous, which makes segmentation somewhat difficult. In this work, UDP traffic sharing the same socket pair is segmented by time unit.

The same behavior as in TCP can appear here as well. A connection request packet that is retransmitted and duplicated as in FIG. 4(a) is cancelled out once a response is received as in FIG. 4(b); it is classified as a normal retransmission and is dropped during connection flow segmentation.
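Segmenting UDP traffic by socket pair and time unit could look like the sketch below. The source does not state the length of the time unit or how windows are aligned, so the 60-second `window`, the alignment to the first packet of each socket pair, and the packet tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

def split_udp_flows(packets, window=60.0):
    """Group UDP packets by socket pair, then cut each group into time windows.

    packets: iterable of (timestamp, src_ip, src_port, dst_ip, dst_port).
    window:  length of one time unit in seconds (hypothetical value).
    Returns a dict mapping (socket_pair, window_index) to a list of packets.
    """
    first_seen = {}
    flows = defaultdict(list)
    for pkt in sorted(packets):  # process in timestamp order
        ts, sip, sport, dip, dport = pkt
        # Both directions of a conversation share one socket pair key.
        pair = tuple(sorted([(sip, sport), (dip, dport)]))
        first_seen.setdefault(pair, ts)
        slot = int((ts - first_seen[pair]) // window)
        flows[(pair, slot)].append(pkt)
    return dict(flows)

sample = [
    (0.0, "10.0.0.1", 4000, "10.0.0.2", 53),
    (10.0, "10.0.0.2", 53, "10.0.0.1", 4000),
    (130.0, "10.0.0.1", 4000, "10.0.0.2", 53),
]
udp_flows = split_udp_flows(sample)
```

With this input, the first two packets fall into the same window and form one flow, while the packet at 130 seconds starts a separate flow for the same socket pair.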

B. Similarity comparison of traffic connection flow units

A pattern is generated for each traffic connection flow obtained by the segmentation technique described above. The similarity of these patterns is then compared to cluster connection flow patterns of the same kind, the states of the clusters are defined, and their transitions are analyzed in order to extract the state transition information of the overall P2P botnet traffic.

1) Traffic connection flow pattern generation

In this embodiment, a unit network flow means an active connection between two computers for communication with a single purpose. Through such a connected flow the computers exchange data with each other and carry out the smallest unit of work in the communication.

Table 1 shows the measures used for traffic connection flow analysis. Because TCP is connection-oriented, its transport characteristics allow each flow to be delimited relatively accurately. A TCP connection is initially established by a three-way handshake, and the end of a session is recognized when the two hosts exchange packets with the FIN or RST bit set in the flag field. UDP has no flags, so the flag measure is omitted for UDP.

Table 1

TCP                                        UDP
Measure                  Range of values   Measure                  Range of values
Transmission direction   0, 1              Transmission direction   0, 1
Flag                     1 ~ 63            Data length              0 ~ MTU
Data length              0 ~ MTU

Table 2 lists, packet by packet and in terms of each measure, a single TCP connection flow of an SMTP service exchanged with port 25 of a server. To distinguish protocol flooding attacks from packet retransmissions, the records of connection request packets that were retransmitted before the connection was established during the initial three-way handshake are removed. In addition, ACK response packets are ignored throughout, and once a packet with the FIN or RST flag set is sent to terminate the connection, the subsequent termination-phase traffic is ignored as well. These are properties common to all TCP traffic and therefore do not help to distinguish flows.

Table 2

Packet    (Direction) : (Flag) : (Data size)    Flags set
pkt_1     0 : 2 : 0                              ← SYN
          0 : 2 : 0                              ← SYN
pkt_2     1 : 18 : 0                             → SYN,ACK
          0 : 16 : 0                             ← ACK
pkt_3     1 : 24 : 84                            → PSH,ACK
          0 : 16 : 0                             ← ACK
pkt_4     0 : 24 : 26                            ← PSH,ACK
pkt_5     1 : 24 : 26                            → PSH,ACK
pkt_6     0 : 24 : 26                            ← PSH,ACK
pkt_7     1 : 24 : 48                            → PSH,ACK
pkt_8     0 : 24 : 41                            ← PSH,ACK
pkt_9     1 : 24 : 48                            → PSH,ACK
pkt_10    0 : 24 : 35                            ← PSH,ACK
pkt_11    1 : 24 : 34                            → PSH,ACK
pkt_12    0 : 24 : 6                             ← PSH,ACK
pkt_13    1 : 24 : 50                            → PSH,ACK
pkt_14    0 : 24 : 1024                          ← PSH,ACK
          1 : 16 : 0                             → ACK
pkt_15    0 : 24 : 295                           ← PSH,ACK
pkt_16    1 : 24 : 19                            → PSH,ACK
pkt_17    0 : 24 : 6                             ← PSH,ACK
pkt_18    1 : 24 : 24                            → PSH,ACK
pkt_19    0 : 17 : 0                             ← FIN,ACK
          1 : 16 : 0                             → ACK
          1 : 17 : 0                             → FIN,ACK
          0 : 16 : 0                             ← ACK

Rows without a packet label are the packets that are excluded from the pattern.
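The three exclusion rules stated for Table 2 (drop a retransmitted SYN once the connection is answered, ignore bare ACKs, ignore everything after the first FIN or RST) can be sketched as a small filter. The function name and the tuple layout are hypothetical; the flag bit values are the standard TCP ones. Applied to the 26 rows of Table 2, the sketch keeps the 19 labeled packets.

```python
FIN, SYN, RST, PSH, ACK = 0x01, 0x02, 0x04, 0x08, 0x10  # standard TCP flag bits

def reduce_tcp_flow(packets):
    """Filter one TCP flow given as (direction, flags, data_length) tuples."""
    # The connection counts as established if any SYN,ACK was seen.
    established = any((f & SYN) and (f & ACK) for _, f, _ in packets)
    kept, syn_seen = [], False
    for direction, flags, length in packets:
        if flags == SYN:
            # A repeated SYN that is later answered is a plain retransmission.
            if syn_seen and established:
                continue
            syn_seen = True
        if flags == ACK:
            continue  # bare ACKs are common to all TCP traffic
        kept.append((direction, flags, length))
        if flags & (FIN | RST):
            break  # ignore the rest of the termination phase
    return kept

# The rows of Table 2 in capture order.
table2 = [
    (0, 2, 0), (0, 2, 0), (1, 18, 0), (0, 16, 0), (1, 24, 84), (0, 16, 0),
    (0, 24, 26), (1, 24, 26), (0, 24, 26), (1, 24, 48), (0, 24, 41),
    (1, 24, 48), (0, 24, 35), (1, 24, 34), (0, 24, 6), (1, 24, 50),
    (0, 24, 1024), (1, 16, 0), (0, 24, 295), (1, 24, 19), (0, 24, 6),
    (1, 24, 24), (0, 17, 0), (1, 16, 0), (1, 17, 0), (0, 16, 0),
]
reduced = reduce_tcp_flow(table2)
```

If no SYN,ACK ever arrives, the repeated SYN packets are kept, which matches the text's point that unanswered duplicates may indicate a SYN flood.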

Table 3

Format:         <Protocol>|<Start time>|<Internal port>_<External port>|<Transmission and reception pattern>
Header part:    T|1179824184.443993|1379_25
Content part:   0:2:0-0+1,1:18:0-0+2,1:24:19-84+3;5;7;9;11;13;16;18,0:24:6-295+4;6;8;10;12;15;17,0:24:1024-1024+14,0:17:0-0+19

In this embodiment, such unit traffic is formalized as a sequential pattern. The pattern is divided into a header part and a content part. The header part contains the protocol type, the time at which the connection request started (with microsecond resolution), and the port numbers at both ends of the communication. The content part lists the sequential flow information of the packets that make up the actual traffic, arranged according to the measure values. Table 3 shows the sequential pattern generated for the connection flow of Table 2. The sequential pattern is built from the syntax symbols in Table 4. The content part is basically formed by listing, after the '+' symbol, the sequence numbers at which identical packets occur.

Table 4

Symbol   Meaning
|        Separator between the header attributes and the content part
_        Separator between the internal and external ports in the header part
:        Separator between the measures within a packet
-        Range separator for a measure with consecutive attribute values
+        Separator introducing the list of packet sequence numbers
;        Separator between packet sequence numbers
,        Separator between different packets
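Using only the delimiters of Table 4, the Table 3 pattern can be split back into its fields. The sketch below is one possible reading; the function name and the dictionary keys are hypothetical, and the start time is kept as a string so that its microsecond digits are not altered.

```python
def parse_pattern(pattern):
    """Split a sequential pattern into header fields and packet groups."""
    protocol, start_time, ports, content = pattern.split("|")
    internal_port, external_port = ports.split("_")
    groups = []
    for item in content.split(","):          # ',' separates different packets
        fields, seq_numbers = item.split("+")  # '+' introduces sequence numbers
        direction, flags, size_range = fields.split(":")
        low, high = size_range.split("-")      # '-' separates the size range
        groups.append({
            "direction": int(direction),
            "flags": int(flags),
            "size_range": (int(low), int(high)),
            "packets": [int(n) for n in seq_numbers.split(";")],
        })
    return {
        "protocol": protocol,
        "start_time": start_time,
        "internal_port": int(internal_port),
        "external_port": int(external_port),
        "groups": groups,
    }

table3 = ("T|1179824184.443993|1379_25|0:2:0-0+1,1:18:0-0+2,"
          "1:24:19-84+3;5;7;9;11;13;16;18,0:24:6-295+4;6;8;10;12;15;17,"
          "0:24:1024-1024+14,0:17:0-0+19")
parsed = parse_pattern(table3)
```

The six groups together reference sequence numbers 1 through 19, one for each labeled packet in Table 2.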

2) Similarity measurement function for sequential data

The preceding section described how each traffic flow is formalized as a sequential pattern, ultimately so that it can serve as the unit of comparison in similarity testing. To compare the similarity of unit traffic flows, a gap penalty matrix based on dynamic programming is used.

For example, suppose there are two sequences whose elements may repeat, S_i = {A, B, G, C, C, B, A, E, C, F} and S_j = {C, B, C, D, D, E, E, B, D, C, F, F}. The concepts of regular alignment (Equation 1 and FIG. 5) and inverted alignment (Equation 2 and FIG. 6), which are used to describe word-order differences between two different languages, can be applied to them.

Regular alignment: $i \rightarrow j = S_i$ (Equation 1)

Inverted alignment: $j \rightarrow i = S_j$ (Equation 2)

In this embodiment, by contrast, a combined alignment is proposed, because the nature of packet flows means that an element of one sequence cannot always be matched exactly one-to-one with an element of the other along the alignment path. As FIG. 7 shows, this alignment is symmetric. The arrow connecting the two B¹ elements has two arrowheads to indicate the symmetric match.

The combined alignment requires cross alignment for independent partial orders such as B¹|C¹ and B²|E¹. As a result, the following elements C² and C³ acquire paths to an earlier, duplicated element, as shown in FIG. 8. This makes it difficult to assign scores for comparing the similarity of the two sequences. The problem is resolved with predefined rules.

To compare the similarity of two sequences, it is efficient to apply the similarity comparison used for categorical data. The similarity comparison function for the sequences S_i and S_j is given by Equation 3.

$\mathrm{sim}(S_i, S_j) = \dfrac{|S_i \cap S_j|}{|S_i \cup S_j|}$

The similarity of two sequences can basically be obtained as the ratio of their intersection to their union. Note, however, that when the sequences being compared are regarded as sets, their elements are ordered. The intersection of two sequences therefore cannot be obtained simply by counting the elements they share; the order must be taken into account. To handle this, the proposed similarity comparison algorithm applies dynamic programming.

To compute the intersection, a gap penalty matrix is constructed first. FIG. 9 shows how the score is calculated at each intersected block. The final intersection score is the sum of these scores. If block (x_prev, y_prev) is the previous intersected block of block (x, y), then x_prev < x and y_prev < y must hold. The score of a block is the reciprocal of its distance from the previous intersected block, and the distance is calculated using the gap penalty concept.

$BS_{(x,y)} = \dfrac{1}{D_{(x,y)}}$

BS_(x,y) denotes the block score (Block Scoring) of the intersected block (x, y). D_(x,y) is the distance from the nearest previous intersected block (x_prev, y_prev) to block (x, y); it also corresponds to the number of blocks lying between (x_prev, y_prev) and (x, y). The nearest previous intersected block (x_prev, y_prev) of the current intersected block (x, y) is determined by the three rules defined in Table 5.

Table 5

No.        Rule
Rule 1     The previous matched coordinate must be the one whose |x_any * y_any| is the largest among all values smaller than |x * y|.
Rule 2     If two or more previous matched coordinates have the same |x_any * y_any|, the one with the larger |x_any + y_any| is selected as the previous matched coordinate.
Rule 3     If two or more previous matched coordinates have the same |x_any * y_any| and the same |x_any + y_any|, the following two-step calculation is applied.
Rule 3-1   If the x coordinate and the y coordinate are equal, no further consideration is needed: the penalty is the same whichever previous matched coordinate is selected.
Rule 3-2   If the x coordinate differs from the y coordinate, |x - x_any| and |y - y_any| are calculated for each previous matched coordinate and the larger of the two is taken; the coordinate with the smallest such value is selected as the previous matched coordinate.

In FIG. 9, the initial intersected block is (0, 0), and its BS is 0. By Rule 2, the nearest previous intersected block of block (5, 3) is block (4, 1). Likewise, by Rule 3-2, the nearest previous intersected block of block (9, 10) is block (8, 6). Block (10, 11) has a BS of 1, the maximum value, because element F is matched immediately after the matched element C in both sequences.

이와 같이 비교 점수제(comparison scoring)는 이전 순서로부터의 단계적인 전이에 기반하고 있다.As such, comparison scoring is based on gradual transitions from previous sequences.

따라서 SCS(Sequence Comparison Scoring)은 모든 BS들의 합과 같고, 두 서열 S _i 와 S _j 의 SCS는 2.56(= 1/4 + 1/4 + 1/2 + 1/5 + 1/9 + 1/4 + 1)이 되며, 이는 바로 구하려던 교집합의 수가 된다.Therefore, SCS (Sequence Comparison Scoring) is equal to the sum of all BSs , and SCS of the two sequences S _i and S _j is 2.56 (= 1/4 + 1/4 + 1/2 + 1/5 + 1/9 + 1 / 4 + 1), which is the number of intersections you want to find.

수학식 6에 의한 유사도 값은 0과 1 사이의 값을 가진다. 여기서, 합집합은 유사성 비교 함수의 일반화를 위해 사용된다. 합집합의 수는 도 10에서처럼 단순히 전체 합집합의 중복된 요소들의 수와 같다.The similarity value according to Equation 6 has a value between 0 and 1. Here, the union is used for generalization of the similarity comparison function. The number of unions is simply equal to the number of overlapping elements of the entire union as in FIG.

그러므로 최종적으로, 두 서열 S _i 와 S _j 의 유사도는 0.17(= 2.56 / 15)이 된다. 이러한 서열간 유사성 비교 방법은 순차패턴에 그대로 적용되며, 표 6에서는 본 실시예가 제안하는 유사도 측정함수의 알고리즘을 보인다.Therefore, finally, the similarity of the two sequences S _i and S _j is 0.17 (= 2.56 / 15). This similarity comparison method between sequences is applied to the sequential pattern as it is, Table 6 shows the algorithm of the similarity measurement function proposed in this embodiment.

procedure SeqSim(S_i, S_j)
begin
    set I[x, y] for every intersected block between S_i and S_j
    intersection := 0
    for each (x, y) ∈ I do {
        (x_prev, y_prev) := select_nearest_prev-intersected_block(x, y)
        D := compute_distance(x, y, x_prev, y_prev)
        BS := reciprocal(D)
        intersection := intersection + BS
    }
    union := size of bag { S_i ∪ S_j }
    similarity := intersection / union
end

C. Generating representative traffic flow groups by clustering

In this embodiment, the sequential patterns of the connection flows are clustered in order to condense the enormous number of traffic connection flows effectively, using ROCK, a data-mining clustering algorithm. ROCK can cluster categorical rather than numerical data, and it forms clusters based on the link relationships among clusters. The more closely related two clusters are, the more links they share. Specifically, two clusters are neighbors when their similarity under the similarity function presented above is at or above a given threshold, and the number of common neighbors of two clusters is defined as their link count. Sub-clusters that belong to the same cluster generally have many common neighbors and therefore many links. Merging the pairs with the most links first thus produces meaningful clusters. ROCK introduces a goodness measure for merging clusters and uses random sampling to improve scalability.

As a preprocessing step for clustering, the ROCK algorithm builds an adjacency matrix, as shown in FIG. 11, from the pairwise similarity values of all clusters. An entry is 0 if the similarity of the two sequential patterns is below the threshold θ and 1 if it is greater than or equal to θ, so the adjacency matrix contains only 0s and 1s. The same threshold is used again in the goodness calculation.
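The thresholding step can be sketched as follows. This is a minimal illustration only: the function name `adjacency_matrix` and the similarity values are hypothetical and are not taken from FIG. 11.

```python
def adjacency_matrix(similarity, theta):
    """Return a 0/1 matrix: 1 where similarity >= theta, otherwise 0."""
    return [[1 if s >= theta else 0 for s in row] for row in similarity]


# Hypothetical pairwise similarities for three sequential patterns.
sim = [
    [1.00, 0.17, 0.60],
    [0.17, 1.00, 0.40],
    [0.60, 0.40, 1.00],
]
adj = adjacency_matrix(sim, theta=0.4)
```

A value exactly equal to the threshold (0.40 here) counts as a neighbor, matching the "greater than or equal to" rule in the text.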

A link matrix is then generated from this adjacency matrix, as shown in FIG. 12.

As an example of generating the link matrix, the link count link[D,F] for clusters D and F (equal to link[F,D]) is obtained by the following calculation.

link[F,D] = adjacency matrix (horizontal axis) × adjacency matrix (vertical axis)

= 0×0 + 0×0 + 1×1 + 1×0 + 1×1 + 0×1 = 2
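The arithmetic above is a dot product of two adjacency vectors. The sketch below reproduces it; the two vectors are read off the worked sum, and the helper names `link_count` and `link_matrix` are illustrative assumptions.

```python
def link_count(vec_a, vec_b):
    """Number of common neighbors: dot product of two 0/1 adjacency vectors."""
    return sum(a * b for a, b in zip(vec_a, vec_b))


def link_matrix(adj):
    """link[i][j] = sum over k of adj[i][k] * adj[k][j]."""
    n = len(adj)
    return [
        [link_count(adj[i], [adj[k][j] for k in range(n)]) for j in range(n)]
        for i in range(n)
    ]


# Operands taken from the worked sum 0x0 + 0x0 + 1x1 + 1x0 + 1x1 + 0x1.
first = [0, 0, 1, 1, 1, 0]
second = [0, 0, 1, 0, 1, 1]
links = link_count(first, second)
```

Only the two positions where both vectors hold a 1 contribute, which is why the link count is the number of common neighbors.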

Clustering is then repeated on the basis of the link relationships among the clusters. Each clustering step reduces the number of clusters by one. The process repeats until no further clustering is possible, that is, until no link relationship remains between any pair of clusters.

FIG. 13 shows how clusters are formed as clustering proceeds. Two merged clusters pool their link counts and re-form joint links with the other clusters to which either of them was linked.

For clustering, the ROCK algorithm uses goodness rather than the raw link count. If clusters were merged by raw link count alone, a larger cluster would have more links to its neighboring clusters, so large clusters would keep growing in a self-reinforcing cycle. To prevent this, goodness is defined by normalizing the link count, as in Equation 7.

n_i and n_j denote the number of cross links, and θ is the threshold applied earlier to generate the adjacency matrix. In this work the function f(θ) is defined as (1-θ)/(1+θ).
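Equation 7 itself is not reproduced in this text. The sketch below assumes it follows the goodness measure of the original ROCK algorithm, in which the link count is divided by the expected number of cross links and n_i, n_j are taken as the sizes of the two clusters. Both the formula and that reading of n_i and n_j are assumptions, and the function names are illustrative.

```python
def f(theta):
    """f(theta) = (1 - theta) / (1 + theta), as stated in the text."""
    return (1.0 - theta) / (1.0 + theta)


def goodness(link, n_i, n_j, theta):
    """Assumed ROCK goodness: link count over expected cross links."""
    if not 0.0 <= theta < 1.0:
        # theta = 1 makes the exponent 1 and the denominator 0.
        raise ValueError("theta must satisfy 0 <= theta < 1")
    e = 1.0 + 2.0 * f(theta)
    expected = (n_i + n_j) ** e - n_i ** e - n_j ** e
    return link / expected


small = goodness(link=2, n_i=1, n_j=1, theta=0.5)
large = goodness(link=2, n_i=3, n_j=3, theta=0.5)
```

With the same link count, the pair of larger clusters scores lower, which is the effect the text describes: large clusters are kept from absorbing everything.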

2. The network traffic transition model and the botnet detection process that uses it are described below.

A. Extracting individual traffic state information

The state of a single packet can be derived from the field information in its protocol headers and used for intrusion detection. As shown in Tables 7 and 8, that state information has a 9-bit state value composed of Dir (transmission direction), TCP-flag, and IP-flag. The transmission direction is a binary distinction between (server→client) and (client→server).

This embodiment extracts the state information of a representative traffic flow group rather than of individual packets; in this case the state value consists of 7 bits.

Attribute | Measure | Measure value | State component
Protocol | Protocol | TCP (0) / UDP (1) | PT
Port | Internal port | Random port (0) / Reserved port (1) | LP
Port | External port | Random port (0) / Reserved port (1) | RP
Port | Trust criterion | At or above the minimum number of connection flows (0) / Below the minimum number of connection flows (1) | OP
Traffic | Connection success | Unidirectional communication (0) / Bidirectional communication (1) | RS
Traffic | Packet count comparison | Internal >= external (0) / Internal < external (1) | PC
Traffic | Byte count comparison | Internal >= external (0) / Internal < external (1) | DC

7-bit state information
Group | Protocol | Port | Port | Port | Traffic | Traffic | Traffic
State component | PT | LP | RP | OP | RS | PC | DC
Component value | 64 | 32 | 16 | 8 | 4 | 2 | 1

Suppose that a traffic flow group holding a sufficiently large number of connection flows communicates over UDP; that it consists of traffic between random internal ports and a specific reserved external port; that each connection flow in the cluster communicates bidirectionally following a successful connection request; and that the packet and byte counts of the outgoing traffic exceed those of the incoming traffic. The state value of this representative traffic flow group is then obtained as shown in FIG. 14.

In this case the state of the traffic flow group, expressed as a bit string, is 1010100₍₂₎, that is, a state value of 84₍₁₀₎. In this way each cluster can be represented by one of 128 (2⁷) state values, from 0 to 127.
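The encoding of this example can be checked with a short sketch. The component names and weights come from the two tables above; the function name `encode_state` and the dictionary layout are illustrative assumptions.

```python
# Bit weights of the seven state components (PT is the most significant bit).
WEIGHTS = {"PT": 64, "LP": 32, "RP": 16, "OP": 8, "RS": 4, "PC": 2, "DC": 1}


def encode_state(bits):
    """Combine seven 0/1 component values into a state value in 0..127."""
    if set(bits) != set(WEIGHTS):
        raise ValueError("all seven components must be given")
    return sum(WEIGHTS[name] * bit for name, bit in bits.items())


# The example from the text: UDP, random internal port, reserved external
# port, enough connection flows, bidirectional, internal >= external for
# both packet and byte counts.
example = {"PT": 1, "LP": 0, "RP": 1, "OP": 0, "RS": 1, "PC": 0, "DC": 0}
state = encode_state(example)
bit_string = format(state, "07b")
```

Here `state` is 84 and `bit_string` is "1010100", matching the values given in the text.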

B. Multi-state transition probability modeling

1) Traffic state transition information

The state transition matrix is generated from the state information. It stores the transitions from a previous state to the observed (current) state in a traffic flow. Its rows are previous states and its columns are current states, and it records the statistical frequency of each state transition. FIG. 15 shows the frequency matrix for the state transitions of the following traffic flow.

(64→42→84→126→21→126→21→86→21→23→84)

In this way, each traffic flow has a frequency matrix of its state transition information, and statistical learning is carried out on the basis of that information.
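Counting the transitions of the flow above can be sketched as follows. A sparse counter keyed by (previous state, current state) stands in for the 128×128 frequency matrix; that representation and the name `transition_counts` are assumptions made for illustration.

```python
from collections import Counter


def transition_counts(states):
    """Count each (previous state, current state) pair in a state sequence."""
    return Counter(zip(states, states[1:]))


flow = [64, 42, 84, 126, 21, 126, 21, 86, 21, 23, 84]
counts = transition_counts(flow)
```

The eleven states yield ten transitions, and the only pair that occurs twice is 126→21.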

2) Transition model based on the Markov chain process

Using frequency matrices has the drawback that a separate matrix must be kept for every traffic flow. A method that generates a single state transition matrix for all flow data, based on Markov model theory, is therefore needed.

A Markov model contains the transition probabilities for each state. A specific state transition model is proposed here, based on the ergodic model shown in FIG. 16. The model captures the statistical causality among the states of packets that occur sequentially in a traffic flow.

The transition probability from the previous state p to the current state o is defined as follows.

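$$a_{po} = P(S_t = o \mid S_{t-1} = p), \quad 1 \le p, o \le N, \quad \text{and} \quad \sum_{o=1}^{N} a_{po} = 1$$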

In Equation 8, N is the total number of possible states. The probability of the initial state p is defined as follows.

$$\pi_p = P(S_1 = p), \quad \sum_{p=1}^{N} \pi_p = 1$$

The transition probability itself is computed from frequencies, as follows.

$$a_{po} = \frac{Fr(p \to o)}{\sum_{o'=1}^{N} Fr(p \to o')}$$

In Equation 10, the denominator is the total frequency of all current states that occurred following the previous state p. It is used to express the relationship with the previous state. If the frequency were instead divided by the total frequency of all previous states, the value would express the relationship of the current state to all previous states. That operation, however, would produce transition probabilities out of state p that do not sum to 1, contradicting the transition probability of Equation 8.

By the Markov property, the probability of a sequence S = {S₁, S₂, ..., S_T} can be defined as in Equation 11,

which can be rewritten as Equation 12.

The probability defined in Equation 12 is the likelihood, which can be used as the criterion for model recognition. The probability of the initial state p is ignored here. The likelihood L can therefore be defined as follows.

However, multiplying probabilities causes the likelihood to underflow. The log-likelihood is therefore defined as in Equation 14.

The log-likelihood is the sum of the transition log-probabilities over the states. It serves as an optimal criterion for model matching, and the model built this way contains multi-state transition information expressed as a probability matrix.
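Equations 11 to 14 are not reproduced in this text. Under the usual first-order Markov chain definitions, the chain of steps described above can be written out as below. The symbols π (initial-state probability), a (transition probability), and LL (log-likelihood) are notation assumed here for illustration, so this is a sketch of the derivation rather than the original equations.

```latex
\begin{align}
P(S) &= P(S_1)\prod_{t=2}^{T} P(S_t \mid S_{t-1})
      && \text{(Markov property)} \\
     &= \pi_{S_1}\prod_{t=2}^{T} a_{S_{t-1}S_t}
      && \text{(in terms of model parameters)} \\
L    &= \prod_{t=2}^{T} a_{S_{t-1}S_t}
      && \text{(initial-state probability ignored)} \\
LL   &= \log L = \sum_{t=2}^{T} \log a_{S_{t-1}S_t}
      && \text{(avoids underflow)}
\end{align}
```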

FIG. 17 shows an example of generating a probability matrix. For example, the following three traffic connection flows can take a total of four states (0, 1, 2, 3).

Traffic connection flow #1 = {2,2,2,3,3,1} : 2→2→2→3→3→1

Traffic connection flow #2 = {2,2,2,3,3,0,1} : 2→2→2→3→3→0→1

Traffic connection flow #3 = {3,3,2,0,1} : 3→3→2→0→1

If a frequency is 0, a minimum value is assigned so that every possible state transition is covered. The minimum frequency Fr_min should be a very small value, such as 0.001.

As a further example, the log-likelihood between an arbitrary traffic flow {2,2,3,0,1} and the probability matrix of FIG. 17 is calculated as follows. The larger the log-likelihood, the greater the similarity.
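The construction of the probability matrix and the log-likelihood calculation can be sketched as follows. Several details are assumptions, since FIG. 17 and the worked calculation are not reproduced here: zero frequencies are replaced by Fr_min before each row is normalized, the natural logarithm is used, and the function names are illustrative.

```python
import math
from collections import Counter

FR_MIN = 0.001  # minimum frequency substituted for unseen transitions


def probability_matrix(flows, n_states, fr_min=FR_MIN):
    """Row-normalized transition probabilities; rows are previous states."""
    counts = Counter()
    for flow in flows:
        counts.update(zip(flow, flow[1:]))
    matrix = []
    for p in range(n_states):
        # Counter returns 0 for unseen pairs, which "or" replaces with fr_min.
        row = [counts[(p, o)] or fr_min for o in range(n_states)]
        total = sum(row)
        matrix.append([fr / total for fr in row])
    return matrix


def log_likelihood(flow, matrix):
    """Sum of the log transition probabilities along the flow."""
    return sum(math.log(matrix[p][o]) for p, o in zip(flow, flow[1:]))


flows = [[2, 2, 2, 3, 3, 1], [2, 2, 2, 3, 3, 0, 1], [3, 3, 2, 0, 1]]
P = probability_matrix(flows, n_states=4)
ll_seen = log_likelihood([2, 2, 3, 0, 1], P)
ll_unseen = log_likelihood([1, 0, 2, 1], P)
```

The flow {2,2,3,0,1} uses only transitions that occur in the training flows, so its log-likelihood is much higher than that of a flow built from unseen transitions such as {1,0,2,1}.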

C. Model matching methodology

Before the intrusion detection method based on model matching is presented, the ratio to the normal likelihood is computed in order to normalize the likelihood; the log-likelihood ratio is defined as in Equation 15.

$$LLR_M = \log \frac{L_M}{L_N}, \quad M \in \{A_1, A_2, \ldots, A_k, A, N\}$$

Finally, the detection system proposed in this embodiment has the structure of FIG. 1 described above. A maximum log-likelihood ratio (MLLR) greater than the threshold indicates similarity to an attack flow, as expressed in Equation 17.

MLLR = max(LLR), MLLR ≥ 0

FIG. 18 shows the detailed matching process based on LLR comparison. An MLLR of 0 means that the input traffic matched the normal model, so the MLLR cannot be less than 0. If the maximum LLR is obtained when matching against the mixed model of all attacks, the input traffic does not correspond to any specific attack but is not normal traffic either. It is therefore classified as abnormal traffic that may be an unknown attack.
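The matching decision can be sketched as below. The sketch assumes that the LLR of a model is its log-likelihood minus the log-likelihood of the normal model, which is consistent with an MLLR of 0 meaning a match with the normal model. The model keys ("N" for normal, "A" for the mixed model of all attacks), the returned labels, and the function name `match` are hypothetical.

```python
def match(log_likelihoods, threshold, normal="N", mixed="A"):
    """Classify input traffic from its log-likelihood under each model.

    log_likelihoods maps a model name to the log-likelihood of the input.
    Returns (label, mllr).
    """
    ll_normal = log_likelihoods[normal]
    llr = {m: ll - ll_normal for m, ll in log_likelihoods.items()}
    best = max(llr, key=llr.get)
    mllr = llr[best]  # never below 0, because llr[normal] == 0
    if mllr <= threshold:
        return "normal", mllr
    if best == mixed:
        return "unknown attack", mllr
    return best, mllr


known = match({"N": -10.0, "A1": -4.0, "A2": -12.0, "A": -6.0}, threshold=0.2)
benign = match({"N": -3.0, "A1": -9.0, "A": -5.0}, threshold=0.2)
unknown = match({"N": -10.0, "A1": -9.0, "A": -5.0}, threshold=0.2)
```

In the first call the individual attack model A1 gives the largest ratio, so the traffic is classified as that attack; in the third, the mixed model wins, so the traffic is flagged as a possible unknown attack.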

In this embodiment, detection is based on the state transition information of the unit traffic flow groups that make up the overall real-time traffic. Detection based on individual sessions or connection flows is poorly suited to traffic such as that of botnets, which generates large volumes of similar packets or whose malicious behavior is hard to identify until an actual attack occurs.

It is therefore preferable to move beyond the view that watches only single traffic flows, and instead to use a traffic modeling method and matching technique that distinguish the stepwise attributes of the overall traffic and are based on how the traffic flow groups transition.

<Experimental Example>

1) Experimental sample composition

Sinit and Phatbot, which are no longer active, were excluded from the experiment. Nugache, listed in Table 9, had also gradually gone quiet and was no longer active enough to generate much traffic, so it was tested on the communication traffic between two peers in an internal test environment built for the purpose.

Attack type | Sample binary MD5
SpamThru | d844d871225484e3fd90c1ffe1e5706a
SpamThru | 6183f1a6d780c70c95d20798090b0828
Nugache | 0c859cfad2fa154f007042a1dca8d75b
Nugache | 1720155bf90614866392b6b655d15cbe
Nugache | 74600e5bc19538a3b6a0b4086f4e0053
Nugache | 9007e2a98f5e0399a798e35338e63d98
Peacomm | dc49dfc97d9698ebede280e8daa7fd7b
Peacomm | 562d6dad245497e6c95d1bb33e4bedda
Peacomm | 9441cab2287dd6a5713e9e285c862e5f

2) Experimental data composition

Two-thirds of all source data were used for training and the remaining one-third for testing (detection). The unit traffic count in Table 10 refers to the continuous series of traffic generated by emulating one attack sample from its initial execution through to process termination.

Attack type | Unit traffic count | Data composition: training | Data composition: test
SpamThru | 157 | 105 | 52
Nugache | 120 | 80 | 40
Peacomm | 151 | 101 | 50

B. Analysis of traffic characteristics and detection results

1) Analysis of the traffic to be detected

In its initial stage, SpamThru attempts to connect to active external peers; before doing so, it downloads an antivirus program from an arbitrary website in order to suppress the activity of other malware. FIGS. 19 and 20 show the change in traffic at this point: traffic from receiving the file rises sharply at the start of execution.

A continuing peer discovery process follows, during which some peers may not respond; in that case the number of outgoing packets grows beyond the number of incoming packets, as shown in FIGS. 21 and 22. Because only TCP is used, the overall packet counts do not differ greatly, but in terms of data size the incoming traffic is far larger. This indicates that most outgoing packets were ACK responses carrying no data. The volume of traffic also grows somewhat over time, because the bot exchanges further peer lists with the peers it has connected to. Along with the peer list exchange it also receives the list of template servers. Once these steps are complete, and before preparing to launch the spam attack in earnest, SpamThru runs a brief check of its own SMTP capability. From this stage on, traffic declines somewhat and the attack preparation is concluded.

Looking at the change in traffic by protocol, the SpamThru botnet communicates using its own P2P protocol. During the initial antivirus download, however, it communicates with a web server over HTTP, as shown in FIG. 23, and during the subsequent preparation for sending spam it uses SMTP together with DNS for mail-address domain queries, as shown in FIG. 24.

FIGS. 25 and 26 show the distribution of ports used during communication. Files are received from port 80 of the web server, and ports are used randomly during the subsequent peer discovery and communication. During spam sending, port 25 (SMTP) and port 53 (DNS queries) are mainly used.

When the Nugache sample was emulated, outgoing traffic attempting connections was generated continuously, but almost no response packets arrived as incoming traffic. This is because the servers that would deliver the updated list of infected peers have been shut down, making activity difficult. The botnet has thus been found to be no longer effective. The likely main cause is its weakness of attempting to connect to a list of predefined server IPs hard-coded in the binary. In this experiment, an internal test environment was built, and the traffic exchanged between two peers was extracted and analyzed.

FIGS. 27 and 28 show the change in traffic of the Nugache botnet. Outgoing traffic is generally much larger than incoming traffic, which is attributed to there being almost no responding peers and to the bot sending its own peer information to the single connected peer. The Nugache botnet uses its own encrypted P2P communication, based solely on TCP.

Looking at the distribution of ports used, the bot communicates mainly with TCP port 8, and as shown in FIG. 29 the internal port numbers used for peer discovery increase monotonically in a random manner.

Peacomm generates a large volume of eDonkey/Overnet traffic and searches for its own query codes. The queries, however, do not correspond to actual files. A notable feature in FIGS. 30 to 32 is that incoming and outgoing traffic are somewhat inversely related in both packet count and data size. This is not because files are being downloaded: Peacomm does not, as a rule, download files from eDonkey. It merely needs information from peers to locate the site from which to download a file. In such cases, lists of peers are sent in response to the search query. Most Peacomm traffic consists of such file searches, forwarding of neighboring peer lists, and responses to search results.

Most Peacomm traffic is UDP, because the eDonkey/Overnet connection protocol, which dominates peer discovery and file search, is based on UDP. Occasionally HTTP traffic is generated to download a file from the web-server path returned in a response; such file transfer traffic occurs only very intermittently.

FIG. 33 shows the per-port distribution of Peacomm botnet traffic. As in the sample analysis, the bot mainly chooses one of its own UDP ports 4000, 7871, or 11271 to communicate with eDonkey peers; in this emulation it used port 11271.

2) Analysis of detection results

FIG. 34 shows the detection performance of the multi-state transition probability model on the traffic analyzed above as a function of the likelihood-ratio threshold. In this experiment a detection model was generated for each individual attack and used for matching; the results at an MLLR threshold of 0.2 are shown in Table 11. In practice, SpamThru was detected mostly in the traffic of the 'SMTP capability check' stage, which is nearly the last step of attack preparation.

Botnet | Attacks | Detected | Detected: classified | Detected: unclassified | Missed | False positives | Detection rate | Classification rate | Miss rate | False positive rate
SpamThru | 52 | 50 | 47 | 3 | 2 | - | 96.15% | 90.39% | 3.85% | -
Nugache | 40 | 38 | 37 | 1 | 2 | - | 95.00% | 92.50% | 5.00% | -
Peacomm | 46 | 46 | 46 | 0 | 0 | - | 100.00% | 100.00% | 0.00% | -
Normal traffic | 764 | - | - | - | - | 22 | - | - | - | 2.88%

FIG. 35 shows the detection results based on the unified model of all attack patterns; the results at an MLLR threshold of 0.2 are shown in Table 12. Detection with the individual per-attack models performed better than detection with the unified model.

Botnet | Attacks | Detected | Missed | False positives | Detection rate | Miss rate | False positive rate
SpamThru | 52 | 48 | 4 | - | 92.31% | 7.69% | -
Nugache | 40 | 37 | 3 | - | 92.50% | 7.50% | -
Peacomm | 46 | 45 | 1 | - | 97.83% | 2.17% | -
Normal traffic | 764 | - | - | 39 | - | - | 5.11%

본 실험에서는 비록 전체 트래픽이 하나의 탐지 단위가 되는 이유로 많은 양의 공격 데이터가 구성되지는 못했지만 탐지와 더불어 분류 성공률 또한 비교적 높게 측정되었다는 점에서 향후 연구의 발전 가능성을 제시한다.In this experiment, although a large amount of attack data was not constructed because the total traffic was one detection unit, this study suggests the possibility of future research in that the success rate of classification was also measured relatively high.

다. 확률분포거리 기반 탐지 결과 분석All. Probability distribution distance based detection result analysis

학습된 공격의 모든 전이 과정들이 탐지 성능 향상에 관여하지는 않는다. 경우에 따라서 정상 패턴과 유사한 전이 과정들은 탐지 성능에 부정적인 영향을 미치기도 한다. 본 실험에서는 확률분포거리(Probability Distribution Distance, PDD)에 기반한 유효한 탐지 척도 선정 기법을 적용하여 선정된 전이 과정만을 매칭에 이용하였을 때 탐지 성능에 미치는 영향을 분석하고자 한다.Not all transitions of a learned attack are involved in improving detection performance. In some cases, transition processes similar to normal patterns may have a negative effect on detection performance. In this experiment, we apply the effective detection scale selection method based on probability distribution distance (PDD) to analyze the effect on detection performance when only selected transition process is used for matching.

탐지에 유효한 척도의 선정 방법은 학습과정에서의 실험적 분석 결과를 이용한 방법이 주로 이용되어 왔다. 이는 여러 차례 척도를 변경해가며 규칙을 추출하고 그 규칙을 이용하여 탐지율을 점검해 봄으로써 해당 척도의 유효성을 판단하게 되며, 다시 척도를 변경하여 실험하는 과정의 반복을 통하여 최종 척도를 선정하게 되는 경험에 의한 방법을 수반한다. 하지만 학습을 통한 검증과정에서의 탐지율이 실제 실험에서의 탐지율을 대변한다고는 볼 수 없고 정상과 공격에서 척도들이 가지는 서로 다른 의미를 무시한 탐지 결과만을 고려하는 실험적인 방법으로서 그 기준이 모호하다.As a method of selecting a valid measure for detection, a method using experimental analysis results in the learning process has been mainly used. This is based on the experience of selecting the final scale by repeating the experiment by changing the scale several times and extracting the rule and checking the detection rate using the rule to check the detection rate. Involves the method. However, the detection rate in the verification process through learning does not represent the detection rate in the actual experiment, and the standard is ambiguous as an experimental method considering only the detection result ignoring the different meanings of the scales in the normal and the attack.

본 실험에서는 각 척도마다 그 중요도에 따른 가중치를 부여하여 공격의 특징을 가장 효과적으로 표현하는 척도들을 식별하고 최적의 탐지 척도들을 선정한다. 공격 데이터에서의 척도 값 분포와 정상 데이터에서의 척도 값 분포 사이의 거리를 측정하며 거리가 큰 척도일수록 공격 분별력이 높은 척도로 인지한다. 즉, 각 척도에 대한 공격과 정상 데이터 사이의 PDD를 구하고, 이를 척도의 가중치로 적용하였다. 본 실험에서는 정상행위와 공격행위 분포 사이에서 각 척도들의 PDD를 구하기 위하여 상대 엔트로피(relative entropy)로 정의되는 쿨백-라이블러 거리(Kullback-Leibler distance)를 활용하였다. 쿨백-라이블러 엔트로피는 다양한 응용분야에서 적용되고 있는 거리 계산방법으로 정보이론에 기반하고 있다. 이 엔트로피 값은 두 분포 사이의 거리를 수학식 17과 같이 정의한다.In this experiment, each measure is weighted according to its importance to identify the measures that most effectively represent the characteristics of the attack, and to select the optimal detection measures. The distance between the distribution of the scale values in the attack data and the distribution of the scale values in the normal data is measured. The larger the distance is, the more the attack discrimination is recognized. That is, the PDD between the attack and the normal data for each scale was obtained and applied as the weight of the scale. In this experiment, we used the Coolback-Leibler distance, which is defined as relative entropy, to calculate the PDD of each scale between normal behavior and attack behavior distribution. Coolback-Libler entropy is based on information theory as a distance calculation method applied in various applications. This entropy value defines the distance between two distributions as shown in Equation 17.

X is the vector set of detection measures, p is the probability distribution of normal behavior, and q_i is the probability distribution of attack behavior i. Each measure is f ∈ {measure₁, measure₂, ..., measure_m}, and each attack is i ∈ {attack₁, attack₂, ..., attack_n}. Equation 17 holds under the assumption that the measures are mutually independent.
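For reference, the standard textbook form of the Kullback-Leibler distance can be written in the notation of this section as follows. This is a sketch and may differ in detail from Equation 17; the per-measure subscript f (as in p_f and q_{i,f}) is notation assumed here for illustration, and the base of the logarithm is left unspecified.

```latex
D(p \,\|\, q_i) = \sum_{x \in X} p(x)\,\log\frac{p(x)}{q_i(x)}

% If the measures are mutually independent, the joint distributions factorize
% and the distance decomposes into a sum of per-measure distances:
D(p \,\|\, q_i) = \sum_{f} D(p_f \,\|\, q_{i,f})
                = \sum_{f} \sum_{x_f} p_f(x_f)\,\log\frac{p_f(x_f)}{q_{i,f}(x_f)}
```

The second line is why the independence assumption matters: it allows a distance to be computed separately for each measure.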

Because the PDD taken with normal behavior as the reference differs from the PDD taken with attack behavior as the reference, the distance is not symmetric. The PDD used in this paper as the final reference distance of a measure is therefore defined as the maximum of the sum of the distances between the two distributions, which gives a symmetric entropy.

In the equation above, D is the reference distance of a measure, which quantifies the distance (dissimilarity) between two entities; the maximum distance between the attack and normal behavior distributions is used as the PDD in this paper.
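The symmetric distance described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it reads the PDD of one measure as the largest value of D(p‖q_i) + D(q_i‖p) over all attacks i, uses the natural logarithm, and adds a small smoothing constant to avoid division by zero. The function names `smooth`, `kl_distance`, and `pdd` and the sample distributions are hypothetical.

```python
import math


def smooth(dist, eps=1e-9):
    """Add a small constant to every bin and renormalize (assumed smoothing)."""
    total = sum(dist) + eps * len(dist)
    return [(x + eps) / total for x in dist]


def kl_distance(p, q):
    """Kullback-Leibler distance D(p || q), in nats, of two discrete distributions."""
    p, q = smooth(p), smooth(q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def pdd(normal, attacks):
    """Largest symmetric distance between the normal distribution and any attack.

    Assumed reading of the text: max over attacks i of D(p||q_i) + D(q_i||p).
    """
    return max(kl_distance(normal, q) + kl_distance(q, normal) for q in attacks)


# Made-up value distributions of a single measure, for illustration only.
normal = [0.5, 0.5]
attacks = [[0.9, 0.1], [0.6, 0.4]]

print("D(normal || attack1):", kl_distance(normal, attacks[0]))
print("D(attack1 || normal):", kl_distance(attacks[0], normal))
print("PDD of the measure:  ", pdd(normal, attacks))
```

The two directed distances printed for the first attack differ, which is the asymmetry the text refers to; summing both directions removes it.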

Table 13 shows a partial example of detection measure selection using the PDD. First, the PDD value of each measure is computed and stored; then only the measures whose PDD is greater than or equal to an arbitrarily chosen threshold distance are selected as detection measures. Here, a measure refers to a transition process in this experimental example.

FIG. 36 shows how the number of selected measures changes as the threshold PDD varies. The threshold distance and the number of measures are inversely related, and beyond a certain distance the decreasing curve flattens sharply, which indicates that many relatively unimportant measures are removed at once. The threshold distance should be tuned with the correlation between detection performance and detection rate in mind.
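The threshold-based selection and its effect on the number of retained measures can be sketched as follows. This is an illustrative example only: the transition names and PDD values are made up, and `select_measures` is a hypothetical helper, not a function from the source.

```python
def select_measures(pdd_by_measure, threshold):
    """Keep the measures whose PDD is greater than or equal to the threshold."""
    return [name for name, value in pdd_by_measure.items() if value >= threshold]


# Hypothetical PDD values, keyed by state transition (illustrative numbers only).
pdd_by_measure = {
    "S1->S2": 72.0,
    "S2->S3": 67.0,
    "S3->S1": 12.5,
    "S2->S2": 3.1,
}

# Sweep the threshold: the number of selected measures never increases.
for threshold in (0, 10, 50, 67, 70):
    selected = select_measures(pdd_by_measure, threshold)
    print(threshold, len(selected), selected)
```

Raising the threshold can only shrink the selected set, so the trade-off is between discarding uninformative transitions and losing valid ones.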

FIG. 37 shows the detection results for the SpamThru botnet with the PDD applied. Only the integrated detection model was tested with the PDD applied. The best performance is obtained when the threshold PDD between the transition-process distributions of SpamThru attack traffic and normal traffic is 67. With a threshold PDD of 68, performance instead degrades, which means that some valid measures were eliminated. These results also show a clear improvement over the performance in Figure 5.17.

This experiment demonstrated the effectiveness of selecting detection measures using the PDD. The approach is expected to further improve the ability to discriminate between attack and normal traffic. It also helps in understanding the characteristics of attack traffic and makes it easier to analyze the causes of detection results.

FIG. 1 is a block diagram of a system that performs the botnet detection process according to the present invention.

FIG. 2 is a graph of the packet increase in communication traffic between a SpamThru bot and its individual peers.

FIG. 3 is an exemplary diagram showing packet retransmission in the TCP protocol.

FIG. 4 is an exemplary diagram showing packet retransmission in the UDP protocol.

FIG. 5 is an exemplary diagram of forward alignment of the gap penalty matrix.

FIG. 6 is an exemplary diagram of reverse alignment of the gap penalty matrix.

FIG. 7 is an exemplary diagram of mixed alignment of the gap penalty matrix.

FIG. 8 is a graph of the transition directions of two sequences under mixed alignment of the gap penalty matrix.

FIG. 9 is an exemplary table of gap penalty matrix results.

FIG. 10 is a Venn diagram showing the mixed-aligned union.

FIG. 11 is an exemplary diagram of a generated adjacency matrix.

FIG. 12 is an exemplary diagram of a generated link matrix.

FIG. 13 is an exemplary diagram showing connections being re-formed through the link relationships between clusters.

FIG. 14 is an exemplary diagram of the process of computing the state values of a representative traffic flow.

FIG. 15 is an exemplary table of the frequency matrix of state transition information.

FIG. 16 is an exemplary diagram of an ergodic model.

FIG. 17 is an exemplary diagram of the multi-state transition probability matrices of three traffic connection flows.

FIG. 18 is an exemplary diagram showing the model matching process being performed.

FIG. 19 is a graph of the number of packets sent and received in SpamThru botnet traffic #1.

FIG. 20 is a graph of the number of bytes sent and received in SpamThru botnet traffic #1.

FIG. 21 is a graph of the number of bytes sent and received in SpamThru botnet traffic #2.

FIG. 22 is a graph of the number of bytes sent and received in SpamThru botnet traffic #2.

FIG. 23 is a graph of the number of packets per protocol in SpamThru botnet traffic #1.

FIG. 24 is a graph of the number of packets per protocol in SpamThru botnet traffic #2.

FIG. 25 is a graph of the distribution by port of SpamThru botnet traffic #1.

FIG. 26 is a graph of the distribution by port of SpamThru botnet traffic #2.

FIG. 27 is a graph of the number of packets sent and received in Nugache botnet traffic.

FIG. 28 is a graph of the number of bytes sent and received in Nugache botnet traffic.

FIG. 29 is a graph of the distribution by send and receive port of Nugache botnet traffic.

FIG. 30 is a graph of the number of transmissions and receptions in Peacomm botnet traffic.

FIG. 31 is a graph of the number of bytes sent and received in Peacomm botnet traffic.

FIG. 32 is a graph of the number of packets per send and receive protocol in Peacomm botnet traffic.

FIG. 33 is a graph of the distribution by port of Peacomm botnet traffic.

FIG. 34 is a graph of ROC curves for the individual per-attack detection models.

FIG. 35 is a graph of the ROC curve for the integrated detection model.

FIG. 36 is a graph of the number of measures selected as the probability distribution distance changes.

FIG. 37 is a graph of the ROC curve for the SpamThru botnet with the PDD applied under the integrated detection model.

Claims

A P2P-based botnet detection method using a network traffic transition model, comprising:

collecting and clustering traffic connection flows on a network to generate traffic flow groups, aggregating information on the state transitions of the traffic flow groups, and forming a multi-frequency matrix from the aggregated information to perform a learning process;

obtaining a multi-state transition probability model by generating, through the learning process, a probability matrix for botnet attacks and a probability matrix for normal traffic; and

performing model matching between the probability model and state transition information collected in real time from the traffic of the network, and detecting whether a malicious botnet exists by comparing the result value of the model matching with a preset threshold.

The method of claim 1,

wherein the multi-state transition probability model distinguishes between a group of probability models based on traffic state transitions of already discovered botnets and a group of probability models based on traffic state transitions of botnets not yet discovered, and

wherein, when the model matching detects the existence of a botnet, the detection is treated as misuse detection of a known attack if the detected botnet belongs to the model group for already discovered botnets, and as anomaly detection if it belongs to the model group for botnets not yet discovered.

The method of claim 1, wherein the multi-state transition probability model is generated by a Markov chain process.

The method of claim 1, wherein the model matching is performed by a log-likelihood ratio test that computes the ratio of a likelihood to the normal likelihood in order to generalize the likelihood.

The method of claim 1, wherein, in performing the learning process, the collected traffic connection flows are divided into individual unit traffic connection flows, and similar unit traffic connection flows are grouped by sequential pattern according to their behavior characteristics.

The method of claim 5, wherein, in performing the learning process, a gap penalty matrix is applied to compare the similarity between unit traffic connection flows.

The method of claim 6,

A function for determining similarity between the unit traffic flows,

P2P-based botnet detection method using a network traffic transition model, characterized in that.

The method of claim 1, wherein the ROCK algorithm, a data mining clustering algorithm, is applied to cluster the traffic connection flows.

The method of claim 8, wherein, in a preprocessing step of the clustering, an adjacency matrix is generated from the result values of the similarity comparison, a link matrix is generated from the adjacency matrix, and the number of links in the link matrix is obtained as link = adjacency matrix (horizontal axis) × adjacency matrix (vertical axis).

The method of claim 9,

The ROCK algorithm applies a fitness measure for clustering the traffic connection flows, the fitness measure being applied by generalizing the number of links, which is

P2P-based botnet detection method using a network traffic transition model, characterized in that it is calculated by.

The method of claim 3, wherein

The model generated by the Markov chain process presents a state transition model based on an ergodic model, and based on this ergodic model

P2P-based botnet detection method using a network traffic transition model, characterized in that the log-likelihood is calculated, and the model thus constructed includes multi-state transition information represented as a probability matrix.

The method of claim 4, wherein

The log-likelihood ratio is

, M ∈ { A₁, A₂, ..., A_k, A, N }, and when the maximum log-likelihood ratio (MLLR) is greater than the threshold, the flow is determined to be similar to an attack flow. P2P-based botnet detection method using a network traffic transition model, characterized in that.