KR20110041149A

KR20110041149A - Method for detecting traffic anomaly using fisher linear discriminant

Info

Publication number: KR20110041149A
Application number: KR1020090098199A
Authority: KR
Inventors: 강철희; 박현희; 김미정
Original assignee: 고려대학교 산학협력단
Priority date: 2009-10-15
Filing date: 2009-10-15
Publication date: 2011-04-21
Also published as: KR101048924B1

Abstract

PURPOSE: A method for detecting abnormal traffic is provided that uses the Fisher linear discriminant to determine whether incoming traffic belongs to a normal group or an abnormal group. CONSTITUTION: It is determined where incoming traffic lies relative to a hyperplane (S50). If the traffic lies in the region below the hyperplane, including the hyperplane itself, it is classified into the normal group (S60). If the traffic lies in the region above the hyperplane, it is classified into the abnormal group (S70). It is then determined whether a predetermined update period has been reached (S80). If the update period has been reached, the oldest traffic vectors are discarded and a new hyperplane is formed that includes the traffic vectors newly generated during the update period (S90).

Description

Method for detecting traffic anomaly using Fisher linear discriminant

The present invention relates to a method for detecting abnormal traffic in a sub-network. More particularly, it relates to an abnormal traffic detection method using the Fisher linear discriminant in which traffic gathered during a collection period is used as data to form, by means of the chi-square distribution and the Fisher linear discriminant, a hyperplane that separates a normal group from an abnormal group, so that it can then be readily determined whether subsequently incoming traffic belongs to the normal group or to the abnormal group.

In today's communication networks, Internet-based applications such as multimedia streaming, P2P, real-time voice and video communication, and network games are growing alongside traditional Internet applications. Network traffic has consequently become more complex and diverse, and traffic monitoring and analysis have become increasingly important.

Across these complex and diverse networks, abnormal traffic and security incidents of various kinds are steadily increasing, and the resulting damage continues to occur. Abnormal traffic gives rise to various attacks and disruptions such as network traffic congestion, DoS attacks, and worms, and it arises in intelligent and composite forms, so more advanced techniques are needed to detect such anomalies early.

Accordingly, many previous studies have addressed methods for detecting abnormal traffic. The two representative approaches are the rule-based approach and the measurement-based approach.

The rule-based approach identifies patterns of abnormal traffic in advance and detects abnormal traffic by comparing the pattern of newly generated traffic against those known patterns.

Approaches of this kind, such as pattern matching, achieve a high detection rate for known attacks. However, they show a relatively low detection rate for new, unknown attacks, and because packets must be compared one by one, they are not suitable for high-speed, high-capacity networks such as backbones.

To overcome these drawbacks, measurement-based approaches start from the premise that Internet backbone traffic is periodic. Proposed methods include predicting future traffic by modeling traffic from simple traffic information available on the network, and modeling traffic by sampling only the useful, specific items from the large amount of traffic information.

However, detection through modeling only distinguishes whether traffic is normal or not, and has the drawback that several further steps are needed to identify the type of anomaly.

Meanwhile, because traffic monitoring plays an important part in detecting abnormal traffic, it is important to define parameters that represent the characteristics of traffic well and to develop a traffic analysis system.

The present invention was devised to solve the problems of the prior art described above. Its object is to provide an abnormal traffic detection method using the Fisher linear discriminant in which traffic gathered during a collection period is used as data to form, by means of the chi-square distribution and the Fisher linear discriminant, a hyperplane that separates a normal group from an abnormal group, so that it can then be readily determined whether subsequently incoming traffic belongs to the normal group or to the abnormal group.

To solve the above problems, the abnormal traffic detection method of the present invention comprises: a first step of collecting traffic from a router at a specific interval during a specific collection period, generating traffic vectors each composed of a plurality of components representing the characteristics of the collected traffic, and, using the traffic vectors as data, forming by the Fisher linear discriminant a hyperplane that separates a normal group from an abnormal group; a second step of, after the hyperplane is formed, classifying incoming traffic into the normal group if it lies in the region below the hyperplane, including the hyperplane itself, and into the abnormal group if it lies in the region above the hyperplane; a third step of, after the classification, determining whether a preset update period has been reached; and a fourth step of, if the update period has been reached, discarding the oldest traffic vectors, namely those collected over a span equal to the update period, forming a new hyperplane that includes the traffic vectors newly generated during the update period, and returning to the second step.

Here, the components of each traffic vector are the bandwidth volume, the flow volume, and the packet volume.

Here, the distribution of each component of the traffic vectors over the collection period follows a chi-square distribution. A traffic vector is classified into the abnormal group if the volume of at least one component falls within the top 4% to 6% of that distribution, and into the normal group if none of its component volumes fall within the top 4% to 6%.

According to the method of the present invention for detecting abnormal traffic in a sub-network using the Fisher linear discriminant, traffic gathered during the collection period is used as data to form, by means of the chi-square distribution and the Fisher linear discriminant, a hyperplane that separates a normal group from an abnormal group. Subsequently incoming traffic is classified into the normal group if it falls in the region below the hyperplane and into the abnormal group if it falls in the region above it, so abnormal traffic can be readily identified among the incoming traffic.

In addition, because the hyperplane is updated periodically, the method can keep up with changing traffic trends.

Hereinafter, a preferred embodiment of the abnormal traffic detection method using the Fisher linear discriminant according to the present invention, which has the above objects, means, and effects, is described in detail with reference to the accompanying drawings.

From a network administrator's point of view, providing users with efficient, normal service requires detecting not only harmful traffic but also abnormal traffic that is not malicious yet interferes with network users receiving normal service.

Moreover, depending on the characteristics of the network, a given kind of abnormal traffic may be classified as normal traffic. For example, if FTP traffic in a certain network causes problems for other information transfers, that FTP traffic should be classified as abnormal traffic; and if, for the sake of network efficiency, only a certain level of FTP traffic can be allowed, then FTP traffic exceeding that level must be detected as abnormal traffic before it can be controlled.

From this perspective, it is necessary to group network traffic according to its characteristics and to classify which traffic group incoming traffic belongs to.

The method proposed in the present invention defines a traffic vector, a set of data whose components are characteristics of the traffic observed on the network; uses the traffic vectors to group traffic according to those characteristics; determines whether the groups are well separated; and then detects which group incoming traffic belongs to.

Before the abnormal traffic detection method using the Fisher linear discriminant is described in detail, the overall system to which the present invention is applied is briefly described with reference to FIG. 1.

The system shown in FIG. 1 is provided to implement the abnormal traffic detection method using the Fisher linear discriminant. That is, the present invention is implemented through a system composed of traffic monitoring means (10), analysis means (20), detection means (30), and control means (40).

The traffic monitoring means (10) broadly comprises a traffic capture unit (15), a traffic sampling unit (17), and a traffic storage and monitoring unit (19). The analysis means (20) comprises a storage unit (21), a traffic classification unit (23), and a mathematical analysis unit (25).

Raw traffic data can be collected by the traffic capture unit (15) through a router (13) connected to the network (11). The raw traffic data is in NetFlow v5 format. A NetFlow flow is inherently unidirectional and is identified by a 5-tuple: source address, destination address, source port, destination port, and protocol. In addition, it includes TOS, interface information, and AS information.

NetFlow applies this information as rules to generate and distinguish flows of various protocols. It is fast, and because the records are generated at the router as the traffic flows, the network administrator can receive the information immediately.
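As a minimal sketch of the flow concept described above, the following Python groups packets into unidirectional flows keyed by the 5-tuple. The `Packet` type, its field names, and the sample addresses are hypothetical and chosen only for illustration; in the described system the flow records are produced by the router, and the sketch ignores the TOS, interface, and AS fields.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical packet summary; field names are illustrative only.
    src: str
    dst: str
    src_port: int
    dst_port: int
    proto: int
    size: int  # bytes


def group_into_flows(packets):
    """Aggregate packets into unidirectional flows keyed by the 5-tuple."""
    flows = defaultdict(lambda: {"packets": 0, "octets": 0})
    for p in packets:
        key = (p.src, p.dst, p.src_port, p.dst_port, p.proto)
        flows[key]["packets"] += 1
        flows[key]["octets"] += p.size
    return dict(flows)


packets = [
    Packet("10.0.0.1", "10.0.0.2", 40000, 80, 6, 500),
    Packet("10.0.0.1", "10.0.0.2", 40000, 80, 6, 1500),
    # Reverse direction: counted as a separate flow because flows are unidirectional.
    Packet("10.0.0.2", "10.0.0.1", 80, 40000, 6, 1500),
]
flows = group_into_flows(packets)
```

Because the key is ordered (source first, destination second), the request and the reply of one connection end up in two different flows.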

The raw traffic data undergoes a predetermined sampling process in the traffic sampling unit (17) and is then stored and monitored by the traffic storage and monitoring unit (19).

The traffic stored in the traffic storage and monitoring unit (19) is processed and then stored in the storage unit (21) of the analysis means (20). That is, the storage unit (21) processes and stores, at a specific interval, the traffic held in the traffic storage and monitoring unit (19).

This processing is carried out under the control of the control means (40), and the processed traffic is stored in the storage unit (21) as an Excel file at each interval. The components of the traffic vector are then determined according to the characteristics of the traffic collected in the storage unit (21), and the traffic groups are determined according to the characteristics that the traffic vectors have in common. In the present invention, the classification unit (23) uses MS-SQL queries to establish the group criteria and creates two traffic groups, a normal group and an abnormal group.

The mathematical analysis unit (25) applies the chi-square distribution to establish precise criteria for the groups and forms a hyperplane by the Fisher linear discriminant.

After the hyperplane is formed, traffic that may be either normal or abnormal is received. The detection means (30) then uses the hyperplane to determine whether the incoming traffic belongs to the normal group or the abnormal group, and thereby detects abnormal traffic.

Meanwhile, because traffic trends change continuously, the mathematical analysis unit (25) periodically updates the hyperplane under the control of the control means (40) to reflect the trend of newly generated traffic. That is, if the update period is P, the traffic from the oldest period of length P is replaced by the traffic newly received during the latest period of length P, and a new hyperplane is formed.

The abnormal traffic detection method using the Fisher linear discriminant is implemented through the system described above.

The Fisher linear discriminant and the chi-square distribution, both central to the present invention, are described next.

The Fisher linear discriminant is a linear function capable of separating groups; it classifies groups by projecting multidimensional data onto the normal vector of a hyperplane ([A. Shashua, "On the Relationship Between the Support Vector Machine for Classification and Sparsified Fisher's Linear Discriminant," Neural Processing Letters, vol. 9, pp. 129-139, April 1999.], [R. Johnson and D. Wichern, Applied Multivariate Statistical Analysis, 6th ed., Prentice-Hall, 2007, pp. 576-593, 623-633.]).

Let $X$ be a data set consisting of vectors $x = (x_1, \ldots, x_n)^t$ with $n$ components. Assume that this data set is divided into $m$ groups, and denote the data set of group $j$ by $Y_j$. That is, $Y = \{Y_j\}_{j=1}^{m}$ is a collection of disjoint subsets of $X$ satisfying

$$\bigcup_{j=1}^{m} Y_j = X, \qquad \sum_{j=1}^{m} |Y_j| = |X|,$$

where $|Y_j|$ denotes the number of elements of $Y_j$. Let $u$ and $u_j$ be the means of $X$ and $Y_j$, respectively. The components of $u = (u_1, \ldots, u_n)^t$ and $u_j = (u_{j1}, \ldots, u_{jn})^t$ are given by

$$u_i = \frac{1}{|X|} \sum_{x \in X} x_i \qquad \text{and} \qquad u_{ji} = \frac{1}{|Y_j|} \sum_{x \in Y_j} x_i, \qquad i = 1, \ldots, n,$$

where $t$ denotes the transpose.

We now introduce the normal vector (direction vector) and define the within-group scatter matrix and the between-group scatter matrix. The goal is to find a normal vector $w$, perpendicular to a hyperplane

$$w^t x = d$$

(where $\|w\| = 1$ and $d$ is the distance from the origin to the hyperplane), such that when the group means $u_j$ are projected onto $w$ the projections have maximal variance, so that the groups are well separated.

The within-group scatter matrix and the between-group scatter matrix are denoted by $S_{intra}$ and $S_{inter}$, respectively. To define these matrices, let $S_j$, the scatter matrix of group $j$, be defined as

$$S_j = \sum_{x \in Y_j} (x - u_j)(x - u_j)^t. \tag{1}$$

Using this, $S_{intra}$ and $S_{inter}$ are defined as

$$S_{intra} = \sum_{j=1}^{m} S_j, \qquad S_{inter} = \sum_{j=1}^{m} |Y_j| \, (u_j - u)(u_j - u)^t.$$

Therefore, the sum over the groups of the normalized variance of each group's projected elements about that group's projected center is

$$\sum_{j=1}^{m} \sum_{x \in Y_j} \frac{(w^t x - w^t u_j)^2}{\|w\|^2} = \frac{w^t S_{intra} w}{\|w\|^2}, \tag{2}$$

where $\|w\|$ denotes the Euclidean norm. Computing the corresponding sum for the projected group centers about the projected overall center,

$$\sum_{j=1}^{m} |Y_j| \, \frac{(w^t u_j - w^t u)^2}{\|w\|^2},$$

gives

$$\frac{w^t S_{inter} w}{\|w\|^2}.$$

To separate the groups well, the data within each group should be clustered as tightly as possible, and the data of different groups should lie as far apart as possible. That is, the distinction between the groups is clear when the within-group value of Formula (2) is small relative to the corresponding between-group value. Therefore, using Formula (1) and Formula (2), the classical Fisher criterion function can be defined through the following optimization problem:

$$\max_{w \neq 0} \; J(w), \tag{OPT}$$

where

$$J(w) = \frac{w^t S_{inter} w}{w^t S_{intra} w}.$$

Since the solution of the optimization problem (OPT) is obtained by minimizing the denominator and maximizing the numerator, the traffic within each group should share the same characteristics, and the traffic of different groups should exhibit mutually different characteristics. The optimization problem is equivalent to the following eigenvalue problem:

$$S_{inter} \, w = \lambda \, S_{intra} \, w. \tag{3}$$

The eigenvector corresponding to the largest eigenvalue of Formula (3) is the normal vector $w$ sought ([A. Shashua, "On the Relationship Between the Support Vector Machine for Classification and Sparsified Fisher's Linear Discriminant," Neural Processing Letters, vol. 9, pp. 129-139, April 1999.]).
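The computation of the normal vector can be sketched in plain Python for the two-group case used in this method. With only two groups the between-group scatter matrix has rank one, so the eigenvector of the largest eigenvalue is proportional to the solution of (within-group scatter) · w = (abnormal mean − normal mean); the sketch solves that linear system instead of running a general eigenvalue solver. All function names and the toy data are assumptions made for illustration, not part of the patent.

```python
import math


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def scatter(vectors, center):
    """Scatter matrix: sum of (x - center)(x - center)^t over the vectors."""
    dim = len(center)
    s = [[0.0] * dim for _ in range(dim)]
    for x in vectors:
        diff = [xi - ci for xi, ci in zip(x, center)]
        for a in range(dim):
            for b in range(dim):
                s[a][b] += diff[a] * diff[b]
    return s


def solve(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("within-group scatter matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fisher_direction(normal, abnormal):
    """Unit normal vector pointing from the normal group toward the abnormal group."""
    u_norm, u_abn = mean(normal), mean(abnormal)
    s_norm, s_abn = scatter(normal, u_norm), scatter(abnormal, u_abn)
    dim = len(u_norm)
    s_intra = [[s_norm[a][b] + s_abn[a][b] for b in range(dim)] for a in range(dim)]
    diff = [p - q for p, q in zip(u_abn, u_norm)]
    w = solve(s_intra, diff)
    length = math.sqrt(sum(c * c for c in w))
    return [c / length for c in w]


# Toy 2-D data: the groups differ only along the first axis.
normal = [(0, 0), (1, 0), (0, 1), (1, 1)]
abnormal = [(4, 0), (5, 0), (4, 1), (5, 1)]
w = fisher_direction(normal, abnormal)
```

For this toy data the direction comes out along the first axis, which is the only axis on which the two groups differ. With more than two groups a generalized eigenvalue solver would be needed instead.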

Meanwhile, as described later, the daily traffic volume collected in the present invention was observed to follow, broadly, a normal distribution. By the following theorem, it is therefore reasonable to assume that traffic data collected over 30 days or more follows a chi-square distribution. The theorem states the relationship between the standard normal distribution and the chi-square distribution.

Theorem 1: If the random variables $Z_1, \ldots, Z_k$ follow the standard normal distribution $N(0,1)$ and are mutually independent, then $Z_1^2 + Z_2^2 + \cdots + Z_k^2$ follows a chi-square distribution with $k$ degrees of freedom. This is written $Z_1^2 + Z_2^2 + \cdots + Z_k^2 \sim \chi^2(k)$ ([Z. Zhang and H. Shen, "Online Training of SVMs for Real-time Intrusion Detection," in Proc. AINA 2004, pp. 568-573, March 2004.], [T. Hamada, K. Chujo, T. Chujo, and X. Yang, "Peer-to-peer traffic in metro networks: analysis, modeling, and policies," in Proc. IEEE/IETF NOMS 2004, pp. 425-438, April 2004.]).
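Theorem 1 can be checked numerically with a short simulation. The sketch below draws sums of $k$ squared standard normal values and compares the sample mean with $k$, the known mean of a chi-square distribution with $k$ degrees of freedom. The seed, sample size, and the choice $k = 3$ are arbitrary assumptions made for illustration.

```python
import random


def chi_square_sample(k, rng):
    """One draw of Z_1^2 + ... + Z_k^2 with independent Z_i ~ N(0, 1)."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


rng = random.Random(0)  # fixed seed so the run is repeatable
k = 3
samples = [chi_square_sample(k, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
# sample_mean should be close to k, the mean of a chi-square(k) distribution.
```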

Based on the Fisher linear discriminant and the chi-square distribution described above, and on the system described with reference to FIG. 1, the method for detecting abnormal traffic is now described with reference to the accompanying FIG. 2.

To detect abnormal traffic, the present invention uses the chi-square distribution and the Fisher linear discriminant to form a hyperplane that serves as the reference plane for readily determining whether incoming traffic is normal or abnormal.

To form the hyperplane, the traffic used as raw data is collected through a router (S10). This traffic is collected from the router during a specific collection period. As described later, in the embodiment of the present invention traffic is collected for 30 days.

The traffic is collected at a predetermined interval; for example, traffic is collected and stored once every 5 minutes. The traffic is thus collected at a specific interval throughout a specific collection period, for example every 5 minutes for 30 days.
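For a sense of scale, the example figures above (one sample every 5 minutes for 30 days) fix the number of traffic samples available for forming the hyperplane. The short calculation below uses only those two figures; the variable names are illustrative.

```python
SAMPLE_INTERVAL_MIN = 5   # example interval from the text
COLLECTION_DAYS = 30      # example collection period from the text

samples_per_day = 24 * 60 // SAMPLE_INTERVAL_MIN     # 288 samples per day
total_samples = samples_per_day * COLLECTION_DAYS    # 8640 samples in 30 days
```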

A traffic vector carrying specific information is generated from the collected traffic (S20). That is, after the traffic is collected, traffic vectors are generated, each composed of a plurality of components representing the characteristics of the corresponding collected traffic.

The components of each traffic vector include the bandwidth volume, the flow volume, and the packet volume. That is, each traffic vector represents the bandwidth volume, flow volume, and packet volume of the corresponding traffic.
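One way to picture a traffic vector is as a summary of all flow records seen in one sampling interval. The sketch below is an assumption-laden illustration: the `FlowRecord` and `TrafficVector` types are hypothetical, and expressing the bandwidth component as average bits per second over the interval is a choice made here, since the text does not specify units.

```python
from typing import NamedTuple


class FlowRecord(NamedTuple):
    # Hypothetical, simplified flow record; real NetFlow v5 records carry more fields.
    packets: int
    octets: int


class TrafficVector(NamedTuple):
    bandwidth: float  # average bits per second over the interval (assumed unit)
    flows: int        # number of flow records in the interval
    packets: int      # total packets in the interval


def make_traffic_vector(records, interval_s=300):
    """Summarize the flow records of one sampling interval as a 3-component vector."""
    records = list(records)
    total_octets = sum(r.octets for r in records)
    total_packets = sum(r.packets for r in records)
    return TrafficVector(total_octets * 8 / interval_s, len(records), total_packets)


vector = make_traffic_vector([FlowRecord(3, 1500), FlowRecord(7, 3000)])
```

The default interval of 300 seconds matches the 5-minute example interval mentioned in the description.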

Once the traffic vectors have been generated, they are used as data to form the traffic groups and the hyperplane (S30). That is, using the traffic vectors as data, a hyperplane that separates a normal group from an abnormal group is formed by the Fisher linear discriminant.

According to the present invention, the distributions of the bandwidth volume, flow volume, and packet volume that make up the traffic vectors follow a chi-square distribution.

That is, the distribution of each component of the traffic vectors over the collection period (for example, 30 days) follows a chi-square distribution. A criterion is then set for deciding whether a traffic vector corresponds to normal or abnormal traffic.

In the present invention, for each component, the range in which the volume is relatively very large is regarded as the abnormal range. Specifically, traffic is determined to be abnormal if its bandwidth volume falls within a specified upper range of the chi-square distribution over the collection period, if its flow volume falls within that upper range, or if its packet volume falls within that upper range.

That is, a traffic vector is classified as abnormal traffic if the volume of at least one of its components falls within the upper range of the distribution, and as normal traffic if none of its component volumes do.

In the present invention, the upper range of the distribution is set between 4% and 6%, because experiments showed high detection accuracy when the hyperplane was formed on the basis of this range and then used to detect normal and abnormal traffic.

Traffic classified as normal forms the normal group, and traffic classified as abnormal forms the abnormal group. In summary, a traffic vector is placed in the abnormal group if the volume of at least one component falls within the top 4% to 6%, and in the normal group if none of its component volumes fall within the top 4% to 6%.
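The labeling rule can be sketched as follows. Note the substitution: the patent derives the upper range from a chi-square distribution of each component, whereas this sketch simply uses an empirical rank cutoff on the collected values, which is an assumption made to keep the example self-contained. The 5% fraction is one value inside the stated 4% to 6% range, and the function names are hypothetical.

```python
def upper_tail_cutoff(values, top_fraction=0.05):
    """Largest value that is NOT among the top `top_fraction` of `values`."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    n_top = max(1, round(len(ordered) * top_fraction))
    n_top = min(n_top, len(ordered) - 1)
    return ordered[len(ordered) - n_top - 1]


def label_traffic_vectors(vectors, top_fraction=0.05):
    """Label a vector abnormal if ANY component exceeds that component's cutoff."""
    cutoffs = [upper_tail_cutoff(list(col), top_fraction) for col in zip(*vectors)]
    labels = [
        "abnormal" if any(v > c for v, c in zip(vec, cutoffs)) else "normal"
        for vec in vectors
    ]
    return cutoffs, labels


# Toy data: only the first component varies, taking the values 0..99.
vectors = [(i, 10, 10) for i in range(100)]
cutoffs, labels = label_traffic_vectors(vectors)
```

The strict comparison keeps a component whose values are all identical from marking every vector as abnormal.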

The reference plane between the normal group and the abnormal group is the hyperplane formed by the Fisher linear discriminant. The hyperplane is formed by obtaining the normal vector $w$ with the Fisher linear discriminant described above and then finding the plane that is perpendicular to $w$ and passes through the mean point, that is, the point in space given by the overall mean of each component.
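Given a normal vector, placing the hyperplane through the mean point reduces to one dot product. The sketch below normalizes the vector to unit length and computes the offset $d$ so that the plane $w^t x = d$ contains the overall mean. The function name and sample values are assumptions for illustration.

```python
import math


def hyperplane_through_mean(w, vectors):
    """Return (unit normal, offset d) of the plane w·x = d through the overall mean."""
    length = math.sqrt(sum(c * c for c in w))
    unit_w = [c / length for c in w]
    n = len(vectors)
    mean_point = [sum(col) / n for col in zip(*vectors)]
    d = sum(a * b for a, b in zip(unit_w, mean_point))
    return unit_w, d


unit_w, d = hyperplane_through_mean([2.0, 0.0, 0.0], [(0, 0, 0), (4, 2, 6)])
```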

상기와 같이, 하이퍼플레인이 형성되면, 정상 그룹 영역과 비정상 그룹 영역이 구분되어진다. 즉, 하이퍼플레인의 면을 포함한 아래 영역은 정상 그룹 영역이고, 하이퍼플레인의 위 영역은 비정상 그룹의 영역이다.As described above, when the hyperplane is formed, the normal group region and the abnormal group region are divided. That is, the lower region including the plane of the hyperplane is the normal group region, and the upper region of the hyperplane is the region of the abnormal group.

지금까지 과정에 의하여, 유입되는 트래픽에 대하여 정상 또는 비정상 트래픽으로 분류할 수 있는 기준면(하이퍼플레인)을 형성하였다. 그러면, 이후부터는 실제 라우터를 통하여 유입되는 트래픽이 정상 트래픽인지, 비정상 트래픽인지 판단할 수 있다.Thus far, the reference planes (hyperplanes) that can be classified as normal or abnormal traffic have been formed. Then, from now on, it is possible to determine whether the traffic flowing through the actual router is normal traffic or abnormal traffic.

따라서, 트래픽을 라우터를 통하여 유입받는다(S40). 그런 다음, 유입되는 트래픽에 대하여 세개의 성분(대역폭(bandwidth)의 용량, 플로우(flow)의 용량 및 패킷(packet)의 용량)을 나타내는 트래픽 벡터를 생성하고, 이 세 개의 성분에 의하여 표시되는 위치가 하이퍼플레인의 위 영역에 위치하는가 판단한다(S50). 즉, 유입되는 트래픽이 하이퍼플레인의 위 영역에 위치하는가 판단한다. 물론, 하이퍼플레인의 면을 포함한 아래 영역에 위치하는가로 판단할 수도 있다.Therefore, the traffic is introduced through the router (S40). Then, for the incoming traffic, a traffic vector representing three components (a bandwidth capacity, a flow capacity, and a packet capacity) is generated, and the positions indicated by these three components are generated. It is determined whether is located in the upper region of the hyperplane (S50). That is, it is determined whether the incoming traffic is located in the upper region of the hyperplane. Of course, it can also be determined whether it is located in the lower region including the plane of the hyperplane.

If the incoming traffic lies in the region below the hyperplane, including the plane itself, it is judged to belong to the normal group (S60); if it lies in the region above the hyperplane, it is judged to belong to the abnormal group (S70).
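The decision in S50 to S70 can be sketched as a sign test on w . (x - m). This is only an illustration: the function names and the toy numbers are hypothetical, and the sketch assumes the normal vector has been oriented so that positive values correspond to the "upper" region, which the text does not spell out.

```python
from typing import Sequence


def side_of_hyperplane(x: Sequence[float], w: Sequence[float], m: Sequence[float]) -> float:
    """Signed value of w . (x - m); zero means x lies on the hyperplane."""
    return sum(wi * (xi - mi) for wi, xi, mi in zip(w, x, m))


def classify(x: Sequence[float], w: Sequence[float], m: Sequence[float]) -> str:
    """Assumption: w points toward the abnormal side, so > 0 means 'above'.

    Points exactly on the plane are treated as normal, as in the text.
    """
    return "abnormal" if side_of_hyperplane(x, w, m) > 0 else "normal"


# Hypothetical toy values, not taken from the patent.
w = (1.0, 0.5, 0.25)
m = (100.0, 10.0, 20.0)
print(classify((150.0, 12.0, 25.0), w, m))  # abnormal
print(classify((100.0, 10.0, 20.0), w, m))  # normal (on the plane)
```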

In this way, incoming traffic can be readily classified as normal or abnormal.

Meanwhile, because traffic trends change continuously, these changes need to be reflected. The present invention therefore includes a process of updating the hyperplane, which is the reference plane used to decide whether incoming traffic is normal or abnormal.

To this end, an update period is set in advance, and the hyperplane is updated when an update value that increases over time reaches the update period. For example, if the update period is one hour, the hyperplane is updated when a timer provided in the system reaches one hour, and the timer is then reset.

When it is determined that the update period has been reached, old traffic vectors are discarded and newly arrived traffic vectors are included to form a new hyperplane. Specifically, the oldest traffic vectors, covering a span equal to the update period, are discarded, and the traffic vectors newly generated during the update period are included to form the new hyperplane (S80 to S90).

For example, with an update period of 10 minutes, the traffic vectors that arrived during those 10 minutes to be judged normal or abnormal are used to form the new hyperplane, and the oldest 10 minutes' worth of the previously collected traffic vectors are discarded.
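The discard-oldest, add-newest update can be sketched with a fixed-length queue. The window size, the tuple layout and the function name are assumptions chosen for illustration; the refitting of the hyperplane itself is left out.

```python
from collections import deque

# Assumed sizes for illustration: the window holds 6 traffic vectors and
# each update period contributes 2 new ones.
WINDOW_SIZE = 6

window = deque(maxlen=WINDOW_SIZE)
window.extend([(t, t * 10, t * 100) for t in range(6)])  # initial collection


def on_update_period(window, new_vectors):
    """Add the vectors gathered during the update period.

    Because the deque has a fixed maxlen, appending n new vectors
    drops the n oldest ones from the left.
    """
    window.extend(new_vectors)
    return list(window)  # data from which a new hyperplane would be fitted


training_set = on_update_period(window, [(6, 60, 600), (7, 70, 700)])
print(training_set[0], training_set[-1])  # (2, 20, 200) (7, 70, 700)
```

The total number of vectors used for fitting stays constant, so each update replaces exactly one update period's worth of data.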

Because the hyperplane is updated periodically through this process, it readily adapts to changing traffic trends. Once a new hyperplane is formed, new traffic is again received and the normal/abnormal determination (S40 to S70) is repeated.

With the method described above, abnormal traffic on a sub-network can be readily detected, and changing traffic trends can be handled effectively.

A specific embodiment in which the present invention was actually applied is described in detail below.

First, the terms traffic vector and traffic group, as used in the present invention, are defined.

A vector whose components are items of traffic information is defined as a traffic vector, and a set of traffic vectors that share the same characteristics is defined as a traffic group.

The components of a traffic vector are measurable items that can characterize each traffic group. They may be information observed at the router, such as source IP, destination IP, source port, destination port, protocol, and flow. A traffic group is a collection of traffic vectors that share the same characteristics.

Traffic groups can be classified into several groups, such as a normal group, an abnormal group, a worm group, and a traffic congestion group. The abnormal group in particular can be further subdivided into a network malfunction group, a flash crowd anomaly group, a network abuse group, and so on ([P. Barford and D. Plonka, "Characteristics of Network Traffic Flow Anomalies," in Proc. ACM SIGCOMM 2001, pp. 69-73, August 2001.]).

In this embodiment, traffic collected through the backbone router of Korea University was analyzed. Traffic was collected at 5-minute intervals for 30 days, from midnight on May 14, 2008 to 23:55 on June 12, 2008. With 288 observation points per day, the traffic set comprises 8640 samples in total.

The traffic vector has three components, defined as the bandwidth capacity (b), the flow capacity (f), and the packet capacity (p). The normal group and the abnormal group are the two groups to be separated. FIG. 3 shows the bandwidth-capacity trend of the traffic collected over 30 days and of a single day taken from it.

The point to note in FIG. 3 is that very little traffic occurs from 11 p.m. to 6 a.m., because the university library is closed during these hours. FIG. 3 shows only the bandwidth capacity, but the flow capacity and the packet capacity follow a similar pattern.

To establish a criterion for separating the collected traffic into two groups, the distribution of the collected traffic is examined. FIG. 4 shows the distribution of the bandwidth capacity for one day of data, and FIG. 5 shows the distribution of the bandwidth capacity for the data collected over 30 days. FIG. 4 shows that, although it is not continuous, the distribution of daily traffic outside the 0 to 20 Mbps range follows a standard normal distribution.

The 0 to 20 Mbps range is excluded because it contains the very small amounts of data observed from 11 p.m. to 6 a.m., which do not bear on the abnormality criterion. As shown in FIG. 5, the traffic collected over 30 days follows a chi-square distribution that is skewed to the left with a long tail to the right.

Since the daily data follow a standard normal distribution and the graph obtained by squaring and summing them follows a chi-square distribution, it is reasonable to assume that the 30 days of data follow a chi-square distribution with 30 degrees of freedom.

A criterion for dividing the groups was then established on the assumption that about 5% of the data falls in the abnormal range of the total traffic. The result was obtained by referring to a table of the chi-square distribution (Table C.5, [K. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 2nd ed., John Wiley and Sons, 2002, pp. 658-664.]).

This means that traffic with a bandwidth capacity of 437.730 Mbps or more falls in the abnormal range. By this criterion, 8468 of the 8640 traffic samples are normal and the remaining 172 belong to the abnormal group. Applying the same procedure to the flow capacity and the packet capacity showed that 8617 and 8263 samples, respectively, belong to the normal group.
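The table lookup for a 5% upper tail with 30 degrees of freedom can be reproduced numerically. The sketch below is an illustration, not the procedure used in the embodiment: it relies on the closed-form tail probability that holds for an even number of degrees of freedom and finds the critical value by bisection. The function names are hypothetical.

```python
import math


def chi2_sf_even(x: float, dof: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable, even dof only."""
    if dof % 2 != 0:
        raise ValueError("the closed form used here requires an even dof")
    half = x / 2.0
    term = math.exp(-half)  # i = 0 term of the finite sum
    total = term
    for i in range(1, dof // 2):
        term *= half / i  # (x/2)^i / i! built incrementally
        total += term
    return total


def chi2_upper_quantile(tail: float, dof: int) -> float:
    """Value x with P(X > x) = tail, found by bisection."""
    lo, hi = 0.0, 10.0 * dof
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf_even(mid, dof) > tail:
            lo = mid  # tail still too heavy, move right
        else:
            hi = mid
    return (lo + hi) / 2.0


critical = chi2_upper_quantile(0.05, 30)
print(round(critical, 3))
```

The value should come out close to 43.773, the usual table entry for this case. The 437.730 Mbps threshold quoted in the text is ten times that number; how the table value is converted to Mbps is not stated in this section, so that step is not reproduced here.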

In addition, an OR condition was applied in an MS-SQL query so that a traffic vector is assigned to the abnormal group if any one of its three components is abnormal. Under this criterion, the normal and abnormal groups contained |Y_n| = 8241 and |Y_an| = 399 traffic samples, respectively.

Here |Y_n| is the number of traffic samples belonging to the normal group, and |Y_an| is the number belonging to the abnormal group.
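The OR rule can be sketched as follows. The embodiment used an MS-SQL query; this Python version is only an equivalent illustration. The flow and packet thresholds and all sample values are hypothetical, and only the 437.730 figure echoes the text.

```python
def is_abnormal(vector, thresholds):
    """OR rule: abnormal if any component reaches its own threshold."""
    return any(v >= t for v, t in zip(vector, thresholds))


# Hypothetical thresholds and samples in (bandwidth, flow, packet) order.
thresholds = (437.730, 50.0, 80.0)
samples = [
    (120.0, 10.0, 20.0),  # all components below threshold -> normal
    (450.0, 10.0, 20.0),  # bandwidth at or above threshold -> abnormal
    (120.0, 10.0, 95.0),  # packet at or above threshold -> abnormal
]

normal = [s for s in samples if not is_abnormal(s, thresholds)]
abnormal = [s for s in samples if is_abnormal(s, thresholds)]
print(len(normal), len(abnormal))  # 1 2
```

Because one sample can exceed more than one threshold, the abnormal count under the OR rule is at most the sum of the per-component abnormal counts. The reported totals are consistent with this: 8241 + 399 = 8640.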

Next, the process of obtaining the hyperplane from these results using the Fisher linear discriminant is described.

First, to obtain the normal vector w, the mean of each group and the overall mean are computed, and the values derived from them are each calculated. The means for the traffic are shown in Table 1.

- Table 1 -

To find w, the characteristic equation of the eigenvalue problem given in Equation (1) is used. From this characteristic equation, three distinct eigenvalues and the corresponding eigenvectors are obtained as follows.

λ₁ = 0.1882e+002, λ₂ = -0.1083e-14, λ₃ = 0.7932e-008

w₁ = (-28.6341, -0.9995e-002, -0.4927)^T

w₂ = (0.2597e-003, -0.3358e-002, 9.8352)^T

w₃ = (9995.6493, -0.1654e-002, -0.1577e-001)^T

The eigenvector corresponding to the largest of these eigenvalues is the normal vector w. Therefore w₁ = (-28.6341, -0.9995e-002, -0.4927)^T, which corresponds to λ₁ = 0.1882e+002, is the normal vector.
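For two groups, the between-group scatter has rank one, so the eigenvector belonging to the largest eigenvalue of the standard Fisher problem is proportional to S_W^{-1}(m_an - m_n). The sketch below uses that shortcut instead of solving the characteristic equation. It assumes that Equation (1) is the standard Fisher eigenvalue problem S_W^{-1} S_B w = λw, which is not reproduced in this section. The function names and the toy data are hypothetical.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def scatter(vectors, mean):
    """Within-group scatter: sum of (v - mean)(v - mean)^T."""
    d = len(mean)
    s = [[0.0] * d for _ in range(d)]
    for v in vectors:
        diff = [v[i] - mean[i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fisher_direction(normal, abnormal):
    """Two-group Fisher direction, proportional to S_W^{-1} (m_an - m_n)."""
    m_n, m_an = mean_vector(normal), mean_vector(abnormal)
    s_n, s_an = scatter(normal, m_n), scatter(abnormal, m_an)
    d = len(m_n)
    s_w = [[s_n[i][j] + s_an[i][j] for j in range(d)] for i in range(d)]
    diff = [m_an[i] - m_n[i] for i in range(d)]
    return solve(s_w, diff)


# Hypothetical toy data: each group is a small symmetric cloud.
OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
normal = [(10 + dx, 5 + dy, 2 + dz) for dx, dy, dz in OFFSETS]
abnormal = [(20 + dx, 9 + dy, 4 + dz) for dx, dy, dz in OFFSETS]

w = fisher_direction(normal, abnormal)
print(w)  # [2.5, 1.0, 0.5]
```

The rank-one structure also matches the reported eigenvalues, where λ₂ and λ₃ are numerically close to zero. Any nonzero multiple of the returned vector defines the same hyperplane, so its sign and length are a matter of convention.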

The desired hyperplane can now be defined as the plane that is perpendicular to this normal vector and passes through the mean point. The resulting equation of the hyperplane is as follows.

-28.6341(x - 103,709,872.7) - 0.9995e-002(y - 5,247,040) - 0.4927(z - 18,336,008) = 0
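The left-hand side of this equation can be evaluated for any traffic vector, as in the sketch below. The coefficients and the mean point are copied from the text. The test point, the assumption that the bandwidth component is expressed in bits per second, and the function name are illustrative only.

```python
W = (-28.6341, -0.9995e-2, -0.4927)  # normal vector w1 from the text
MEAN = (103_709_872.7, 5_247_040.0, 18_336_008.0)  # mean point from the text


def plane_value(b: float, f: float, p: float) -> float:
    """Left-hand side of the hyperplane equation for the point (b, f, p)."""
    return sum(w * (x - m) for w, x, m in zip(W, (b, f, p), MEAN))


on_plane = plane_value(*MEAN)
# Hypothetical point: bandwidth raised well above its mean, other components at the mean.
high_bandwidth = plane_value(437_730_000.0, MEAN[1], MEAN[2])
print(on_plane == 0.0, high_bandwidth < 0)  # True True
```

With the coefficients as printed, a point whose bandwidth is far above the mean gives a negative value. Which sign corresponds to the "upper" (abnormal) region therefore depends on how the normal vector is oriented, a convention the text does not state.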

FIG. 6 shows the traffic samples, w, the hyperplane, and the two separated groups. To keep the figure simple, only one tenth of the normal and abnormal traffic is plotted together with the normal vector and the hyperplane obtained above.

Applying this hyperplane to newly arriving traffic determines whether the traffic belongs to the normal group in the lower region or the abnormal group in the upper region. Because the mean values and the hyperplane change continually to reflect the trend of the traffic being monitored, real-time detection is possible.

To verify the reliability of the hyperplane obtained in this way, it was tested against the data of one specific day, July 2, 2008. On that day the first-semester grades were posted, so traffic on the university portal system was heavily congested, and the traffic trend observed by the university computing center confirms that there were many abnormal traffic surges that day.

FIG. 7 shows part of a table of anomalies collected from the monitoring server. FIG. 7 indicates the types of abnormal traffic, when they occurred, and their patterns. On the selected day, many abnormal-looking jump traffic events and sharply surging profile traffic events were observed, and such abnormal traffic persisted for long periods and occurred frequently.

In addition, the data set is updated with newly arriving traffic every 5 minutes: the oldest 5 minutes of traffic is deleted and the new 5 minutes of traffic is inserted to produce an updated hyperplane.

The hyperplane is updated continually at 5-minute intervals in order to block high-impact attacks such as the Slammer worm, which, unlike conventional viruses and attacks, causes damage extremely quickly ([D. Moore, V. Paxson, S. Savage, C. Shannon, S. Staniford, and N. Weaver, "Inside the Slammer worm," IEEE Security & Privacy Magazine 1(4), pp. 33-39, July/Aug. 2003.]).

The Slammer worm spread rapidly as soon as it appeared by exploiting a weakness in Microsoft SQL database software. It replicated itself across the network, doubling every 8.5 seconds and infecting 90% of vulnerable hosts in just 10 minutes. About three minutes after its appearance, when its propagation rate peaked, it was sending 55 million scan requests per second over the Internet.

FIG. 8 shows the hyperplane obtained after deleting the traffic collected from May 14 to May 20, 2008 and replacing it with one week of traffic collected from June 13 to June 19, 2008. The newly formed hyperplane is nearly identical to the previous one, differing only by a slight change in slope.

Hyperplanes updated at 5-minute intervals are so similar that the change cannot be seen with the naked eye, so the change over one week is shown instead for convenience.

The performance of the present invention is evaluated next. The terms used in the evaluation are defined first.

A case that actually belongs to the normal group and is correctly judged to be normal is called a true positive. A case that actually belongs to the abnormal group and is correctly judged to be abnormal is called a true negative. A case that actually belongs to the normal group but is wrongly judged to be abnormal is called a false positive, and a case that actually belongs to the abnormal group but is wrongly judged to be normal is called a false negative.

To quantify how accurate the detection is, the correct detection probability and the miss detection probability are defined as follows.

Here |FP|, |FN|, |TP|, and |TN| denote the numbers of false positives, false negatives, true positives, and true negatives, respectively. Applying the detection process to the 288 traffic samples of the specific day gave the results shown in Table 2.

- Table 2 -

According to the university computing center server, which provides information on abnormal traffic, July 2, the specific day considered here, had 108 normal and 180 abnormal traffic samples. When this traffic was applied to the hyperplane obtained by the present invention, 103 normal and 185 abnormal traffic samples were detected.

Table 2 shows the detection rate and the false detection rate. As Table 2 shows, the false detection rate was 3%. This appears to be an error introduced when dividing the traffic groups, because the differences in flow capacity were very small.
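The two rates can be computed from the four counts as sketched below. The formulas are an assumption: they are the common definitions, with the correct detection probability as the fraction of samples judged correctly and the miss detection probability as the fraction judged wrongly, since the defining equations are not reproduced in this section. The counts are hypothetical and are not the entries of Table 2.

```python
def detection_probabilities(tp: int, tn: int, fp: int, fn: int):
    """Return (correct, miss) under the assumed definitions.

    correct = (|TP| + |TN|) / total, miss = (|FP| + |FN|) / total
    """
    total = tp + tn + fp + fn
    correct = (tp + tn) / total
    miss = (fp + fn) / total
    return correct, miss


# Hypothetical counts for illustration only.
correct, miss = detection_probabilities(tp=90, tn=95, fp=10, fn=5)
print(correct, miss)  # 0.925 0.075
```

Under these definitions the two probabilities always sum to one, so a 3% false detection rate corresponds to a 97% correct detection rate.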

To assess the performance of the proposed detection technique, its result is compared with that of technique (1) ([K. H. Ramah, H. Ayari, and F. Kamoun, "Traffic Anomaly Detection and Characterization in the Tunisian National University Network," in Proc. NETWORKING 2006, LNCS 3976, pp. 136-147, May 2006.]), which addresses abnormal traffic detection in a similar environment.

Although the experimental data are not identical, technique (1) detected abnormal traffic on the Tunisian National University Network (TNUN). For this purpose, technique (1) periodically collected campus network data for 45 days and developed an Anomaly Detection System (ADS).

Network traffic traces are collected from the central TNUN firewall through the MIB (Management Information Base). From the collected results, the inter-anomaly time distribution and the anomaly duration distribution are calculated, so that abnormal traffic is detected within 5 minutes. Although the data are not the same, the TNUN traffic is very similar to the experimental data used in the present invention.

That experiment showed a detection rate of about 90% under the same detection conditions as those considered in the present invention. Compared with this result, the technique of the present invention is better by about 7 percentage points.

As described above, the present invention defines traffic vectors and traffic groups and uses the optimized hyperplane obtained by the Fisher linear discriminant to decide which group newly generated traffic data belongs to. Compared with existing methods, the proposed method detected traffic anomalies in real time more simply and more accurately.

In addition, by periodically reflecting traffic trends, newly emerging anomalous phenomena could be detected.

FIG. 1 is a diagram of a system for implementing the method for detecting abnormal traffic using the Fisher linear discriminant according to the present invention.

FIG. 2 is a flowchart of the method for detecting abnormal traffic using the Fisher linear discriminant according to the present invention.

FIG. 3 is a graph of the characteristics of the bandwidth capacity used in the present invention.

FIG. 4 shows the daily data distribution of the bandwidth capacity used in the present invention.

FIG. 5 shows the distribution of the bandwidth capacity of the data collected over 30 days, as used in the present invention.

FIG. 6 is a graph showing the hyperplane used in the present invention.

FIG. 7 shows an example of abnormal traffic observed by the monitoring server according to an embodiment of the present invention.

FIG. 8 compares the existing hyperplane with the updated hyperplane.

Claims

A method for detecting abnormal traffic using a Fisher linear discriminant, the method comprising: a first step of collecting traffic from a router at a specific interval during a specific collection period, generating traffic vectors each consisting of a plurality of components representing the characteristics of the collected traffic, and forming, with the traffic vectors as data and using the Fisher linear discriminant, a hyperplane capable of separating a normal group from an abnormal group;

a second step of, after the hyperplane is formed, determining that incoming traffic belongs to the normal group if it lies in the region below the hyperplane, including the plane itself, and that it belongs to the abnormal group if it lies in the region above the hyperplane;

a third step of determining, after the determination, whether a preset update period has been reached; and

a step of, when the update period has been reached, discarding the oldest traffic vectors collected over a span corresponding to the update period, forming a new hyperplane that includes the traffic vectors newly generated during the update period, and then returning to the second step.

The method according to claim 1,

wherein the components of each traffic vector are the bandwidth capacity, the flow capacity, and the packet capacity.

The method according to claim 2,

wherein the distribution of each component of the traffic vectors over the collection period follows a chi-square distribution, and a traffic vector is classified into the abnormal group if at least one of its components falls in the top 4% to 6%, and into the normal group if none of its components falls in the top 4% to 6%.

The method according to claim 3,

wherein the number of degrees of freedom of the chi-square distribution corresponds to the collection period.

The method according to claim 1,

wherein the hyperplane is a plane that is perpendicular to the normal vector and passes through the overall mean point of the normal group and the abnormal group combined.