KR20150019280A

KR20150019280A - Apparatus and method for Internet traffic classification

Info

Publication number: KR20150019280A
Application number: KR20130095949A
Authority: KR
Inventors: 김기창; 황진수; 김진경
Original assignee: 인하대학교 산학협력단
Priority date: 2013-08-13
Filing date: 2013-08-13
Publication date: 2015-02-25
Also published as: KR101539649B1

Abstract

An object of the present invention is to provide an apparatus and method for classifying Internet traffic that can classify applications that are difficult to classify. The apparatus includes a connection information detecting unit configured to detect connection information of a network, an information calculating unit configured to calculate numerical values for determining groups based on the detected connection information, a group assigning unit configured to classify, based on the calculated numerical values, groups of the packets that constitute the connection information, and an evaluation unit configured to measure a misclassification probability of the classified groups. The apparatus determines the degree of similarity between data on the network and distinguishes similar data packets from one another.

Description

[0001] Apparatus and method for Internet traffic classification

The present invention relates to a packet classification technique that uses statistical characteristics of network traffic data, and more particularly to an Internet traffic classification technique using a Markov model and Kullback-Leibler information.

Traditional packet classification techniques are port-based or payload-based. Port-based techniques classify packets by the port number carried in the packet. However, as many applications, peer-to-peer (P2P) applications foremost among them, have come to use dynamic or unpredictable port numbers, the usefulness of this technique has hit a wall. Payload-based techniques, also called deep packet inspection (DPI), identify application patterns by analyzing packet contents in depth. This approach is also reaching its limits because of frequent changes in packet formats, packet encryption, and the rising cost of packet analysis. Packet classification techniques that use the statistical characteristics of network traffic data are therefore gaining attention.

Machine learning techniques are divided into unsupervised learning and supervised learning. Unsupervised learning, also called clustering, operates solely on the similarity of network traffic data. Clustering is used when training traffic data is scarce or unavailable. Unsupervised learning gathers unlabeled packets into clusters of related packets. One example is the K-means technique, which represents the feature vector of each flow as a point in a ρ-dimensional space and assigns flows that lie close together in that space to the same cluster.

Supervised learning uses training traffic data whose application names are known. It learns the pattern of each application's packet data and then infers the application name of an unlabeled packet by pattern matching. Supervised learning starts from packet classes with known labels. Features such as packet size, packet direction, and inter-packet arrival time are extracted for each packet class and used to train a model that represents the class. Several modeling techniques have been proposed for this purpose: Roughan et al. propose NN (Nearest Neighbors), LDA (Linear Discriminant Analysis), and QDA (Quadratic Discriminant Analysis); Moore and Zuev propose the naive Bayes technique; and Crotti et al. propose protocol fingerprints, which are vectors of probability density functions (PDFs) of packet size and inter-packet arrival time for each packet class, as a lightweight yet efficient representation of traffic characteristics.

Various attempts have also been made to improve the efficiency of supervised learning. Nguyen and Armitage proposed generating multiple sub-flows by sampling packets at several points during the lifetime of a flow. Most packet classification techniques take as their sample either the entire flow or the first few packets of the flow; according to Nguyen and Armitage, capturing the entire flow costs too much time, and detecting the start of a flow is not always possible. Their technique instead captures packets at various points, even after transmission has already begun, to obtain multiple sub-flows and trains the model on them.
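The sub-flow idea can be illustrated with a minimal sketch. The text does not say how Nguyen and Armitage choose their sampling points, so the fixed-size sliding window used here, and the names `sub_flows`, `window`, and `step`, are assumptions made only for illustration.

```python
def sub_flows(packets, window, step):
    """Cut one flow (a list of packets) into fixed-size sub-flows.

    Hypothetical sampling rule: a window of `window` packets is taken
    every `step` packets, so sub-flows can start anywhere in the flow,
    not only at its first packet.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(packets) - window
    # range() is empty when the flow is shorter than one window.
    return [packets[i:i + window] for i in range(0, last_start + 1, step)]


# Example flow of six packets, represented only by their index.
flow = list(range(6))
samples = sub_flows(flow, window=3, step=2)
```

Each sub-flow would then be turned into a feature vector and used as a separate training example for the class of its parent flow.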

Another line of work in packet classification considers the time-series characteristics of packets in addition to their statistical characteristics. Dainotti et al., and Mu and Wu, represent traffic classes using hidden Markov models (HMMs). Munz et al. propose a simple Markov model with 8 states and 4 stages that captures the statistical and time-series characteristics of transmitted packets in a lighter and more efficient way than an HMM. Zhang et al. propose constructing a bag of flows and classifying packets bag by bag. Compared with classifying each flow separately, bag-level classification improves accuracy when misclassified flows make up only a small share of the bag: because the majority of flows in the bag are assigned to the correct class, the bag as a whole is classified correctly, which in effect eliminates the misclassified flows.
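The effect of bag-level classification can be sketched as follows. The text does not give the aggregation rule used by Zhang et al.; a simple majority vote over per-flow predictions is assumed here, and the function names are hypothetical.

```python
from collections import Counter


def classify_bag(flow_labels):
    """Return the most frequent per-flow label in one bag (assumed majority vote)."""
    if not flow_labels:
        raise ValueError("a bag must contain at least one flow")
    # most_common(1) yields the (label, count) pair with the highest count.
    label, _count = Counter(flow_labels).most_common(1)[0]
    return label


def relabel_bag(flow_labels):
    """Give every flow in the bag the bag-level label."""
    return [classify_bag(flow_labels)] * len(flow_labels)


# One flow in the bag was misclassified as IMAP by the per-flow classifier.
per_flow = ["SMTP", "IMAP", "SMTP", "SMTP"]
corrected = relabel_bag(per_flow)
```

When the misclassified flows are a minority of the bag, relabeling removes them; when two classes are so similar that errors reach a majority, the same rule propagates the error, which is the weakness the invention addresses.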

The previously proposed technique combines the bag-of-flows (BOF) technique with Munz's Markov model. Munz's Markov model is considered to capture traffic characteristics simply yet effectively, and the BOF concept is considered to play an important role in raising classification accuracy. However, applying BOF to packet classification as is does not always give the best results when traffic classes show similar characteristics (for example, the SMTP and IMAP classes). The present invention is proposed to solve this problem.

[1] Akaike, H., "Information theory and an extension of the maximum likelihood principle," 2nd International Symposium on Information Theory, Budapest, pp. 267-281, 1973.
[2] Bernaille, L., Teixeira, R., and Salamatian, K., "Early application identification," Proc. of ACM International Conference on Emerging Network Experiments and Technologies (CoNEXT) 2006, Lisboa, Portugal, 2006.
[3] Cover, T. and Thomas, J., "Elements of Information Theory," John Wiley and Sons, NY, 1991.
[4] Crotti, M., Dusi, M., Gringoli, F., and Salgarelli, M., "Traffic classification through simple statistical fingerprinting," SIGCOMM Comput. Commun. Rev., vol. 37, no. 1, pp. 5-16, 2007.
[5] Dainotti, A., Donato W. D., and Pescape, A., "TIE: A Community-oriented Traffic Classification Platform," Lecture Notes in Computer Science, Vol. 5537, pp. 64-74, 2009.
[6] Dainotti, A., Pescape, A., and Claffy, K. C., "Issues and Future Directions in Traffic Classification," IEEE Network, January 2012.
[7] Dainotti, A., Donato W. D., Pescape, A., and Rossi, P. S., "Classification of Network Traffic Via Packet-level Hidden Markov Models," Proc. of IEEE Global Telecommunications Conference, Nov. 2008.
[8] Erman, J., Mahanti, A., Arlitt, M., Cohen, I., and Willamson, C., "Offline/realtime traffic classification using semi-supervised learning," Performance Evaluation, Vol. 64, pp. 1194-1213, 2007.
[9] Estevez-Tapiador, J. M., Garcia-Teodoro, P., and Diaz-Verdego, J. E., "Stochastic Protocol Modeling for Anomaly Based Network Intrusion Detection," Proc. of IEEE International Workshop on Information Assurance, March 2003.
[10] Heetal, S., Shinde, S., and Abhang, S. P., "State of Art Survey of Network Traffic Classification," International Journal of Computer Applications, 2011.
[11] Karagiannis, T., Broido, A., Brownlee, N., and Claffy, K., "Is P2P dying or just hiding?," Proc. 47th annual IEEE Global Telecommunications Conference (Globecom 2004), Dallas, Texas, USA, November/December 2004.
[12] Lee, Suchul, Kim, H. C., Barman, D., Lee, S., Kim, C., and Kown, T., "NetraMark: A Network Traffic Classification Benchmark," ACM SIGCOMM Computer Communication Review, Vol. 4, No. 1, Jan. 2011.
[13] Monrose, F. and Masson, G., "HMM Profiles for Network Traffic Classification," Proc. of Workshop on Visualization and Data Mining for Computer Security, pp. 9-15, Oct. 2004.
[14] Moore, A. and Zuev, D., "Internet traffic classification using Bayesian analysis techniques," ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS) 2005, Banff, Alberta, Canada, June 2005.
[15] Mu, X., Wu, W., and Enabled C., "A Parallelized Network Traffic Classification Based on Hidden Markov Model," Distributed Computing and Knowledge Discovery, Oct. 2011.
[16] Munz, G., Dai, H., Braum, L., and Carle, G., "TCP Traffic Classification Using Markov Models," TMA'10 Proceedings of the Second International Conference, pp. 127-140, 2010.
[17] Nguyen, T. and Armitage, G., "Training on multiple sub-flows to optimise the use of Machine Learning classifiers in real-world IP networks," Proc. IEEE 31st Conference on Local Computer Networks, Tampa, Florida, USA, November 2006.
[18] Nguyen, Thuy T. T. and Armitage, G., "A Survey of Techniques for Internet Traffic Classification Using Machine Learning," IEEE Communications Surveys and Tutorials, Vol. 10, No. 4, 2008.
[19] Roughan, M., Sen, S., Spatscheck, O., and Duffield, N., "Class-of-service mapping for QoS: A statistical signature-based approach to IP traffic classification," Proc. ACM/SIGCOMM Internet Measurement Conference (IMC) 2004, Taormina, Sicily, Italy, October 2004.
[20] Zhang, J., Xiang, Y., Wang, Y., Zhou, W., Xiang, Yong, and Guan, Y., "Network Traffic Classification Using Correlation Information," IEEE Trans. on Parallel and Distributed Systems, Vol. 24, No. 1, pp. 104-117, Jan. 2013.

An object of the present invention is to provide an Internet traffic classification apparatus and method for classifying applications that are difficult to classify.

Another object of the present invention is to provide an Internet traffic classification apparatus that solves the problem of classification performance dropping sharply when traffic patterns overlap to a high degree.

The apparatus may include a connection information detecting unit that detects connection information of a network, an information calculating unit that calculates numerical values for determining a group based on the detected connection information, and a group assigning unit that classifies, based on the calculated values, the group of the packets constituting the connection information.

The information calculating unit may include a bag forming unit that forms a bag based on the connection information, a Markov model generating unit that generates a Markov model based on the formed bag, and a Markov model calculating unit that calculates, based on the generated Markov model, values for computing Kullback-Leibler information.

The bag forming unit may form the bag based on the first four packets detected by the connection information detecting unit.

The Markov model generating unit may include a training model generating unit that generates a training Markov model of the corresponding application based on the formed bag, and a verification model generating unit that generates, based on the formed bag, a verification Markov model to be compared with the training Markov model of the application.

In the Markov model generating unit, a Markov model may be generated from a transition probability matrix and an initial probability distribution over a finite state space.

The training model generating unit may generate Markov models of an SMTP application and an IMAP application based on the formed bag.

The verification model generating unit may form a Markov model corresponding to related connections based on the formed bag.

The Markov model calculating unit may include an observation calculating unit that calculates, based on the Markov model generated by the Markov model generating unit, an observation value, which is the frequency with which a given pattern is observed in the Markov model, and a likelihood calculating unit that calculates, based on the observation value, a likelihood, which is the probability that the given pattern is observed in the Markov model.

The group assigning unit may include a Kullback-Leibler information calculating unit that measures a divergence value based on the values for determining the group, and a group selecting unit that selects a group based on the calculated divergence value.

The Kullback-Leibler information calculating unit may measure the divergence value by calculating Kullback-Leibler information from the likelihood of the training Markov model and the likelihood of the verification Markov model calculated by the information calculating unit.

The group selecting unit may select the group by comparing the measured divergence values.

The apparatus may further include an evaluation unit that measures a misclassification probability of the classified group.
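The chain of units described above (bag, Markov models, likelihoods, Kullback-Leibler information, group selection) can be sketched end to end. This is an illustration under several assumptions that the text does not fix: packets are already encoded as integer states 0..n-1, the models are estimated with add-alpha smoothing, the "pattern" is a full state path over the first four packets, the divergence is taken from the verification model to each training model, and the group with the smallest divergence is selected. All function names are hypothetical.

```python
import math
from itertools import product


def fit_markov(sequences, n_states, alpha=1.0):
    """Estimate (initial distribution, transition matrix) from state sequences.

    alpha is an assumed smoothing constant that keeps every probability
    strictly positive, so the logarithm below is always defined.
    """
    init = [alpha] * n_states
    trans = [[alpha] * n_states for _ in range(n_states)]
    for seq in sequences:
        if not seq:
            continue
        init[seq[0]] += 1
        for a, b in zip(seq, seq[1:]):
            trans[a][b] += 1
    init_total = sum(init)
    init = [c / init_total for c in init]
    trans = [[c / sum(row) for c in row] for row in trans]
    return init, trans


def path_likelihood(model, path):
    """Probability that the model emits exactly this state path."""
    init, trans = model
    prob = init[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= trans[a][b]
    return prob


def kl_information(g_model, f_model, n_states, length):
    """Sum of g(x) * log(g(x) / f(x)) over every state path x of the given length."""
    total = 0.0
    for path in product(range(n_states), repeat=length):
        g = path_likelihood(g_model, path)
        f = path_likelihood(f_model, path)
        if g > 0:
            total += g * math.log(g / f)
    return total


def assign_group(bag, training_models, n_states, length=4):
    """Fit a verification model on the bag and pick the closest training model."""
    verification = fit_markov(bag, n_states)
    divergences = {
        name: kl_information(verification, model, n_states, length)
        for name, model in training_models.items()
    }
    return min(divergences, key=divergences.get), divergences


# Toy data with two states; real traffic would need a state encoding
# (for example by packet direction and size), which is not specified here.
models = {
    "SMTP": fit_markov([[0, 0, 0, 0]] * 5, 2),
    "IMAP": fit_markov([[1, 1, 1, 1]] * 5, 2),
}
label, divergences = assign_group([[0, 0, 0, 0], [0, 0, 1, 0]], models, 2)
```

Enumerating all paths costs n^4 likelihood evaluations for four packets, which stays small for a model with a handful of states.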

The method may include detecting connection information of a network, calculating numerical values for determining a group based on the connection information, and classifying the group based on the calculated values.

The calculating of the numerical values for determining the group may include forming a bag based on the network information, generating a Markov model based on the formed bag, and calculating, based on the generated Markov model, values for computing Kullback-Leibler information.

The forming of the bag may form the bag based on the first four packets detected in the detecting of the connection information of the network.

The generating of the Markov model may include generating a training Markov model of the corresponding application based on the formed bag, and generating, based on the formed bag, a verification Markov model to be compared with the training Markov model of the application.

The generating of the Markov model may include generating the Markov model from a transition probability matrix and an initial probability distribution over a finite state space.

The generating of the training Markov model may include generating Markov models of an SMTP application and an IMAP application based on the formed bag.

The generating of the verification Markov model may form a Markov model of related connections based on the formed bag.

The calculating of the values for computing Kullback-Leibler information may include calculating, based on the Markov model generated in the generating of the Markov model, an observation value, which is the frequency with which a given pattern is observed in the Markov model, and calculating, based on the calculated observation value, a likelihood, which is the probability that the given pattern is observed in the Markov model.

The classifying of the group may include measuring a Kullback-Leibler divergence value based on the values for determining the group, and selecting a group based on the calculated divergence value.

The measuring of the Kullback-Leibler divergence value may include measuring the divergence value by calculating Kullback-Leibler information from the likelihood of the training Markov model and the likelihood of the verification Markov model calculated in the calculating of the numerical values for determining the group.

The selecting of the group may select the group by comparing the measured divergence values.

The method may further include measuring a misclassification probability of the classified group.

According to the Internet traffic classification apparatus and method of the present invention, a Markov-model-based classification technique using the Kullback-Leibler distance makes it possible to classify applications that are difficult to classify.

By combining the Markov-model-based classification technique using Kullback-Leibler information with a technique that places related traffic in the same bag for processing, the invention can achieve high classification performance even when traffic patterns overlap to a high degree.

FIG. 1 is a block diagram illustrating the configuration of an Internet traffic classification apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the configuration of the information determination unit of the Internet traffic classification apparatus according to an embodiment of the present invention.
FIG. 3 is a block diagram illustrating the configuration of the Markov model generating unit of the Internet traffic classification apparatus according to an embodiment of the present invention.
FIG. 4 is a block diagram illustrating the configuration of the Markov model calculating unit of the Internet traffic classification apparatus according to an embodiment of the present invention.
FIG. 5 is a block diagram illustrating the configuration of the group assigning unit of the Internet traffic classification apparatus according to an embodiment of the present invention.
FIG. 6 is a table showing the evaluation measures used by the evaluation unit of the Internet traffic classification apparatus according to an embodiment of the present invention.
FIG. 7 is a table setting out the virtual model used to run a simulation of the Internet traffic classification apparatus according to an embodiment of the present invention.
FIG. 8 is a table comparing performance measurement results on real traffic data for the Internet traffic classification apparatus according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating the procedure of an Internet traffic classification method according to an embodiment of the present invention.

Hereinafter, an Internet traffic classification apparatus and method according to the present invention will be described in detail with reference to the accompanying drawings. The configuration and operation of the present invention shown in and described with reference to the drawings are presented as at least one embodiment, and the technical idea of the present invention and its core configuration and operation are not limited thereby.

The terms used in the present invention are, as far as possible, general terms in wide current use, selected with the functions of the present invention in mind; they may nevertheless vary with the intention or custom of those skilled in the art or with the emergence of new technology. In certain cases the applicant has chosen a term arbitrarily, in which case its meaning is set out in detail in the corresponding part of the description. The terms used in the present invention should therefore be understood on the basis of their meaning and the overall content of the present invention, not merely by their names.

In a Markov model, a discrete Markov model {M(t), t = 1, ..., m} with a finite state space S = {1, ..., n} is determined, as in [Equation 1], by the transition probability matrix P = {p_ij} and the initial probability distribution of the state space S.
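The image for [Equation 1] is not reproduced in this text. A standard form consistent with the definitions above would be the following, where the symbol π for the initial distribution is an assumed notation:

```latex
\begin{align}
p_{ij} &= \Pr\bigl(M(t+1) = j \mid M(t) = i\bigr), \qquad i, j \in S, \\
\pi_i  &= \Pr\bigl(M(1) = i\bigr), \qquad i \in S, \qquad \sum_{i \in S} \pi_i = 1 .
\end{align}
```

Under this form, each row of P sums to one, and the probability of a state path is the initial probability of its first state multiplied by the transition probabilities along the path.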

Kullback-Leibler information is well known as a measure of dissimilarity between two probability distributions. Let {x₁, ..., x_n} be n observations drawn at random from an unknown true probability distribution G(x), and let F(x) be an arbitrary probability distribution. The goodness of fit of the model defined by F(x) is assumed to be evaluated by its similarity, as a probability distribution, to the true distribution G(x). Akaike proposed using [Equation 2] as the Kullback-Leibler information (or divergence).

Here E_G denotes the expectation with respect to the probability distribution G. For discrete distributions, the Kullback-Leibler information, or the relative entropy of G with respect to F, is given by [Equation 3].

Here g and f denote the probability mass functions of the distribution functions G and F, respectively.
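The images for [Equation 2] and [Equation 3] are not reproduced in this text. The usual definitions that match the description above are given below; the symbol I(G; F) for the Kullback-Leibler information is an assumed notation, and in [Equation 2] g and f stand for the density or mass functions of G and F:

```latex
\begin{align}
I(G; F) &= E_G\!\left[ \log \frac{g(X)}{f(X)} \right], \\
I(G; F) &= \sum_{i} g(x_i) \, \log \frac{g(x_i)}{f(x_i)} \quad \text{(discrete case)} .
\end{align}
```

The quantity is non-negative, equals zero only when g and f coincide, and is not symmetric in G and F, so the direction in which it is computed matters.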

In evaluating classification methods, the performance of a classifier can be calculated by measuring its error rate or misclassification probability. In binary classification, the total probability of misclassification (TPM) is defined as in [Equation 4].

In [Equation 4], Pr(π_k) denotes the prior probability of class k.
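The image for [Equation 4] is not reproduced in this text. A standard two-class form consistent with the description would be the following, where P(j | k), the probability that an observation from class k is assigned to class j, is an assumed notation:

```latex
\begin{equation}
\mathrm{TPM} = \Pr(\pi_1)\, P(2 \mid 1) + \Pr(\pi_2)\, P(1 \mid 2)
\end{equation}
```

Each term weights the error rate within one class by how often that class occurs.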

The recall, precision, and F-measure defined by Munz and Zhang, as in [Equation 5], are used to evaluate per-class performance.

Let N be the total number of observations, each belonging to one of the two classes. The evaluation measures can then be expressed with a simple 2×2 frequency table, as in [Table 1].

                Predicted class 1   Predicted class 2   Total
True class 1    n₁₁                 n₁₂                 n₁.
True class 2    n₂₁                 n₂₂                 n₂.
Total           n.₁                 n.₂                 N
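The images for the per-class measures and the TPM estimate are not reproduced in this text. Written with the counts of [Table 1], the usual definitions would be as follows for class k = 1, 2; the hat on the TPM estimate is an assumed notation:

```latex
\begin{align}
\mathrm{recall}_k &= \frac{n_{kk}}{n_{k\cdot}}, \qquad
\mathrm{precision}_k = \frac{n_{kk}}{n_{\cdot k}}, \\
F_k &= \frac{2 \cdot \mathrm{recall}_k \cdot \mathrm{precision}_k}
            {\mathrm{recall}_k + \mathrm{precision}_k}, \\
\widehat{\mathrm{TPM}} &= \frac{n_{12} + n_{21}}{N} .
\end{align}
```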

Thus recall₁ is the fraction of actual application 1 connections that are correctly classified as application 1, and precision₂ is the fraction of all connections classified as application 2 that are correctly classified. The F-measure is the harmonic mean of recall and precision and indicates overall accuracy. Based on these empirical measures, the TPM is estimated as in [Equation 7].
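The measures above can be computed directly from the four cells of the 2×2 table. The sketch below assumes the usual empirical definitions (recall as the share of a true class that is recovered, precision as the share of a predicted class that is correct, and the TPM estimate as the off-diagonal share of N); the function name `evaluate` and the sample counts are hypothetical.

```python
def evaluate(n11, n12, n21, n22):
    """Per-class recall, precision, F-measure and the empirical TPM.

    n_ij is the number of connections of true class i predicted as class j.
    The sketch assumes every row and column total is non-zero and that each
    class has at least one correctly classified connection.
    """
    total = n11 + n12 + n21 + n22
    true_totals = {1: n11 + n12, 2: n21 + n22}   # row totals
    pred_totals = {1: n11 + n21, 2: n12 + n22}   # column totals
    correct = {1: n11, 2: n22}

    result = {}
    for k in (1, 2):
        recall = correct[k] / true_totals[k]
        precision = correct[k] / pred_totals[k]
        # Harmonic mean of recall and precision.
        f_measure = 2 * recall * precision / (recall + precision)
        result[k] = {"recall": recall, "precision": precision, "f": f_measure}

    # Share of all observations that fall off the diagonal.
    result["tpm"] = (n12 + n21) / total
    return result


# Made-up counts for illustration only.
scores = evaluate(40, 10, 5, 45)
```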

FIG. 1 is a block diagram illustrating the configuration of an Internet traffic classification apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the Internet traffic classification apparatus 1 may include a control unit 10 and a network interface unit 20.

네트워크 인터페이스부(20)는 네트워크로 부터 패킷을 수신한다. 네트워크는 백본망과 가입자망으로 구성될 수 있다. 백본망은 X.25 망, Frame Relay 망, ATM망, MPLS(Multi Protocol Label Switching) 망 및 GMPLS(Generalized Multi Protocol Label Switching) 망 중 하나 또는 복수의 통합된 망으로 구성될 수 있다. 가입자망은 FTTH(Fiber To The Home), ADSL(Asymmetric Digital Subscriber Line), 케이블망, Wireless LAN(IEEE 802.11b, IEEE 802.11a, IEEE802.11g, IEEE802.11n), WIBro(Wireless Broadband), Wimax 및 HSDPA(High Speed Downlink Packet Access)일 수 있다. 일부 실시예로, 네트워크는 인터넷망일 수 있고, 이동 통신망일 수 있다.The network interface unit 20 receives packets from the network. The network may consist of a backbone network and a subscriber network. The backbone network may be composed of one or a plurality of integrated networks of X.25 network, Frame Relay network, ATM network, MPLS (Multi Protocol Label Switching) network and GMPLS (Generalized Multi Protocol Label Switching) network. The subscriber network may be a fiber to the home (FTTH), an asymmetric digital subscriber line (ADSL), a cable network, a wireless LAN (IEEE 802.11b, IEEE 802.11a, IEEE 802.11g, IEEE 802.11n), WIBro HSDPA (High Speed Downlink Packet Access). In some embodiments, the network may be an Internet network and may be a mobile communication network.

인터넷 트래픽 분류 장치(1)는 네트워크 장치, 서버 데스크톱, 랩톱, 태블릿 또는 핸드헬드 컴퓨터 등의 퍼스널 컴퓨터 시스템일 수 있다. 네트워크 장치는 라우터, 스위칭장비, 게이트웨이, 침입 방지 시스템(IDS: Intrusion Detection System) 및 침입 예방 시스템(IPS: Intrusion Prevention System)The Internet traffic classification device 1 may be a personal computer system such as a network device, a server desktop, a laptop, a tablet or a handheld computer. Network devices include routers, switching equipment, gateways, Intrusion Detection Systems (IDS), and Intrusion Prevention Systems (IPS)

The control unit (10) may be implemented on a single chip, on multiple chips, or on multiple electrical components. Various architectures may be used for the control unit (10), including, for example, a dedicated or embedded processor, a single-purpose processor, a controller, or an ASIC.

The control unit (10) may include a connection information detecting unit (100), an information calculating unit (200), a group assigning unit (300), and an evaluation unit (400).

The connection information detecting unit (100) may detect information transmitted over the network and pass it to the information calculating unit (200).

The connection information may be information carried over Transmission Control Protocol (TCP) connections on the Internet. It may be information sent from a client to a server and information sent from a server to a client. The connection information may also be network traffic data.

The information calculating unit (200) may process the received connection information using the calculation method configured in the information calculating unit (200). The information calculating unit (200) may generate Markov models and perform calculations on the generated Markov models. The Markov model calculated by the information calculating unit (200) may be passed to the group assigning unit (300).

The group assigning unit (300) may select a group based on the calculated Markov model. The group may be a group that distinguishes Internet traffic information. The determined group may be passed to the evaluation unit (400).

The evaluation unit (400) may calculate the error rate, or misclassification probability, of the determined group.

FIG. 2 is a block diagram illustrating the configuration of the information determination unit of the Internet traffic classification apparatus according to an embodiment of the present invention.

Referring to FIG. 2, the information calculating unit (200) may include a bag forming unit (210), a Markov model generation unit (220), and a Markov model calculation unit (230).

The bag forming unit (210) may form a bag from the first four packets of the connection information received from the connection information detecting unit (100). The packets may be distinguished by port number.

The bag forming unit (210) focuses on the applications SMTP (port 25) and IMAP (port 143), whose traffic behavior makes them difficult to separate with existing classification methods. Bernaille reported that known applications can be classified accurately using only the first four packets of a TCP connection, so the bag forming unit (210) may form a bag using only the first four packets exchanged.

There are several ways to form a bag of connections. Zhang et al. used the concept of a "flow", a set of highly correlated packets that share the same five values (source IP, source port, destination IP, destination port, and protocol). Instead of assigning individual connections, they built bags of flows (BOFs) and made assignments per BOF group.

The bag forming unit (210) defines a bag by port number alone. Forming bags whose members share all five values is time-consuming and may not be suitable for real-time classification. The method of the present invention therefore yields bags that are somewhat less correlated than BOFs, but it is fast and easy to use.

In the bag, the state space is defined by combining the packet direction with four payload-length intervals ([0, 99], [100, 299], [300, MSS-1], and [MSS]). The value of the maximum segment size (MSS) is exchanged within the TCP connection. Because there are two directions, client to server and server to client, each stage has 8 distinct states. The model of the present invention is therefore a four-stage left-to-right model with state space S = {0, 1, ..., 7}. States 0 through 3 represent the payload-length intervals in the client-to-server direction, and states 4 through 7 represent the intervals in the server-to-client direction.

For example, the state sequence 0-4-1-4 may mean that the client first (after the handshake) sends a packet of [0, 99] bytes, the server responds with a packet of [0, 99] bytes, the client then sends a packet of [100, 299] bytes, and finally the server responds with a packet of [0, 99] bytes.
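The state encoding described above can be sketched as follows. The function names and the default `mss=1460` are assumptions made for illustration (the text only says that the MSS is taken from the TCP connection), and the sketch assumes an MSS larger than 300 bytes so that all four intervals exist.

```python
def payload_bin(length, mss):
    """Map a payload length to one of the four intervals (0..3)."""
    if length < 0:
        raise ValueError("payload length must be non-negative")
    if length >= mss:
        return 3          # [MSS]
    if length <= 99:
        return 0          # [0, 99]
    if length <= 299:
        return 1          # [100, 299]
    return 2              # [300, MSS-1]


def encode_state(from_client, length, mss=1460):
    """States 0-3: client to server; states 4-7: server to client."""
    return payload_bin(length, mss) + (0 if from_client else 4)


def encode_connection(packets, mss=1460):
    """Encode the first four post-handshake packets as a pattern string.

    `packets` is a list of (from_client, payload_length) pairs.
    """
    first_four = packets[:4]
    if len(first_four) < 4:
        raise ValueError("a valid connection needs at least four packets")
    return "-".join(str(encode_state(c, n, mss)) for c, n in first_four)
```

With these definitions, `encode_connection([(True, 50), (False, 80), (True, 150), (False, 20)])` yields the pattern `0-4-1-4` from the example in the text.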

An examination of the SMTP and IMAP state sequences in the training data of the present invention confirmed that 0-4-1-4 is the most frequent sequence for both applications. The two applications also share the sequences 1-4-1-4 and 0-4-0-4. In one embodiment of the present invention, the classification model is first described for the case with only two shared sequences (0-4-1-4 and 1-4-1-4). The results can then be extended to the case in which each application additionally has its own unique sequences.

The Markov model generation unit (220) may generate Markov models from the bags grouped by the bag forming unit (210). The Markov models may be a training Markov model and a verification Markov model.

For the formed bags, Markov models M^TE and M_k may be generated. M^TE may be the verification Markov model, and M_k may be the training Markov model.

The generated Markov models may be passed to the Markov model calculation unit (230).

Based on the Markov models received from the Markov model generation unit (220), the Markov model calculation unit (230) may calculate observations, that is, the frequencies with which given patterns are observed. Based on these observations, it may also calculate likelihoods, that is, the probability that each observation occurs under a Markov model. The calculated likelihoods may be passed to the Kullback-Leibler information calculation unit (310).

FIG. 3 is a block diagram illustrating the configuration of the Markov model generation unit of the Internet traffic classification apparatus according to an embodiment of the present invention.

Referring to FIG. 3, the Markov model generation unit (220) may include a training model generation unit (222) and a verification model generation unit (224).

The training model generation unit (222) may generate training Markov models. A training Markov model may be the Markov model of a specific application.

The training model generation unit (222) may generate a separate Markov model for each known network application. The Markov model generated by the training model generation unit (222) from application k is denoted M_k^TR(t). The empirical distributions of the transition probabilities and the initial probability distribution may be obtained by the method of Munz et al.

The verification model generation unit (224) may generate verification Markov models. It may generate a Markov model for each individual connection or for each group of related connections. For each connection group, a Markov model M_k^TR(t) may be generated in the same way as in the training model generation unit (222).

FIG. 4 is a block diagram illustrating the configuration of the Markov model calculation unit of the Internet traffic classification apparatus according to an embodiment of the present invention.

Referring to FIG. 4, the Markov model calculation unit (230) may include an observation calculation unit (232) and a likelihood calculation unit (234).

Suppose the shared patterns in the training model generation unit (222) are 0-4-1-4 and 1-4-1-4. M_k may be the Markov model of application k.

In the observation calculation unit (232), the two patterns are denoted as the Markov model observations O₁ (0-4-1-4) and O₂ (1-4-1-4). Let p₁ and p₂ be the proportions of O₁ in App1 and App2, respectively. The initial state probabilities of each model, for k = 1, 2, are then given by [Equation 8].

Because a bag in the present invention has four stages, an initial state probability distribution and three transition probability matrices are needed to calculate the probability of a Markov model observation. Let P_ij^k denote the one-step transition probability from state i to state j in M_k, which is obtained from [Equation 9].

Based on the frequencies of the observations O₁ and O₂ in each model, three transition matrices P₁₂^k, P₂₃^k, and P₃₄^k can be generated. The first two transition matrices are given by [Equation 10], and P₃₄^k is obtained similarly.
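As a rough sketch of how an initial state distribution and the three stage-to-stage transition matrices might be estimated from training sequences, the Python below uses plain relative frequencies. This is an assumption: [Equation 9] and [Equation 10] are not reproduced in this text, and the function name `fit_markov` is hypothetical.

```python
N_STATES = 8   # states 0..7 (direction combined with payload interval)
N_STAGES = 4   # first four packets after the handshake


def fit_markov(sequences):
    """Estimate (initial, transitions) from 4-state sequences.

    `sequences` is a list of tuples such as (0, 4, 1, 4).
    `transitions[s][i][j]` estimates the probability of moving from
    state i at stage s+1 to state j at stage s+2.
    """
    if not sequences:
        raise ValueError("at least one training sequence is required")
    n = len(sequences)

    # Initial distribution: share of sequences starting in each state.
    initial = [0] * N_STATES
    for seq in sequences:
        initial[seq[0]] += 1
    initial = [count / n for count in initial]

    # One row-normalised count matrix per stage transition.
    transitions = []
    for stage in range(N_STAGES - 1):
        counts = [[0] * N_STATES for _ in range(N_STATES)]
        for seq in sequences:
            counts[seq[stage]][seq[stage + 1]] += 1
        matrix = []
        for row in counts:
            row_total = sum(row)
            matrix.append([c / row_total if row_total else 0.0 for c in row])
        transitions.append(matrix)
    return initial, transitions
```

For a training set made of six copies of (0, 4, 1, 4) and four copies of (1, 4, 1, 4), the initial distribution puts 0.6 on state 0 and 0.4 on state 1, which corresponds to p₁ = 0.6 in the two-pattern setting.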

The likelihood calculation unit (234) may calculate likelihoods from the observations calculated by the observation calculation unit (232). When O₁ and O₂ are the only two observations of the Markov model, l_k(O_j), defined for each connection as in [Equation 11], is the likelihood of observation O_j under model M_k.

The likelihoods are given by [Equation 12].
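The likelihood of a four-state observation under a stage-wise Markov model can be illustrated as the initial state probability multiplied by the three transition probabilities. The sketch below assumes that standard form, since [Equation 11] and [Equation 12] are not reproduced here; the sparse dictionary layout and the name `sequence_likelihood` are illustrative choices only.

```python
def sequence_likelihood(seq, initial, transitions):
    """Likelihood of a state sequence under a stage-wise Markov model.

    `initial` maps a state to its initial probability.
    `transitions[s]` maps (i, j) to the probability of moving from
    state i to state j between stage s+1 and stage s+2.
    Missing entries are treated as probability 0.
    """
    likelihood = initial.get(seq[0], 0.0)
    for stage, (i, j) in enumerate(zip(seq, seq[1:])):
        likelihood *= transitions[stage].get((i, j), 0.0)
    return likelihood


# Hypothetical two-pattern model with p = 0.6 for O1 (0-4-1-4).
initial_1 = {0: 0.6, 1: 0.4}
transitions_1 = [
    {(0, 4): 1.0, (1, 4): 1.0},  # stage 1 -> 2
    {(4, 1): 1.0},               # stage 2 -> 3
    {(1, 4): 1.0},               # stage 3 -> 4
]
```

In this two-pattern model the likelihood of O₁ reduces to the proportion p and the likelihood of O₂ to 1 - p, while any other pattern has likelihood 0.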

FIG. 5 is a block diagram illustrating the configuration of the group assigning unit of the Internet traffic classification apparatus according to an embodiment of the present invention.

The group assigning unit (300) may include a Kullback-Leibler information calculation unit (310) and a group calculation unit (320).

The Kullback-Leibler information calculation unit (310) may measure divergence by calculating the Kullback-Leibler information I(M^TE; M₁) and I(M^TE; M₂), that is, [Equation 13] for k = 1, 2.

In [Equation 13], l^TE(O_j) and l_k^TR(O_j) are the likelihoods of observation O_j under the verification Markov model and the training Markov model, respectively. Because the Markov model of the present invention has four stages and a state space of eight states, the total number of possible observations is 8⁴ = 4096. To avoid division by zero in [Equation 13], l_k^TR(O_j) = 10^-5 may be used for any j with l_k^TR(O_j) = 0 and l^TE(O_j) ≠ 0.

The group calculation unit (320) may determine the group by comparing the divergence values calculated by the Kullback-Leibler information calculation unit (310). If I(M^TE; M₁) < I(M^TE; M₂), the verification connection group may be assigned to M₁. If I(M^TE; M₁) > I(M^TE; M₂), the verification connection group may be assigned to M₂.
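The comparison above can be sketched in Python. Since [Equation 13] is not reproduced in this text, the sketch assumes the usual form of Kullback-Leibler information, the sum over observations of l^TE(O_j) · log(l^TE(O_j) / l_k^TR(O_j)), together with the 10^-5 floor described above. The function names are hypothetical, and returning `None` on an exact tie is an assumption, because the text does not say how ties are handled.

```python
import math

FLOOR = 1e-5  # substitute for a zero training likelihood


def kl_information(test_likelihoods, train_likelihoods):
    """Kullback-Leibler information of the test model relative to a training model.

    Both arguments map a pattern string (for example "0-4-1-4") to a likelihood.
    """
    total = 0.0
    for pattern, l_te in test_likelihoods.items():
        if l_te == 0:
            continue  # a zero test likelihood contributes nothing
        l_tr = train_likelihoods.get(pattern, 0.0)
        if l_tr == 0:
            l_tr = FLOOR
        total += l_te * math.log(l_te / l_tr)
    return total


def assign_group(test_likelihoods, model_1, model_2):
    """Return 1 or 2 for the model with the smaller divergence, None on a tie."""
    i_1 = kl_information(test_likelihoods, model_1)
    i_2 = kl_information(test_likelihoods, model_2)
    if i_1 < i_2:
        return 1
    if i_1 > i_2:
        return 2
    return None
```

A pattern that appears in the test group but never in a training model contributes a large positive term through the floor value, which is one way to see why a pattern unique to one application can separate the two models sharply.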

Group assignment may also use methods other than Kullback-Leibler information. One such method is the Majority method, which assigns the entire group to the class that receives the most individual assignments.
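A minimal sketch of the Majority method follows. The function name is hypothetical, and returning `None` when the top two classes tie is an assumption used here to represent an undecided group.

```python
from collections import Counter


def majority_assign(individual_labels):
    """Assign a whole group to the most frequent individual assignment.

    `individual_labels` holds one class label per connection in the group.
    Returns None for an empty group or when the top two classes tie.
    """
    ranked = Counter(individual_labels).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # undecided
    return ranked[0][0]
```

For instance, a group whose individual assignments are `[1, 1, 2]` is assigned to class 1 as a whole, while `[1, 2]` is left undecided.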

Another group assignment method is the 4096d method, which is based on Euclidean distance in 4096 dimensions. Because the present invention has 8⁴ = 4096 possible observations, a BOF of size n can be mapped to a point in 4096-dimensional space. For example, if a verification BOF of size 10 consists of four O1, two O2, one O4, and three O5, its coordinates in 4096-dimensional space are (4, 2, 0, 1, 3, 0, ..., 0). After appropriate standardization, the class of the verification BOF can be determined using Euclidean distance.
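The 4096d idea can be sketched as below. Two choices here are assumptions rather than details from the text: patterns are indexed as base-8 numbers (the text does not fix an ordering of the 4096 coordinates), and "standardization" is taken to mean dividing counts by their total so that bags of different sizes are comparable. All names are hypothetical.

```python
import math

N_STATES = 8
DIMENSIONS = N_STATES ** 4  # 4096 possible four-state patterns


def pattern_index(pattern):
    """Read a pattern such as "0-4-1-4" as a base-8 number."""
    index = 0
    for state in pattern.split("-"):
        index = index * N_STATES + int(state)
    return index


def to_vector(pattern_weights):
    """Turn {pattern: count or proportion} into a normalised 4096-d vector."""
    total = sum(pattern_weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    vector = [0.0] * DIMENSIONS
    for pattern, weight in pattern_weights.items():
        vector[pattern_index(pattern)] += weight / total
    return vector


def nearest_class(bag_counts, class_profiles):
    """Return the class whose profile is closest in Euclidean distance."""
    bag_vector = to_vector(bag_counts)
    return min(
        class_profiles,
        key=lambda name: math.dist(bag_vector, to_vector(class_profiles[name])),
    )
```

A bag with six occurrences of 0-4-1-4 and four of 1-4-1-4 lies exactly on a class profile of (0.6, 0.4) and is therefore assigned to that class rather than to a profile of (0.5, 0.5).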

FIG. 6 is a table showing the evaluation measures used in the evaluation unit of the Internet traffic classification apparatus according to an embodiment of the present invention.

Referring to FIG. 6, the recall formula can also be applied when there are three or more patterns. To obtain precision, the ratio r of App1 to App2 must be known, and Bayes' theorem is then applied. If the ratio of the two applications is unknown, r = 1 may be assumed.

Based on recall and precision, the F-measure can be obtained as in [Equation 14].

The evaluation measures can also be calculated when groups are assigned by a method other than the Kullback-Leibler information of the embodiment of the present invention.

For the Majority method, let X_k be the number of observations O₁ in a group of size n under model M_k. X_k then follows a binomial distribution, X_k ~ Bin(n, p_k), and P(X₁ > n/2) is the probability that the group is assigned to M₁. Recall and precision can therefore be calculated as in [Table 2]. P(X_k = n/2) is the probability that the group is classified as undecided.

[Table 2] Columns: M1, M2. Rows: recall, precision, undecided.
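The binomial reasoning for the Majority method can be checked numerically with a short sketch. It covers only the two-pattern case with p₁ > p₂, and it assumes that a group from model 2 counts as correctly assigned when fewer than half of its observations are O₁; the function names are hypothetical.

```python
from math import comb


def binom_pmf(n, p, x):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def majority_measures(n, p1, p2):
    """Recall and undecided probabilities for Majority group assignment.

    X_k ~ Bin(n, p_k) counts the O1 observations in a group of size n
    drawn from model k.
    """
    def above_half(p):
        return sum(binom_pmf(n, p, x) for x in range(n + 1) if 2 * x > n)

    def exactly_half(p):
        return binom_pmf(n, p, n // 2) if n % 2 == 0 else 0.0

    undecided_1 = exactly_half(p1)
    undecided_2 = exactly_half(p2)
    recall_1 = above_half(p1)                      # P(X1 > n/2)
    recall_2 = 1.0 - above_half(p2) - undecided_2  # P(X2 < n/2)
    return {
        "recall_1": recall_1,
        "undecided_1": undecided_1,
        "recall_2": recall_2,
        "undecided_2": undecided_2,
    }
```

With n = 10, p₁ = 0.6 and p₂ = 0.5, this gives recall and undecided values of about 0.6331, 0.2007, 0.3770 and 0.2461, which agree with the Maj row for group size 10 in the first simulation reported later in this description.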

For the 4096d method, the evaluation measures can be calculated using the multinomial distribution instead of the binomial distribution.

The evaluation measures can also be calculated when classification is performed by individual assignment rather than group assignment. Individual assignment is performed according to the likelihood of the observation, so the decision rule classifies O_j as in [Equation 15].

For computational convenience, assume p₁ > p₂. Then O₁ is more likely under M₁ than under M₂, and O₂ is more likely under M₂ than under M₁, so a connection is assigned to App1 when O₁ is observed and to App2 when O₂ is observed. Although unlikely, if p₁ = p₂ the connection is left undecided. Given the decision rule, the evaluation measure recall is as in [Equation 16].

Here P_k(O_j) denotes the proportion of observation O_j in model M_k, and I_A is the indicator function of set A. The recall formula above can also be applied when there are three or more patterns. To obtain precision, the ratio r of App1 to App2 must be known, and Bayes' theorem is then applied. If the ratio of the two applications is unknown, r = 1 may be assumed. The evaluation measures are summarized in [Table 3] below.

| | M1 | M2 |
|---|---|---|
| recall | p₁ | 1 - p₂ |
| precision | | |
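For the two-pattern case, the individual-assignment measures can be sketched as follows. The recall values follow the table above. The precision expressions are an assumption: they apply Bayes' theorem with r read as the prior ratio of App1 to App2, because the precision cells are not reproduced in this text. The function name is hypothetical.

```python
def individual_measures(p1, p2, r=1.0):
    """Recall, precision and F-measure for individual assignment.

    Two patterns only, with p1 > p2: O1 is assigned to App1 and O2 to App2.
    `r` is taken as the prior ratio App1 : App2 (assumed 1 if unknown).
    """
    if not p1 > p2:
        raise ValueError("this sketch assumes p1 > p2")
    recall_1 = p1
    recall_2 = 1.0 - p2
    # Bayes: P(App1 | assigned to App1) and P(App2 | assigned to App2).
    precision_1 = r * p1 / (r * p1 + p2)
    precision_2 = (1.0 - p2) / ((1.0 - p2) + r * (1.0 - p1))

    def f_measure(rec, prec):
        return 2 * rec * prec / (rec + prec)

    return {
        "R1": recall_1, "P1": precision_1, "F1": f_measure(recall_1, precision_1),
        "R2": recall_2, "P2": precision_2, "F2": f_measure(recall_2, precision_2),
    }
```

With p₁ = 0.6, p₂ = 0.5 and r = 1, the sketch gives values close to the Individual Assignments row of the first simulation reported later in this description (0.6, 0.5454, 0.5714 for App1 and 0.5, 0.5555, 0.5263 for App2).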

FIG. 7 is a table specifying the hypothetical models used to run simulations of the Internet traffic classification apparatus according to an embodiment of the present invention.

Referring to FIG. 7, in the notation that follows, the evaluation measures are written with capital-letter abbreviations, and the subscripts denote the application as before.

R_k denotes recall_k, P_k denotes precision_k, F_k denotes the F-measure, U_k denotes undecided_k, Maj denotes Majority, and K-L denotes Kullback-Leibler.

In the first simulation, p₁ = 0.6, q₁ = 0.4, p₂ = 0.5, and q₂ = 0.5.

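| Group | Size | R₁ | P₁ | F₁ | U₁ | R₂ | P₂ | F₂ | U₂ |
|---|---|---|---|---|---|---|---|---|---|
| Maj | 10 | .6331 | .6268 | .6299 | .2007 | .3770 | .6940 | .4885 | .2461 |
| Maj | 100 | .9729 | .6789 | .7997 | .0103 | .4602 | .9649 | .6232 | .0796 |
| K-L | 10 | .6331 | .6268 | .6299 | 0 | .6230 | .6294 | .6262 | 0 |
| K-L | 100 | .8211 | .8582 | .8393 | 0 | .8644 | .8285 | .8461 | 0 |
| 4096d | 10 | .6331 | .6268 | .6299 | 0 | .6230 | .6294 | .6262 | 0 |
| 4096d | 100 | .8689 | .8252 | .8465 | 0 | .8159 | .8616 | .8381 | 0 |
| Individual assignments | | .6 | .5454 | .5714 | 0 | .5 | .5555 | .5263 | 0 |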

The first simulation is the case with only two patterns. As expected, group assignment performs better than individual assignment, and K-L and 4096d give better results than Maj. Here the low R₂ values of individual assignment (50%) and Maj (46%) rise to 86% for K-L with group size 100. Because O₁ and O₂ each account for 50% in model M₂, individual assignment and Maj misassign about half of the actual App2 to App1. Even in this case, K-L improves as the group size grows.

The second simulation is similar to the first, except that application 2 has an additional pattern. In the second simulation, p₁ = 0.6, q₁ = 0.4, p₂ = 0.5, and q₂ = 0.4.

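| Group | Size | R₁ | P₁ | F₁ | U₁ | R₂ | P₂ | F₂ | U₂ |
|---|---|---|---|---|---|---|---|---|---|
| Maj | 10 | 1 | .5004 | .6670 | 0 | 1.4e-04 | 1 | 2.9e-04 | 1.4e-03 |
| Maj | 100 | 1 | .5 | .6667 | 0 | 6.3e-25 | 1 | 1.2e-24 | 5.1e-24 |
| K-L | 10 | 1 | .7415 | .8515 | 0 | .6513 | 1 | .7888 | 0 |
| K-L | 100 | 1 | .9999 | .9999 | 0 | .9999 | 1 | .9999 | 0 |
| 4096d | 10 | .6331 | .6513 | .6421 | 0 | .6611 | .6431 | .6519 | 0 |
| 4096d | 100 | .9729 | .9196 | .9455 | 0 | .9150 | .9712 | .8423 | 0 |
| Individual assignments | | 1 | .5263 | .6897 | 0 | .1 | 1 | .1818 | 0 |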

In the second simulation, individual assignment does better than the group assignment Maj, but both perform very poorly in terms of R₂ and F₂. K-L achieves a high performance of 99%.

In the third simulation, p₁ = 0.6, q₁ = 0.3, p₂ = 0.5, and q₂ = 0.4.

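| Group | Size | R₁ | P₁ | F₁ | U₁ | R₂ | P₂ | F₂ | U₂ |
|---|---|---|---|---|---|---|---|---|---|
| Maj | 10 | .8497 | .6927 | .7632 | 1.0e-01 | .3770 | .8884 | .5293 | .2461 |
| Maj | 100 | .9999 | .6848 | .8129 | 1.3e-05 | .4602 | .9999 | .6303 | .0796 |
| K-L | 10 | .8463 | .8973 | .8710 | 0 | .9031 | .8546 | .8782 | 0 |
| K-L | 100 | .9999 | .9999 | .9999 | 0 | .9999 | .9999 | .9999 | 0 |
| 4096d | 10 | .7704 | .7866 | .7784 | 0 | .7910 | .7750 | .7830 | 0 |
| 4096d | 100 | .9857 | .9808 | .9832 | 0 | .9807 | .9856 | .9831 | 0 |
| Individual assignments | | .7 | .5833 | .6363 | 0 | .5 | .6250 | .5560 | 0 |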

The third simulation is the case in which each application has one additional unique pattern. K-L still shows the best performance, but its margin over 4096d is smaller than in the second simulation. In other words, the K-L method performs relatively better when only one of the applications has a unique pattern.

In the fourth simulation, p₁ = 0.56, q₁ = 0.11, p₂ = 0.46, and q₂ = 0.07.

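| Group | Size | R₁ | P₁ | F₁ | U₁ | R₂ | P₂ | F₂ | U₂ |
|---|---|---|---|---|---|---|---|---|---|
| Maj | 10 | 1 | .6884 | .8155 | 0 | .3057 | 1 | .4682 | .2417 |
| Maj | 100 | 1 | .5909 | .7429 | 0 | .2413 | 1 | .3888 | .0665 |
| K-L | 10 | 1 | .9982 | .9991 | 0 | .9983 | 1 | .9991 | 0 |
| K-L | 100 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 4096d | 10 | 1 | .9830 | .9914 | 0 | .9827 | 1 | .9913 | 0 |
| 4096d | 100 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| Individual assignments | | 1 | .6536 | .7905 | 0 | .47 | 1 | .6395 | 0 |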

The fourth simulation is the case closest to real network traffic. In most cases Maj gives worse results than individual assignment, whereas K-L and 4096d again perform very well.

FIG. 8 is a table comparing performance measurements on real traffic data for the Internet traffic classification apparatus according to an embodiment of the present invention.

Referring to FIG. 8, real packet traces were collected to simulate network conditions according to an embodiment of the present invention. Valid TCP connections were extracted from these traces using pcap library functions. For the purposes of the present invention, a valid TCP connection is one in which communication between the client and server begins with a three-way TCP handshake and at least four packets are exchanged afterward. From these, the SMTP (port 25) and IMAP (port 143) connections were selected. Because the trace files from a real network environment are very large, this procedure yielded as many as 160,000 SMTP connections and 30,000 IMAP connections. The connections were divided into training and testing sets using 10-fold cross-validation: nine tenths of the connections were drawn at random and used to train the target Markov models, and the remaining one tenth was used for testing. For each connection, the three-way TCP handshake (SYN, SYN/ACK, final ACK) was removed, and only the first four packets after the handshake were extracted.
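The 10-fold split described above can be sketched with the Python standard library. The function name, the fixed `seed`, and the round-robin way of forming folds are illustrative assumptions; the text only states that nine tenths of the connections are used for training and one tenth for testing.

```python
import random


def ten_fold_splits(connections, k=10, seed=0):
    """Yield (train, test) pairs for k-fold cross-validation.

    The connections are shuffled once, dealt into k folds, and each
    fold serves as the test set exactly once.
    """
    items = list(connections)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, test
```

Each training set would then be used to fit one Markov model per application, and the connections in the held-out fold would be grouped into bags and classified.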

[Table 8] shows, for the IMAP packets and the SMTP packets in total, the number of times they were observed in each observation period over the given time.

[Table 8] Rows: IMAP, SMTP. Final column: Total.

For a given traffic observation O_j, the empirical likelihood under each model M_k is calculated as in [Equation 17].

In [Equation 17], the initial term is the proportion of state s1 at the first stage, and the transition terms come from the transition matrices estimated over the N_k connections in total. The performance measure recall₁ is calculated from the empirical likelihoods under the two models. The other performance measures are calculated in a similar way.

| IMAP pattern | IMAP ratio | SMTP pattern | SMTP ratio |
|---|---|---|---|
| 0-4-1-4 | 0.5564 | 0-4-1-4 | 0.4611 |
| 0-4-1-0 | 0.2189 | 0-4-0-4 | 0.2870 |
| 1-4-1-4 | 0.1134 | 0-4-0-0 | 0.0906 |
| 0-4-0-0 | 0.0790 | 1-4-1-4 | 0.0687 |
| 0-4-1-1 | 0.0108 | 1-4-2-4 | 0.0150 |
| 0-4-0-4 | 0.0108 | 1-4-0-4 | 0.0141 |

[Table 9] lists the patterns of each application in descending order of frequency. Both applications have 0-4-1-4 as their most frequent pattern, and patterns such as 0-4-0-4, 0-4-0-0, and 1-4-1-4 also appear often in both.

[Table 9] shows why SMTP and IMAP are difficult to classify with existing methods such as Zhang's BOF technique or Munz's simple Markov model technique. As [Table 9] shows, the ratio of pattern 0-4-1-4 is higher for IMAP (0.5564) than for SMTP (0.4611), and the same holds for 1-4-1-4 (0.1134 versus 0.0687), so the SMTP patterns 0-4-1-4 and 1-4-1-4 are misclassified as IMAP. With a simple classification technique, the SMTP recall therefore cannot exceed 1 - 0.5298. The group-based classification used by the Maj technique makes the SMTP recall even worse than individual classification, because more than half of the SMTP patterns are misclassified.

The method using Kullback-Leibler information shows the best performance. Its SMTP recall (R₂) exceeds 90% across the bag sizes tested and approaches 100% for bag size 100. Its IMAP recall (R₁) is also very high, though somewhat lower than that of the Maj technique. It should be kept in mind, however, that IMAP recall is high under Maj because many SMTP packets are misclassified as IMAP packets. The 4096d technique also shows a fairly high recall of around 90% for SMTP, but its recall for IMAP packets drops to about 65%. This appears to be because simple Euclidean distance in 4096 dimensions does not carry enough information to separate applications whose traffic patterns overlap heavily. Moreover, because the 4096d technique treats distances in all dimensions equally, a pattern unique to a particular application cannot be distinguished unless the distance difference is large.

FIG. 9 is a flowchart illustrating an Internet traffic classification method according to an embodiment of the present invention.

Referring to FIG. 9, the TCP connection information detecting unit 100 detects TCP information on the network (S100). The TCP information may be Internet traffic information from an actual network. The TCP information may also be information detected on the ports of SMTP and IMAP, applications that are difficult to classify with existing classification methods. The detected TCP information may be transmitted to the bag forming unit 210.

The bag forming unit 210 forms a bag (group) based on the first four packets of the detected TCP information (S110). The bag (group) may be formed using only the port number. The formed bag may be transmitted to the Markov model generation unit 220.
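Bag formation in step S110 might be sketched as follows. This is a minimal illustration, assuming each packet has already been mapped to an integer state and that the connections passed in were all observed on one port; the state values, the bag size, and the function names are hypothetical, since this passage does not specify them.

```python
from typing import List, Sequence, Tuple

Pattern = Tuple[int, ...]

def first_packets_pattern(states: Sequence[int], n_packets: int = 4) -> Pattern:
    """Return the state pattern of the first n_packets packets of one connection."""
    return tuple(states[:n_packets])

def form_bags(connections: Sequence[Sequence[int]], bag_size: int,
              n_packets: int = 4) -> List[List[Pattern]]:
    """Group consecutive connections (assumed to share a port) into bags."""
    patterns = [first_packets_pattern(c, n_packets)
                for c in connections if len(c) >= n_packets]
    return [patterns[i:i + bag_size] for i in range(0, len(patterns), bag_size)]

# Hypothetical per-packet states for five connections observed on one port.
connections = [
    [0, 4, 1, 4, 2, 2],
    [1, 4, 1, 4],
    [0, 4, 1],          # fewer than four packets, so it is skipped
    [0, 4, 1, 4, 5],
    [2, 3, 1, 4],
]
bags = form_bags(connections, bag_size=2)
print(bags)
```

Skipping connections shorter than four packets is a design choice of this sketch, not something the description states.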

The information calculating unit 200 generates Markov models from the formed bag and calculates the likelihood values needed for the Kullback-Leibler information (S120). The Markov models may be the Markov models of applications 1 and 2 and a verification Markov model.

The training model generation unit 220 generates a Markov model of application 1 based on the formed bag (S121). Application 1 may be SMTP (port 25) or IMAP (port 143).
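Generating a Markov model from a bag amounts to estimating an initial probability distribution and a transition probability matrix from the observed state patterns. The sketch below is one way to do this under stated assumptions: the number of states (8), the add-alpha smoothing, and the name `fit_markov` are illustrative choices, not details given in this description.

```python
from typing import Iterable, List, Sequence, Tuple

def fit_markov(patterns: Iterable[Sequence[int]], n_states: int,
               alpha: float = 1.0) -> Tuple[List[float], List[List[float]]]:
    """Estimate (initial distribution, transition matrix) from state patterns.

    alpha > 0 is add-alpha smoothing (an assumption of this sketch); it keeps
    every probability nonzero and every row normalizable.
    """
    init_counts = [alpha] * n_states
    trans_counts = [[alpha] * n_states for _ in range(n_states)]
    for p in patterns:
        init_counts[p[0]] += 1                 # first state of the connection
        for a, b in zip(p, p[1:]):             # consecutive state pairs
            trans_counts[a][b] += 1
    init_total = sum(init_counts)
    initial = [c / init_total for c in init_counts]
    transition = [[c / sum(row) for c in row] for row in trans_counts]
    return initial, transition

# Hypothetical training bag for one application.
training_patterns = [(0, 4, 1, 4), (1, 4, 1, 4), (0, 4, 1, 4)]
initial, transition = fit_markov(training_patterns, n_states=8)
print(initial[0], transition[4][1])
```

The same routine could be applied to the bags of application 1, application 2, and the connections to be verified, giving the three models named in step S120.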

The observed value computing unit 232 computes an observed value based on the generated Markov model of application 1 (S122). The computed observed value may be transmitted to the likelihood calculator 234 for the likelihood calculation.

The likelihood calculator 234 calculates a likelihood value based on the computed observed value (S123). The likelihood value may be a value used to calculate the Kullback-Leibler information.
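Steps S122 and S123 can be sketched as counting how often each pattern occurs and evaluating how likely each pattern is under a Markov model. The formulas here are the standard first-order Markov chain ones and are an assumption about what the units compute; the two-state model and all function names are hypothetical.

```python
import math
from collections import Counter

def pattern_counts(bag):
    """Observed value: how many times each pattern appears in the bag."""
    return Counter(tuple(p) for p in bag)

def pattern_probability(pattern, initial, transition):
    """P(s1..sn) = initial[s1] * product of transition[s_i][s_{i+1}]."""
    prob = initial[pattern[0]]
    for a, b in zip(pattern, pattern[1:]):
        prob *= transition[a][b]
    return prob

def bag_log_likelihood(bag, initial, transition):
    """Log-likelihood of the bag; assumes every pattern has nonzero probability."""
    counts = pattern_counts(bag)
    return sum(n * math.log(pattern_probability(p, initial, transition))
               for p, n in counts.items())

# Hypothetical two-state model and bag, for illustration only.
initial = [0.5, 0.5]
transition = [[0.9, 0.1],
              [0.2, 0.8]]
bag = [(0, 0, 1, 1), (0, 0, 1, 1), (1, 1, 1, 0)]

counts = pattern_counts(bag)
ll = bag_log_likelihood(bag, initial, transition)
print(counts, ll)
```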

The training model generation unit 220 generates a verification Markov model based on the formed bag (S124).

The observed value computing unit 232 computes an observed value based on the generated verification Markov model (S125). The computed observed value may be transmitted to the likelihood calculator 234 for the likelihood calculation.

The likelihood calculator 234 calculates a likelihood value based on the computed observed value (S126). The likelihood value may be a value used to calculate the Kullback-Leibler information.

The training model generation unit 220 generates a Markov model of application 2 based on the formed bag (S127). Application 2 may be SMTP (port 25) or IMAP (port 143).

The observed value computing unit 232 computes an observed value based on the generated Markov model of application 2 (S128). The computed observed value may be transmitted to the likelihood calculator 234 for the likelihood calculation.

The likelihood calculator 234 calculates a likelihood value based on the computed observed value (S129). The likelihood value may be a value used to calculate the Kullback-Leibler information.

The Kullback-Leibler information calculation unit 310 obtains a Kullback-Leibler divergence value based on the likelihood calculated in step S123 and the likelihood calculated in step S126 (S130). The divergence value may be transmitted to the group calculation unit 320.

The Kullback-Leibler information calculation unit 310 obtains a Kullback-Leibler divergence value based on the likelihood calculated in step S126 and the likelihood calculated in step S129 (S140). The divergence value may be transmitted to the group calculation unit 320.

The group calculation unit 320 determines the group by comparing the divergence value of step S130 with the divergence value of step S140 (S150). The determined group is transmitted to the evaluation unit 400 for error evaluation.
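Steps S130 to S150 can be sketched as two Kullback-Leibler divergence computations followed by a comparison. The sketch assumes that the divergence is taken from the verification model to each application model and that the smaller divergence wins; neither the direction nor the tie-breaking is stated in this passage. The pattern probabilities, the `eps` floor, and the function names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)).

    q(x) is floored at eps so that patterns missing from q do not divide
    by zero; the floor is a convenience of this sketch.
    """
    total = 0.0
    for x, px in p.items():
        if px > 0.0:
            total += px * math.log(px / max(q.get(x, 0.0), eps))
    return total

def assign_group(verification, app1, app2,
                 names=("application 1", "application 2")):
    """Pick the application whose model diverges least from the verification model."""
    d1 = kl_divergence(verification, app1)   # corresponds to S130
    d2 = kl_divergence(verification, app2)   # corresponds to S140
    return (names[0] if d1 <= d2 else names[1]), d1, d2

# Hypothetical pattern probabilities under each model.
verification = {(0, 4, 1, 4): 0.5, (1, 4, 1, 4): 0.3, (2, 3, 1, 4): 0.2}
app1 = {(0, 4, 1, 4): 0.45, (1, 4, 1, 4): 0.35, (2, 3, 1, 4): 0.2}
app2 = {(0, 4, 1, 4): 0.2, (1, 4, 1, 4): 0.2, (2, 3, 1, 4): 0.6}

label, d1, d2 = assign_group(verification, app1, app2)
print(label, d1, d2)
```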

The evaluation unit 400 measures and calculates an error rate or misclassification probability based on the determined group (S160). The error may be measured using the recall, precision, and F-measure defined by Munz and Zhang.
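The evaluation in step S160 can be sketched with the standard definitions of recall, precision, and F-measure, which are assumed here to match those of Munz and Zhang. The labels and the function name are hypothetical.

```python
def recall_precision_f(true_labels, predicted_labels, target):
    """Return (recall, precision, F-measure) for one target class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return recall, precision, f_measure

# Hypothetical ground truth and classifier output for five bags.
truth = ["SMTP", "SMTP", "SMTP", "IMAP", "IMAP"]
predicted = ["SMTP", "IMAP", "SMTP", "IMAP", "IMAP"]

recall, precision, f_measure = recall_precision_f(truth, predicted, "SMTP")
print(recall, precision, f_measure)
```

In this toy run one SMTP bag is misclassified as IMAP, which lowers SMTP recall while leaving SMTP precision at 1; this is the same mechanism by which misclassified SMTP traffic inflates IMAP recall for a competing method.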

The present invention can also be embodied as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, as well as implementations in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium may also be distributed across networked computer systems so that the computer-readable code is stored and executed in a distributed manner.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific preferred embodiments described. Anyone of ordinary skill in the art to which the invention pertains may make various modifications without departing from the gist of the invention as claimed, and such modifications fall within the scope of the claims.

20: Network interface unit
100: Connection information detecting unit
200: Information calculating unit
210: Bag forming unit
220: Markov model generation unit
222: Training model generation unit
224: Verification model generation unit
230: Markov model operation unit
232: Observed value computing unit
234: Likelihood calculator
310: Kullback-Leibler information calculation unit
320: Group calculation unit

Claims

1. An Internet traffic classification apparatus comprising:
a connection information detecting unit that detects connection information of a network;
an information calculating unit that calculates numerical values for determining a group based on the detected connection information; and
a group assigning unit that classifies groups of packets, which are the connection information, based on the calculated numerical values.

2. The apparatus of claim 1, wherein the information calculating unit comprises:
a bag forming unit that forms a bag based on the connection information;
a Markov model generation unit that generates a Markov model based on the formed bag; and
a Markov model operation unit that calculates numerical values for a Kullback-Leibler information calculation based on the generated Markov model.

3. The apparatus of claim 2, wherein the bag forming unit forms the bag based on the first four packets detected by the connection information detecting unit.

4. The apparatus of claim 2, wherein the Markov model generation unit comprises:
a training model generation unit that generates a training Markov model of an application based on the formed bag; and
a verification model generation unit that generates, based on the formed bag, a verification Markov model for comparison with the generated training Markov model of the application.

5. The apparatus of claim 2, wherein the Markov model generation unit generates the Markov model from a transition probability matrix and an initial probability distribution over a finite state space.

6. The apparatus of claim 4, wherein the training model generation unit generates Markov models of an SMTP application and an IMAP application based on the formed bag.

7. The apparatus of claim 4, wherein the verification model generation unit forms a Markov model for the relevant connections based on the formed bag.

8. The apparatus of claim 2, wherein the Markov model operation unit comprises:
an observed value computing unit that computes, based on the Markov model generated by the Markov model generation unit, an observed value that is the frequency with which a given pattern is observed in the Markov model; and
a likelihood calculator that calculates, based on the observed value, the likelihood that a given pattern will be observed in the Markov model.

9. The apparatus of claim 1, wherein the group assigning unit comprises:
a Kullback-Leibler information calculation unit that measures a divergence value based on the numerical values for determining the group; and
a group calculation unit that selects a group based on the measured divergence value.

10. The apparatus of claim 9, wherein the Kullback-Leibler information calculation unit calculates Kullback-Leibler information based on the likelihood of the training Markov model and the likelihood of the verification Markov model calculated by the information calculating unit, and measures the divergence value.

11. The apparatus of claim 9, wherein the group calculation unit selects the group by comparing the measured divergence values.

12. The apparatus of claim 1, further comprising an evaluation unit that measures a misclassification probability of the classified groups.

13. An Internet traffic classification method comprising:
detecting connection information of a network;
calculating numerical values for determining a group based on the connection information; and
classifying the group based on the calculated numerical values.

14. The method of claim 13, wherein calculating the numerical values for determining the group comprises:
forming a bag based on the connection information;
generating a Markov model based on the formed bag; and
calculating numerical values for a Kullback-Leibler information calculation based on the generated Markov model.

15. The method of claim 14, wherein forming the bag comprises forming the bag based on the first four packets detected in the step of detecting the connection information of the network.

16. The method of claim 14, wherein generating the Markov model comprises:
generating a training Markov model of an application based on the formed bag; and
generating, based on the formed bag, a verification Markov model for comparison with the generated training Markov model of the application.

17. The method of claim 14, wherein generating the Markov model comprises generating the Markov model from a transition probability matrix and an initial probability distribution over a finite state space.

18. The method of claim 16, wherein generating the training Markov model comprises generating Markov models of an SMTP application and an IMAP application based on the formed bag.

19. The method of claim 16, wherein generating the verification Markov model comprises forming a Markov model for the relevant connections based on the formed bag.

20. The method of claim 14, wherein calculating the numerical values for the Kullback-Leibler information calculation comprises:
computing, based on the Markov model generated in the step of generating the Markov model, an observed value that is the frequency with which a given pattern is observed in the Markov model; and
calculating, based on the computed observed value, the likelihood that a given pattern will be observed in the Markov model.

21. The method of claim 13, wherein classifying the group comprises:
measuring a Kullback-Leibler divergence value based on the numerical values for determining the group; and
selecting a group based on the measured divergence value.

22. The method of claim 21, wherein measuring the Kullback-Leibler divergence value comprises calculating Kullback-Leibler information based on the likelihood of the training Markov model and the likelihood of the verification Markov model calculated in the step of calculating the numerical values for determining the group, and measuring the divergence value.

23. The method of claim 21, wherein selecting the group comprises selecting the group by comparing the measured divergence values.

24. The method of claim 13, further comprising measuring a misclassification probability of the classified group.