WO2010116036A1

WO2010116036A1 - Method and device for identifying applications which generate data traffic flows

Info

Publication number: WO2010116036A1
Application number: PCT/FI2010/050275
Authority: WO
Inventors: Matti Hirvonen; Jukka-Pekka Laulajainen
Original assignee: Valtion Teknillinen Tutkimuskeskus
Priority date: 2009-04-09
Filing date: 2010-04-08
Publication date: 2010-10-14
Also published as: FI20095393A0

Abstract

A method for identifying an application that generates a data traffic flow to a communication network comprises making (101) a first preliminary identification of the application on the basis of properties of first data frames transferred at the beginning of the data traffic flow, making (102) a second preliminary identification of the application on the basis of statistical properties associated with a portion of the data traffic flow transferred after the first data frames, and identifying (103) the application at least partly on the basis of the first preliminary identification and the second preliminary identification. As the identification is made in two phases, it is possible to utilise both the characteristics of the beginning of the data traffic flow and the statistical properties related to a later phase of the data traffic flow, and therefore to improve the successfulness of the identification.

Description

METHOD AND DEVICE FOR IDENTIFYING APPLICATIONS WHICH GENERATE DATA TRAFFIC FLOWS

Field of the invention

The invention relates generally to a method and a device for identifying applications which generate data traffic flows. Furthermore, the invention relates to a network element and a computer program suitable for identifying applications which generate data traffic flows.

Background

In conjunction with telecommunications, it is commonplace to have a need to identify applications which generate data traffic flows in order to be able to handle the data traffic flows in an appropriate manner in a network element that can be, for example, a router, a switch, a terminal device, or any other device arranged to control the data traffic flows. An application can be, for example, the electronic mail, the file transfer protocol (FTP), the hypertext transfer protocol (HTTP), the Secure Shell (SSH), the voice transfer e.g. the Voice over Internet (VoIP), or any other application that generates a data traffic flow to a communication network. The identification of the application may be needed, for example, for management and optimisation of the quality of service (QoS), for an intrusion detection system (IDS), and/or for an intrusion prevention system (IPS). In present telecommunication systems, it is very often difficult or even impossible to identify an application generating an arriving data traffic flow merely on the basis of a port number related to the data traffic flow and/or on the basis of payload data analysis, because many applications are arranged to use dynamically allocated port numbers and, especially in the case of hostile activities, an application can pose as another application and/or use encryption for intentionally avoiding identification.

Publication US2006277288 discloses a system in which applications that generate data traffic flows are identified by analyzing network traffic and network host information. The network host information may be collected by network host monitors associated with network hosts. Network traffic and network host information are evaluated against data traffic flow profiles to identify data traffic flows. If data traffic flows are identified with high certainty and are associated with previously identified applications, then data traffic flow policies can be applied to the data traffic flows to block, throttle, accelerate, enhance, or transform the data traffic flows. If a data traffic flow is identified with lesser certainty or is not associated with a previously identified application, then a new data traffic flow profile can be created from further analysis of network traffic information, network host information, and possibly additional network host information collected to enhance the analysis. Hence, the application identification system is able to dynamically modify the set of data traffic flow profiles being used in order to keep up with changing circumstances. As the data traffic flows are identified at least partly by analyzing the network traffic, the above-discussed application identification system is able, at least to some extent, to identify applications that pose as another application and/or use encryption. An inconvenience related to the above-discussed application identification system is that in some situations it may be difficult to distinguish between applications that generate data traffic flows having mutually similar traffic characteristics.

Summary

In accordance with a first aspect of the invention there is provided a new device for identifying an application generating a data traffic flow. The device according to the invention comprises a processing system arranged to:

- make a first preliminary identification of the application on the basis of properties of one or more first data frames of the data traffic flow, the one or more first data frames being transferred at the beginning of the data traffic flow,

- make a second preliminary identification of the application on the basis of statistical properties associated with a portion of the data traffic flow transferred after the transferring of the first data frames, and

- identify the application at least partly on the basis of the first preliminary identification and the second preliminary identification.

As the identification of the application is carried out in two phases, it is possible to utilise both the characteristics of the beginning of the data traffic flow and also the statistical properties related to a later portion of the data traffic flow. Hence, the successfulness of the identification of the application is improved compared with the prior art described earlier in this document. The first and second preliminary identifications of the application can be made, for example, using first and second classification data obtained with the K-means clustering algorithm. Details of the K-means clustering algorithm can be found, for example, from the book "Clustering Algorithms", J. A. Hartigan (1975), Wiley.

In accordance with a second aspect of the invention there is provided a new method for identifying an application generating a data traffic flow. The method according to the invention comprises:

- making a first preliminary identification of the application on the basis of properties of one or more first data frames of the data traffic flow, the one or more first data frames being transferred at the beginning of the data traffic flow,

- making a second preliminary identification of the application on the basis of statistical properties associated with a portion of the data traffic flow transferred after the transferring of the first data frames, and

- identifying the application at least partly on the basis of the first preliminary identification and the second preliminary identification.

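In accordance with a third aspect of the invention there is provided a new network element. The network element according to the invention is arranged to receive a data traffic flow generated by an application and comprises a processing system arranged to:

- make a first preliminary identification of the application on the basis of properties of one or more first data frames of the data traffic flow, the one or more first data frames being transferred at the beginning of the data traffic flow,

- make a second preliminary identification of the application on the basis of statistical properties associated with a portion of the data traffic flow transferred after the transferring of the first data frames, and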

- identify the application at least partly on the basis of the first preliminary identification and the second preliminary identification.

The network element can be, for example, an IP-router (Internet Protocol), Ethernet switch, ATM-switch (Asynchronous Transfer Mode), base station of a mobile communications network, an MPLS-switch (Multiprotocol Label Switching), or a combination of two or more of the aforementioned.

The network element can also be a user terminal device, for example a mobile phone, a palmtop computer, a personal digital assistant, or a combination of two or more of the aforementioned. The network element can also be a home or office sited network element such as an Ethernet switch, an IP-router (Internet Protocol) or a WLAN-AP (Wireless Local Area Network - Access Point).

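In accordance with a fourth aspect of the invention there is provided a new computer program for identifying an application generating a data traffic flow. The computer program according to the invention comprises computer executable instructions for controlling a programmable processor to:

- make a first preliminary identification of the application on the basis of properties of one or more first data frames of the data traffic flow, the one or more first data frames being transferred at the beginning of the data traffic flow,

- make a second preliminary identification of the application on the basis of statistical properties associated with a portion of the data traffic flow transferred after the transferring of the first data frames, and

- identify the application at least partly on the basis of the first preliminary identification and the second preliminary identification.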

A computer program product according to the invention comprises a computer readable medium, e.g. a compact disc (CD) or a random access memory (RAM), encoded with a computer program according to the invention.

A number of exemplifying embodiments of the invention are described in the accompanying dependent claims.

Various exemplifying embodiments of the invention both as to constructions and to methods of operation, together with additional objects and advantages thereof, will be best understood from the following description of specific exemplifying embodiments when read in connection with the accompanying drawings.

The verb "to comprise" is used in this document as an open limitation that does not exclude the existence of unrecited features. The features recited in the dependent claims are mutually freely combinable unless otherwise explicitly stated.

Brief description of the figures

The exemplifying embodiments of the invention and their advantages are explained in greater detail below with reference to the accompanying drawings, in which:

figure 1 shows a high-level flow chart of a method according to an embodiment of the invention for identifying an application generating a data traffic flow,

figure 2 shows a flow chart of a method according to an embodiment of the invention for identifying an application generating a data traffic flow,

figure 3 shows a flow chart of a method according to an embodiment of the invention for identifying an application generating a data traffic flow,

figures 4a and 4b show a flow chart of making a preliminary identification of an application in a method according to an embodiment of the invention for identifying the application,

figure 5 shows a schematic illustration of a device according to an embodiment of the invention for identifying an application generating a data traffic flow, and

figure 6 shows a schematic illustration of a network element according to an embodiment of the invention for identifying an application generating a data traffic flow.

Description of the exemplifying embodiments

Figure 1 shows a high-level flow chart of a method according to an embodiment of the invention for identifying an application generating a data traffic flow. The method contains two classification phases 101 and 102 and a phase 103 for making a final decision on the basis of results obtained in those two classification phases. The classification phase 101 comprises making a first preliminary identification of the application on the basis of properties of one or more first data frames of the data traffic flow. The one or more first data frames are data frames that are transferred at the beginning of the data traffic flow. Hence, it is possible to utilise information that is present in the data traffic flow only at the beginning of the data traffic flow. The first data frames can be for example data frames that are transferred during a negotiation phase related to establishing of the data traffic flow. The negotiation phase may comprise for example hand-shaking and/or other initialisation actions of the data traffic flow. The data frames can be for example IP-packets (Internet Protocol), Ethernet frames, or other protocol data units (PDU). The classification phase 102 comprises making a second preliminary identification of the application on the basis of statistical properties associated with a portion of the data traffic flow transferred after the transferring of the first data frames. In the classification phase 102 it is possible to utilise information that is present only when the data traffic flow is at the steady state, i.e. after possible initialisations are already done. The portion of the data traffic flow used for the second preliminary identification of the application may comprise for example a first pre-determined number of data frames transferred after a second pre-determined number of earlier transferred data frames. The above-mentioned first and second pre-determined numbers can be e.g. 800 and 200, respectively, in which case the portion of the data traffic flow used for the second preliminary identification of the application comprises data frames 201-1000 in the temporal order of transmission. The phase 103 comprises identifying the application at least partly on the basis of the first preliminary identification made in the classification phase 101 and the second preliminary identification made in the classification phase 102. In addition to the results of the first and second preliminary identifications, it is possible to use parameters related to algorithms used in the classification phases 101 and 102. These parameters may include, depending on the algorithms being used, for example an indicator of reliability of the first preliminary identification and an indicator of reliability of the second preliminary identification.
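The selection of data frames 201-1000 for the classification phase 102 can be illustrated with a minimal Python sketch. The function name `steady_state_portion` and its parameters `skip` and `count` are hypothetical names chosen for this illustration, and the choice to return nothing until the whole portion has been observed is an assumption, not something stated in this document.

```python
def steady_state_portion(frames, skip=200, count=800):
    """Return the `count` frames that follow the first `skip` frames.

    Assumption: None is returned while the flow is still too short
    to supply the complete portion.
    """
    if len(frames) < skip + count:
        return None
    return frames[skip:skip + count]


# Frame numbers 1..1200 in the temporal order of transmission.
frames = list(range(1, 1201))
portion = steady_state_portion(frames)
print(portion[0], portion[-1])  # first and last frame number of the portion
```

With 200 earlier frames skipped and 800 frames taken, the portion spans frame numbers 201 to 1000.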

In a method according to an embodiment of the invention, the first preliminary identification of the application is made, in the classification phase 101, on the basis of at least one of the following properties of the one or more first data frames transferred at the beginning of the data traffic flow: payload size, header size, uplink/downlink-transfer direction, a port number. The properties of the first one or more data frames that are selected to be used in the first preliminary identification of the application constitute a feature vector of the data traffic flow for the first preliminary identification. For example, if the number of the first data frames is N, the feature vector of the data traffic flow can be for example:

- Feature 1: The payload size and uplink/downlink direction of a first transferred data frame,

- Feature 2: The payload size and uplink/downlink direction of a second transferred data frame,

- Feature 3: The payload size and uplink/downlink direction of a third transferred data frame,

- Feature N: The payload size and uplink/downlink direction of an Nth transferred data frame.
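A feature vector of this kind could be built, for example, as in the following Python sketch. The `Frame` type and the function `first_feature_vector` are hypothetical. Encoding the transfer direction as the sign of the payload size is an assumption made here to keep each feature a single number; this document does not prescribe any particular encoding.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    payload_size: int  # payload bytes carried by the frame
    uplink: bool       # True for uplink, False for downlink


def first_feature_vector(frames, n):
    """Build an n-element vector from the first n frames of a flow.

    Assumed encoding: positive payload size for uplink, negative for
    downlink. A zero-payload frame loses its direction in this encoding.
    Returns None when fewer than n frames have been seen.
    """
    if len(frames) < n:
        return None
    return [f.payload_size if f.uplink else -f.payload_size for f in frames[:n]]


flow = [Frame(120, True), Frame(1460, False), Frame(0, True), Frame(512, False)]
vector = first_feature_vector(flow, 3)
```

If direction and size need to stay independent, two features per frame would be an equally valid layout.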

The first preliminary identification of the application can be made, for example, using the feature vector and classification data obtained with the K-means clustering algorithm, the Gaussian Mixture Model, the spectral clustering, the AutoClass clustering algorithm, or the density based spatial clustering of applications with noise (DBSCAN). It is also possible to use two or more of the aforementioned algorithms and to use a suitable logic, e.g. the voting principle, for determining the result of the first preliminary identification on the basis of results obtained with different classification data based on different algorithms. An embodiment of the invention in which the K-means clustering algorithm is used will be described in more detail in later parts of this document. More detailed information about the Gaussian Mixture Model and the spectral clustering can be found e.g. in Bernaille, L., Teixeira, R., and Salamatian, K. 2006. Early application identification. In Proceedings of the 2006 ACM CoNEXT Conference (Lisboa, Portugal, December 04-07, 2006). CoNEXT '06. ACM, New York, NY, 1-12. More detailed information about the AutoClass clustering algorithm and about the density based spatial clustering of applications with noise (DBSCAN) can be found e.g. in Erman J., Arlitt M., and Mahanti A. (2006) Traffic Classification using Clustering Algorithms. In: Proceedings of the 2006 SIGCOMM workshop on mining network data. New York, NY, USA: ACM Press, p. 281-286.

In a method according to an embodiment of the invention, the second preliminary identification of the application is made, in the classification phase 102, on the basis of at least one of the following statistical properties related to the portion of the data traffic flow transferred after the transferring of the first data frames:

- Average data frame size,

- Minimum data frame size,

- Maximum data frame size,

- Standard deviation of data frame sizes,

- Number of data frame size variations,

- Total payload size, i.e. sum of payload sizes of all data frames used for the second preliminary identification of the application,

- Total header size, i.e. sum of header sizes of all data frames used for the second preliminary identification of the application,

- Total payload size to uplink, i.e. sum of payload sizes of all data frames to uplink used for the second preliminary identification of the application,

- Total payload size to downlink, i.e. sum of payload sizes of all data frames to downlink used for the second preliminary identification of the application,

- Total header size to uplink, i.e. sum of header sizes of all data frames to uplink used for the second preliminary identification of the application,

- Total header size to downlink, i.e. sum of header sizes of all data frames to downlink used for the second preliminary identification of the application,

- Number of data frames containing payload to uplink,

- Number of data frames containing payload to downlink,

- Number of push data frames to uplink,

- Number of push data frames to downlink,

- Average inter-arrival time to uplink,

- Average inter-arrival time to downlink,

- Minimum inter-arrival time to uplink,

- Minimum inter-arrival time to downlink,

- Maximum inter-arrival time to uplink,

- Maximum inter-arrival time to downlink,

- Standard deviation of inter-arrival times to uplink,

- Standard deviation of inter-arrival times to downlink.

The statistical properties that are selected to be used in the second preliminary identification of the application constitute a feature vector of the data traffic flow for the second preliminary identification.
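A few of the listed statistical properties can be computed as in the following Python sketch. The `Frame` type, the function names and the dictionary keys are hypothetical. The use of the population standard deviation and the convention of reporting 0.0 for an average over no samples are assumptions made for this illustration only.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Frame:
    timestamp: float   # arrival time in seconds
    size: int          # total data frame size in bytes
    payload_size: int  # payload bytes
    uplink: bool       # True for uplink, False for downlink


def inter_arrival_times(frames):
    """Differences between consecutive arrival times within one direction."""
    times = [f.timestamp for f in frames]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def mean_or_zero(values):
    """Mean of the values, or 0.0 when there are none (assumed convention)."""
    return statistics.fmean(values) if values else 0.0


def second_feature_vector(frames):
    """Compute a subset of the listed statistics for a portion of a flow."""
    if not frames:
        raise ValueError("the portion must contain at least one frame")
    sizes = [f.size for f in frames]
    up = [f for f in frames if f.uplink]
    down = [f for f in frames if not f.uplink]
    return {
        "avg_size": statistics.fmean(sizes),
        "min_size": min(sizes),
        "max_size": max(sizes),
        "std_size": statistics.pstdev(sizes),  # population std (assumption)
        "total_payload_up": sum(f.payload_size for f in up),
        "total_payload_down": sum(f.payload_size for f in down),
        "frames_with_payload_up": sum(1 for f in up if f.payload_size > 0),
        "frames_with_payload_down": sum(1 for f in down if f.payload_size > 0),
        "avg_iat_up": mean_or_zero(inter_arrival_times(up)),
        "avg_iat_down": mean_or_zero(inter_arrival_times(down)),
    }


portion = [
    Frame(0.0, 100, 60, True),
    Frame(0.25, 300, 260, False),
    Frame(0.5, 100, 60, True),
    Frame(1.0, 300, 260, False),
]
features = second_feature_vector(portion)
```

The remaining properties of the list (header totals, push frame counts, minimum, maximum and standard deviation of inter-arrival times) would follow the same pattern.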

The second preliminary identification of the application can be made, for example, using the feature vector and classification data obtained with the K-means clustering algorithm, the Gaussian Mixture Model, the spectral clustering, the AutoClass clustering algorithm, or the density based spatial clustering of applications with noise (DBSCAN). It is also possible to use two or more of the aforementioned algorithms and to use a suitable logic, e.g. the voting principle, for determining the result of the second preliminary identification on the basis of results obtained with different classification data based on different algorithms. An embodiment of the invention in which the K-means clustering algorithm is used will be described in more detail in later parts of this document. It should be noted that it is not necessary to use the same algorithm for both the first preliminary identification in the classification phase 101 and the second preliminary identification in the classification phase 102; the algorithm used for each of the classification phases 101 and 102 can be selected according to the needs and requirements of the classification phase, 101 or 102, under consideration.
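The voting principle mentioned above could take, for example, the following form in Python. The function `vote`, the `UNKNOWN` marker and the strict-majority rule are assumptions made for this sketch; this document does not specify how the votes are counted or how ties are handled.

```python
from collections import Counter

UNKNOWN = "unknown"  # hypothetical marker for an unidentified application


def vote(proposals):
    """Return the application proposed by a strict majority of algorithms.

    Assumption: without a strict majority the result is UNKNOWN.
    """
    if not proposals:
        return UNKNOWN
    app, count = Counter(proposals).most_common(1)[0]
    return app if count > len(proposals) / 2 else UNKNOWN


# Results proposed by three different clustering algorithms for one flow.
result = vote(["http", "http", "ssh"])
```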

In a method according to an embodiment of the invention, accuracy indicators are calculated for the first preliminary identification of the application and for the second preliminary identification of the application. The application that has a better accuracy is selected in the phase 103 from among the one or two applications proposed by the first and second preliminary identifications. The selected application represents the identified application i.e. the final identification of the application. The accuracy indicators are preferably parameters related to algorithms used in the classification phases 101 and 102.

Figure 2 shows a flow chart of a method according to an embodiment of the invention for identifying an application generating a data traffic flow. The K-means clustering algorithm is used for obtaining first classification data for the first preliminary identification of the application and for obtaining second classification data for the second preliminary identification of the application. The first classification data includes first cluster descriptions and first cluster compositions that are used in the first preliminary identification of the application, and the second classification data includes second cluster descriptions and second cluster compositions that are used in the second preliminary identification of the application. Details of the K-means clustering algorithm can be found, for example, from the book "Clustering Algorithms", J. A. Hartigan (1975), Wiley. The first preliminary identification of the application is made in phases 201 and 211, and the second preliminary identification of the application is made in phases 202 and 212.

The phase 201 comprises selecting a first cluster of one or more applications on the basis of a first feature vector based on properties of data frames transferred at the beginning of the data traffic flow. The algorithm determines whether the feature vector corresponds to any cluster and, if it does, identifies that cluster. Every cluster has its own density measure, which is the standard deviation of the distances from the applications within the cluster to the centroid of the cluster. The density measure together with a pre-determined threshold value is used for determining whether the feature vector lies within the coverage area of a certain cluster. When assigning the feature vector to a cluster, the selected cluster is not always the closest one. It may happen, for example, that the feature vector is closer to the centroid of a cluster A than to the centroid of a cluster B, but the feature vector is within the coverage area of the cluster B and not within the coverage area of the cluster A. In this case, the cluster B is a better selection for the outcome of the phase 201 than the cluster A.
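The cluster selection with coverage areas can be sketched in Python as follows. The `Cluster` type and the function `select_cluster` are hypothetical names, and the Euclidean distance is an assumed metric. The coverage test, distance less than the density measure multiplied by the threshold value, follows the description of figure 4a later in this document.

```python
import math
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    centroid: list   # coordinates of the cluster centroid
    density: float   # std dev of member distances to the centroid


def select_cluster(vector, clusters, threshold):
    """Return the closest cluster whose coverage area contains the vector.

    Returns None when the vector lies outside every coverage area.
    """
    best, best_dist = None, math.inf
    for cluster in clusters:
        dist = math.dist(vector, cluster.centroid)  # Euclidean (assumption)
        if dist < threshold * cluster.density and dist < best_dist:
            best, best_dist = cluster, dist
    return best


a = Cluster("A", [0.0, 0.0], density=1.0)   # tight cluster
b = Cluster("B", [10.0, 0.0], density=4.0)  # widely spread cluster
chosen = select_cluster([4.0, 0.0], [a, b], threshold=2.0)
```

In this example the vector is at distance 4 from the centroid of A and 6 from the centroid of B, but only B's coverage area (radius 8) contains it, so B is selected although A is closer.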

The phase 211 comprises selecting a first application candidate from the selected first cluster of one or more applications. It is also possible that the result of the first preliminary identification of the application is that the first application candidate is unknown. The phase 202 comprises selecting a second cluster of one or more applications on the basis of a second feature vector based on statistical properties of a portion of the data traffic flow that is transferred later than the data frames used for the first preliminary identification of the application. The phase 212 comprises selecting a second application candidate from the selected second cluster of one or more applications. It is also possible that the result of the second preliminary identification of the application is that the second application candidate is unknown.

A phase 203 comprises making a final decision on the application to be identified at least partly on the basis of the first application candidate and the second application candidate. If the first and second application candidates are the same, the final decision on the application to be identified is preferably the application proposed by both the first and second preliminary identifications of application. Exemplifying alternatives for making the final decision in cases where the first and second preliminary identifications of application propose different application candidates are described below.

In a method according to an embodiment of the invention, accuracy indicators are calculated for the first preliminary identification of the application and for the second preliminary identification of the application. The application candidate that has a better accuracy is selected in the phase 203 from among the first and second application candidates proposed by the first and second preliminary identifications, respectively. Each accuracy indicator is calculated as a proportional distance DCC/DDM, wherein the DCC is a distance between a feature vector of the data traffic flow and a centroid of a selected cluster and the DDM is the standard deviation of distances from applications within the selected cluster to the centroid of the selected cluster. The feature vector of the data traffic flow is either the first feature vector used in the first preliminary identification or the second feature vector used in the second preliminary identification.
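The proportional distance DCC/DDM can be computed as in the following Python sketch. The function names are hypothetical and the Euclidean distance is an assumed metric. Treating the smaller proportional distance as the better accuracy is also an assumption of this sketch; the text above does not state the direction of the comparison explicitly.

```python
import math


def proportional_distance(vector, centroid, density):
    """DCC / DDM: distance to the centroid divided by the density measure.

    `density` is the standard deviation of member distances to the
    centroid and must be non-zero.
    """
    return math.dist(vector, centroid) / density


def pick_candidate(first, second):
    """Pick between two (application, proportional_distance) pairs.

    Assumption: the smaller proportional distance indicates the better
    accuracy. On a tie the first candidate is kept.
    """
    return min(first, second, key=lambda candidate: candidate[1])[0]


first = ("ssh", proportional_distance([3.0, 4.0], [0.0, 0.0], 2.0))   # 5 / 2
second = ("http", proportional_distance([1.0, 0.0], [0.0, 0.0], 1.0))  # 1 / 1
chosen = pick_candidate(first, second)
```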

In a method according to an embodiment of the invention, an occurrence probability of the first application candidate is used as an accuracy indicator for the first preliminary identification and an occurrence probability of the second application candidate is used as an accuracy indicator for the second preliminary identification. An occurrence probability of an application is a probability of occurrence of the application within all applications of a corresponding cluster. As a purely exemplifying case we can assume that applications A, B, and C constitute a certain cluster X of applications, and, after knowing that the cluster to be selected is the cluster X, the application to be identified is the application A with a probability p, the application B with a probability q, and the application C with the probability 1 - p - q. In this case, the occurrence probabilities of the applications A, B, and C are p, q, and 1 - p - q, respectively. Estimates for the occurrence probabilities can be determined for example on the basis of usage statistics related to the applications under consideration.
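One way to estimate occurrence probabilities is sketched below in Python. Estimating them as relative frequencies of labelled flows within a cluster is an assumption made for this illustration; as noted above, usage statistics are named as one possible source. The function names are hypothetical.

```python
from collections import Counter


def occurrence_probabilities(member_labels):
    """Relative frequency of each application among the cluster members."""
    counts = Counter(member_labels)
    total = sum(counts.values())
    return {app: n / total for app, n in counts.items()}


def most_probable(member_labels):
    """Return the most frequent application and its occurrence probability."""
    probabilities = occurrence_probabilities(member_labels)
    app = max(probabilities, key=probabilities.get)
    return app, probabilities[app]


# Cluster X with applications A, B and C (p = 0.6, q = 0.3, 1 - p - q = 0.1).
cluster_x = ["A"] * 6 + ["B"] * 3 + ["C"]
candidate, probability = most_probable(cluster_x)
```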

It is possible that one or both of the first and second preliminary identifications produce a result that the application is unknown, i.e. either one or both of the first and second application candidates can indicate that the application is unknown. If both the first and second application candidates indicate that the application is unknown, the final decision on the application to be identified is preferably such that the application is unknown. Exemplifying alternatives for making the final decision in cases where only one of the first preliminary identification and the second preliminary identification proposes a known application are described below. In a method according to an embodiment of the invention, the final decision on the application to be identified is the application proposed by the first preliminary identification in a case in which only the second preliminary identification indicates that the application is unknown. Correspondingly, the final decision on the application to be identified is the application proposed by the second preliminary identification in a case in which only the first preliminary identification indicates that the application is unknown.

In a method according to an embodiment of the invention, the final decision on the application to be identified is the application proposed by the first preliminary identification if:

- the second preliminary identification indicates that the application is unknown, and

- the closest cluster found in the second preliminary identification process contains the said application proposed by the first preliminary identification, the closest cluster being the cluster whose centroid is closest to the feature vector of the data traffic flow in the second preliminary identification process.

The final decision is that the application is unknown if the said closest cluster does not contain the said application proposed by the first preliminary identification.

Correspondingly, the final decision on the application to be identified is the application proposed by the second preliminary identification if:

- the first preliminary identification indicates that the application is unknown, and

- the closest cluster found in the first preliminary identification process contains the said application proposed by the second preliminary identification, the closest cluster being the cluster whose centroid is closest to the feature vector of the data traffic flow in the first preliminary identification process.

The final decision is that the application is unknown if the said closest cluster does not contain the said application proposed by the second preliminary identification.
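The decision rules for unknown candidates can be summarised in a Python sketch. The function `final_decision`, the `UNKNOWN` marker and all parameter names are hypothetical. The `resolve` callback stands in for whichever accuracy-based rule is used when both candidates are known but differ, and is an assumption of this sketch.

```python
UNKNOWN = None  # hypothetical marker for "application is unknown"


def final_decision(first, second, closest_first, closest_second, resolve):
    """Combine two preliminary identifications into a final decision.

    closest_first / closest_second: applications contained in the cluster
    whose centroid is closest to the feature vector in each phase.
    resolve: callable used when both candidates are known but differ.
    """
    if first is UNKNOWN and second is UNKNOWN:
        return UNKNOWN
    if second is UNKNOWN:
        # Accept the first candidate only if the second phase's closest
        # cluster also contains it.
        return first if first in closest_second else UNKNOWN
    if first is UNKNOWN:
        return second if second in closest_first else UNKNOWN
    if first == second:
        return first
    return resolve(first, second)


decision = final_decision(
    first="ssh",
    second=UNKNOWN,
    closest_first={"ssh", "ftp"},
    closest_second={"ssh", "http"},
    resolve=lambda a, b: a,
)
```

The simpler embodiment described earlier, in which the known candidate is always accepted, corresponds to dropping the two membership checks.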

Figure 3 shows a flow chart of a method according to an embodiment of the invention for identifying an application generating a data traffic flow. Phases 301, 311, 302, 312, and 303 are similar to the phases 201, 211, 202, 212, and 203 shown in figure 2, respectively. The method comprises training phases 304 and 305. In the training phase 304, the K-means clustering algorithm and first pre-determined training data flows are used for forming the first classification data for the purpose of the first preliminary identification of the application. In the training phase 305, the K-means clustering algorithm and second pre-determined training data flows are used for forming the second classification data for the purpose of the second preliminary identification of the application. The first classification data produced in the training phase 304 includes first cluster descriptions and first cluster compositions that are used in the first preliminary identification of the application, and the second classification data produced in the training phase 305 includes second cluster descriptions and second cluster compositions that are used in the second preliminary identification of the application. The cluster descriptions may include for example information defining the centroids and the density measures of the clusters.

The training phases 304 and 305 are preferably implemented using an offline trainer. The trainer takes the first and second training data flows as input and uses those training data flows to capture the characteristic patterns of the desired application types. The trainer divides the training data flows into clusters using the K-means clustering algorithm. After the clustering, the trainer outputs the cluster descriptors and cluster compositions for the purposes of the first and second preliminary identification of the application. The offline training can be done only once, but it is also possible to update the cluster descriptors and cluster compositions at certain intervals. In the long run, it may be good to update them in order to keep up with possible changes in the behaviour of applications.

The K-means clustering algorithm that can be used in the training phases 304 and 305 can be described with the aid of the following steps:

1. calculating the distances between feature vectors and cluster centroids, each feature vector corresponding to a certain training data flow,

2. assigning each feature vector to a cluster the centroid of which is closest to that feature vector,

3. calculating new cluster centroids based on the assigned feature vectors, and

4. going back to step 1 and continuing until the cluster centroids do not substantially move.
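The four steps above can be sketched in plain Python as follows. The function `k_means`, the `tolerance` and `max_iterations` parameters, the Euclidean distance, and the choice to leave an empty cluster's centroid unchanged are assumptions of this sketch; the initial centroids are supplied by the caller.

```python
import math


def k_means(vectors, centroids, tolerance=1e-9, max_iterations=100):
    """Iterate assignment and centroid update until the centroids settle."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(max_iterations):
        # Steps 1 and 2: assign each vector to the closest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(v, centroids[k]))
            for v in vectors
        ]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                new_centroids.append(
                    [sum(column) / len(members) for column in zip(*members)]
                )
            else:
                new_centroids.append(old)  # empty cluster keeps its centroid
        # Step 4: stop once no centroid moves more than the tolerance.
        moved = max(math.dist(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= tolerance:
            break
    return centroids, assignment


training_vectors = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
final_centroids, labels = k_means(training_vectors, [[0.0, 0.0], [10.0, 0.0]])
```

The result depends on the initial centroids, so a practical trainer would typically try several initialisations.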

After the required iterations, clusters have formed. All feature vectors used in the training contain the ground truth about the applications generating the training data flows. Therefore, the trainer knows which applications are related to which cluster. The trainer also calculates the distributions of the feature vectors inside each cluster. The distribution describes whether a cluster is very tight or whether it is spread far and wide. This information can be used when recognizing unknown data traffic flows. Without this property, all data traffic flows, including those corresponding to no training data flow, would be related to some application at the first and second preliminary identifications. Consequently, the trainer outputs the final cluster centroids, the distribution of the applications for each cluster and the standard deviation of the distances to the cluster centroid for each cluster.
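The three trainer outputs for one cluster can be illustrated with a short Python sketch. The function `describe_cluster` and the dictionary keys are hypothetical, and the Euclidean distance and the population standard deviation are assumptions made for this illustration.

```python
import math
import statistics
from collections import Counter


def describe_cluster(vectors, labels, centroid):
    """Summarise one cluster: centroid, application counts, density measure."""
    distances = [math.dist(v, centroid) for v in vectors]
    return {
        "centroid": centroid,
        "composition": dict(Counter(labels)),      # applications per cluster
        "density": statistics.pstdev(distances),   # population std (assumption)
    }


members = [[1.0, 0.0], [-1.0, 0.0], [0.0, 3.0], [0.0, -3.0]]
ground_truth = ["http", "http", "http", "ssh"]
descriptor = describe_cluster(members, ground_truth, centroid=[0.0, 0.0])
```

Here the member distances are 1, 1, 3 and 3, so the density measure is 1.0 and the composition is three http flows and one ssh flow.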

Figures 4a and 4b show a flow chart of making a preliminary identification of an application in a method according to an embodiment of the invention for identifying the application. The process depicted in figures 4a and 4b can be used both for the first preliminary identification of the application, the phases 201 and 211 in figure 2 and the phases 301 and 311 in figure 3, and for the second preliminary identification of the application, the phases 202 and 212 in figure 2 and the phases 302 and 312 in figure 3.

Figure 4a depicts an exemplifying cluster assignment process that may correspond, for example, to the phase 201 and/or the phase 202 shown in figure 2, as well as the phase 301 and/or the phase 302 shown in figure 3. A phase 421 of the cluster assignment process comprises initialisation of variables i, is_near, min_dist_in, and min_dist. A phase 422 comprises calculation of a distance D(i) between a feature vector of a data traffic flow and the centroid of the cluster i. A decision phase 423 comprises checking whether the feature vector belongs to the coverage area of the cluster i, i.e. checking whether the distance D(i) is less than the standard deviation of the distances in the cluster i multiplied by a predetermined threshold value T. A decision phase 424 comprises checking whether the distance D(i) is smaller than the so far smallest distance over clusters whose coverage areas comprise the feature vector, i.e. it is checked whether D(i) < min_dist_in. A phase 425 comprises setting the cluster i as the so far closest cluster whose coverage area comprises the feature vector, i.e. the variable cluster_id_in is set to i, setting the variable is_near to '1' in order to indicate that the feature vector belongs to a coverage area of at least one cluster, and setting the so far smallest distance over the clusters whose coverage areas comprise the feature vector to D(i), i.e. the variable min_dist_in is set to D(i). A decision phase 426 comprises checking whether the distance D(i) is smaller than the so far smallest distance over all clusters, i.e. it is checked whether D(i) < min_dist. A phase 427 comprises setting the cluster i as the so far closest cluster, i.e. the variable cluster_id is set to i, and setting the so far smallest distance over all clusters to D(i), i.e. the variable min_dist is set to D(i). A decision phase 430 comprises checking whether there are any clusters left to be inspected. A phase 431 comprises shifting to the next cluster to be inspected, i.e. the variable i is incremented by one, and moving back to the phase 422. A decision phase 428 comprises checking whether the feature vector belongs to a coverage area of any cluster, i.e. checking whether the variable is_near is one or still zero. If the feature vector belongs to the coverage area of at least one cluster, i.e. is_near = 1, the selected cluster is indicated in a phase 429 by the variable cluster_id_in and the distance from the feature vector to the centroid of the selected cluster is indicated by the variable min_dist_in. If the feature vector does not belong to a coverage area of any cluster, i.e. is_near = 0, it is indicated in a phase 432 that the application that corresponds to the feature vector is unknown, the variable cluster_id indicates the cluster the centroid of which is closest to the feature vector, and the variable min_dist indicates the distance from the feature vector to the centroid of this closest cluster. In the case in which the feature vector belongs to the coverage area of at least one cluster, i.e. is_near = 1, the above-described cluster assignment process is continued by an application labelling process for providing an indication of an application corresponding to the feature vector or an indication that the application is unknown.
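The cluster assignment process of figure 4a can be sketched, for illustration only, as the following Python function. The variable names follow the description; the function name assign_cluster, the argument names, the returned tuple, and the assumption that the check of phase 426 is made for every cluster regardless of the outcome of phase 423 are choices made for this sketch.

```python
import math


def assign_cluster(feature, centroids, dist_stds, threshold):
    """Sketch of phases 421-432; returns (is_near, cluster index, distance)."""
    # Phase 421: initialisation.
    is_near = False
    min_dist_in = math.inf
    min_dist = math.inf
    cluster_id_in = None
    cluster_id = None
    for i, centroid in enumerate(centroids):      # phases 430 and 431: loop over clusters
        d = math.dist(feature, centroid)          # phase 422: distance D(i)
        if d < dist_stds[i] * threshold:          # phase 423: inside coverage area of cluster i?
            if d < min_dist_in:                   # phase 424
                cluster_id_in = i                 # phase 425
                is_near = True
                min_dist_in = d
        if d < min_dist:                          # phase 426
            cluster_id = i                        # phase 427
            min_dist = d
    if is_near:                                   # phase 428
        return True, cluster_id_in, min_dist_in   # phase 429: selected cluster
    return False, cluster_id, min_dist            # phase 432: application unknown
```

With centroids at [0, 0] and [10, 0], standard deviations 1.0 and 5.0, and T = 2, a feature vector [3, 0] lies outside the tight first cluster but inside the coverage area of the wider second cluster, so the second cluster is selected even though the first centroid is nearer.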

Figure 4b depicts an exemplifying labelling process. The labelling process depicted in figure 4b may correspond for example to the phases 211 and 212 shown in figure 2 and the phases 311 and 312 shown in figure 3. In the exemplifying labelling process depicted in figure 4b, a port number is utilised. A decision phase 433 comprises checking whether the data traffic flow under consideration uses a standard port number of a known application. A phase 434 comprises determining an application that corresponds to the standard port number. A decision phase 435 comprises checking whether the selected cluster to which the data traffic flow was assigned contains any application corresponding to that standard port number. A phase 436 comprises setting the determined application to be the outcome of the application labelling process, i.e. the outcome of the first or second preliminary identification of application. If the selected cluster does not contain any application corresponding to that standard port number, the data flow will be labelled with the dominant application of the selected cluster. A phase 437 comprises determining the dominant application among all applications of the selected cluster. If the data traffic flow is using a non-standard port number, it will be labelled according to the dominant application among those applications that use non-standard destination port numbers and belong to the selected cluster, or the data traffic flow will be labelled as unknown if the selected cluster does not contain any applications that use non-standard destination port numbers. A decision phase 438 comprises checking whether the selected cluster contains any application(s) that utilise(s) non-standard port numbers. A phase 439 comprises determining a dominant, e.g. most probable, application that uses a non-standard port number among all applications of the selected cluster. A phase 440 comprises setting the application to be unknown, i.e. the outcome of the first or second preliminary identification of application is that the application is unknown.

Figure 5 shows a schematic illustration of a device according to an embodiment of the invention for identifying an application generating a data traffic flow. The device can be for example a part of a network element that can be either an operator controlled network element, a home or office sited network element, or a user terminal device. The device comprises a processing system 501 arranged to:

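- make a first preliminary identification of the application on the basis of properties of one or more first data frames of the data traffic flow, the one or more first data frames being transferred at the beginning of the data traffic flow,

- make a second preliminary identification of the application on the basis of statistical properties associated with a portion of the data traffic flow transferred after the transferring of the first data frames, and

- identify the application at least partly on the basis of the first preliminary identification and the second preliminary identification.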

The processing system 501 may comprise one or more processor units. Each processor unit can be a programmable processor, an application specific circuit, or a field programmable circuit. The device may further comprise a memory unit 502 and a data interface 503 for communicating with external systems.
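As one example of logic that a programmable processor of the processing system might execute, the labelling process of figure 4b (phases 433-440) can be sketched in Python as follows. The function name label_flow, the mapping standard_ports, and the two counters describing the composition of the selected cluster are hypothetical data structures introduced for this sketch only.

```python
from collections import Counter

UNKNOWN = "unknown"


def label_flow(dst_port, cluster_apps, cluster_nonstd_apps, standard_ports):
    """Sketch of phases 433-440 for a flow already assigned to a selected cluster.

    cluster_apps: Counter of applications among the training flows of the cluster.
    cluster_nonstd_apps: Counter of those applications that used non-standard ports.
    standard_ports: mapping from standard port number to application name.
    """
    app = standard_ports.get(dst_port)            # phases 433 and 434
    if app is not None:
        if cluster_apps.get(app, 0) > 0:          # phase 435
            return app                            # phase 436
        # Phase 437: dominant application of the selected cluster.
        # Assumption: an empty cluster composition yields "unknown".
        return cluster_apps.most_common(1)[0][0] if cluster_apps else UNKNOWN
    if cluster_nonstd_apps:                       # phase 438
        return cluster_nonstd_apps.most_common(1)[0][0]   # phase 439
    return UNKNOWN                                # phase 440
```

In this sketch the dominant application is taken to be the one with the highest count in the cluster, which is one possible reading of "dominant, e.g. most probable".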

In a device according to an embodiment of the invention, the processing system 501 is arranged to use data frames transferred during a negotiation phase related to establishing of the data traffic flow as the above-mentioned one or more first data frames.

In a device according to an embodiment of the invention, the processing system 501 is arranged to make the first preliminary identification of the application on the basis of at least one of the following properties of the above-mentioned one or more first data frames: payload size, header size, uplink/downlink-transfer direction, a port number.

In a device according to an embodiment of the invention, the processing system 501 is arranged to make the first preliminary identification of the application with the aid of first classification data obtained with the K-means clustering algorithm.

In a device according to an embodiment of the invention, the processing system 501 is arranged to make the first preliminary identification of the application with the aid of first classification data obtained with one of the following: the Gaussian Mixture Model, the spectral clustering, the AutoClass clustering algorithm, the density based spatial clustering of applications with noise (DBSCAN).

In a device according to an embodiment of the invention, the portion of the data traffic flow used for the second preliminary identification of the application comprises a first pre-determined number of data frames transferred after a second predetermined number of earlier transferred data frames.

In a device according to an embodiment of the invention, the processing system 501 is arranged to make the second preliminary identification of the application on the basis of at least one of the following statistical properties related to the portion of the data traffic flow transferred after the first data frames: average frame size, minimum frame size, maximum frame size, standard deviation of frame size, average inter-arrival time, standard deviation of the inter-arrival time.
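For illustration, the statistical properties listed above could be computed over a window of frames as sketched below. The function name flow_statistics, the representation of a frame as a (timestamp, size) pair, the parameter names skip and count, and the use of the population standard deviation are assumptions of this sketch. The window must contain at least two frames so that an inter-arrival time exists.

```python
import statistics


def flow_statistics(frames, skip, count):
    """frames: list of (timestamp_seconds, size_bytes) pairs in arrival order.

    Uses `count` frames transferred after the first `skip` frames.
    """
    window = frames[skip:skip + count]
    sizes = [size for _, size in window]
    # Inter-arrival times between consecutive frames of the window.
    gaps = [later[0] - earlier[0] for earlier, later in zip(window, window[1:])]
    return {
        "avg_size": statistics.fmean(sizes),
        "min_size": min(sizes),
        "max_size": max(sizes),
        "std_size": statistics.pstdev(sizes),
        "avg_iat": statistics.fmean(gaps),
        "std_iat": statistics.pstdev(gaps),
    }
```

The resulting values, taken in a fixed order, would form the feature vector used for the second preliminary identification.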

In a device according to an embodiment of the invention, the processing system 501 is arranged to make the second preliminary identification of the application with the aid of second classification data obtained with the K-means clustering algorithm.

In a device according to an embodiment of the invention, the processing system 501 is arranged to make the second preliminary identification of the application with the aid of second classification data obtained with one of the following: the Gaussian Mixture Model, the spectral clustering, the AutoClass clustering algorithm, the density based spatial clustering of applications with noise (DBSCAN).

In a device according to an embodiment of the invention, the processing system 501 is arranged to calculate accuracy indicators for the first preliminary identification of the application and for the second preliminary identification of the application, and to select the application that has a better accuracy from among the one or two applications according to the first and second preliminary identifications, the selected application representing the identified application.

In a device according to an embodiment of the invention, the processing system 501 is arranged to use first classification data obtained with the K-means clustering algorithm for the first preliminary identification of the application and second classification data obtained with the K-means clustering algorithm for the second preliminary identification of the application, and to calculate each accuracy indicator as DCC/DDM. The DCC is a distance between a feature vector of the data traffic flow and a centroid of a selected cluster of one or more applications, and the DDM is a standard deviation of distances from applications within the selected cluster of applications to the centroid of the selected cluster of one or more applications.
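A minimal sketch of the DCC/DDM accuracy indicator and of the subsequent selection is given below. The function names are hypothetical. The sketch assumes that DDM is greater than zero and that a smaller ratio, i.e. a feature vector lying closer to the centroid relative to the spread of the cluster, indicates a better accuracy; the passage above does not state which direction is better.

```python
import math


def accuracy_indicator(feature, centroid, dist_std):
    """DCC/DDM: distance to the selected centroid divided by the cluster's std dev."""
    dcc = math.dist(feature, centroid)
    return dcc / dist_std   # assumes dist_std > 0


def select_application(first, second):
    """Each argument is an (application, indicator) pair.

    Assumption: the smaller DCC/DDM value is treated as the better accuracy.
    """
    return min(first, second, key=lambda result: result[1])[0]
```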

In a device according to an embodiment of the invention, the processing system 501 is arranged to use first classification data obtained with the K-means clustering algorithm for the first preliminary identification of the application and second classification data obtained with the K-means clustering algorithm for the second preliminary identification of the application, and to use an occurrence probability of the application according to the first preliminary identification as an accuracy indicator for the application according to the first preliminary identification and an occurrence probability of the application according to the second preliminary identification as an accuracy indicator for the application according to the second preliminary identification. The occurrence probability of an application is the probability of occurrence of the application within all applications of a corresponding cluster of one or more applications.
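The occurrence probability used as an accuracy indicator can be sketched as follows. The function names and the representation of a cluster composition as a Counter of application counts are assumptions of this sketch, as is the rule that a tie is resolved in favour of the first preliminary identification.

```python
from collections import Counter


def occurrence_probability(app, cluster_apps):
    """Share of `app` among all applications of the corresponding cluster."""
    total = sum(cluster_apps.values())
    return cluster_apps.get(app, 0) / total if total else 0.0


def select_by_probability(first_app, first_cluster, second_app, second_cluster):
    """Pick the preliminary identification with the higher occurrence probability."""
    p_first = occurrence_probability(first_app, first_cluster)
    p_second = occurrence_probability(second_app, second_cluster)
    # Assumption: on a tie, the first preliminary identification is kept.
    return first_app if p_first >= p_second else second_app
```

For example, if the first identification yields an application that makes up 60 % of its cluster and the second yields an application that makes up 90 % of its cluster, the second application would be selected.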

In a device according to an embodiment of the invention, the processing system 501 is arranged to use the K-means clustering algorithm and first pre-determined training data flows for obtaining first classification data to be used for the first preliminary identification of the application and to use the K-means clustering algorithm and second pre-determined training data flows for obtaining second classification data to be used for the second preliminary identification of the application.

Figure 6 shows a schematic illustration of a network element 600 according to an embodiment of the invention for identifying an application generating a data traffic flow. The network element comprises a processing system 601 arranged to:

The network element comprises preferably a transmitter 604 for transmitting data traffic flows to a communication network and/or a receiver 605 for receiving data traffic flows from the communication network. Alternatively, the network element may comprise a data interface (not shown) for connecting to an external transmitter and/or to an external receiver. The network element may further comprise a memory unit 602 or a data interface (not shown) for connecting to an external memory unit.

A network element according to an embodiment of the invention comprises/is at least one of the following: an IP-router (Internet Protocol), Ethernet switch, ATM-switch (Asynchronous Transfer Mode), base station of a mobile communications network, MPLS-switch (Multiprotocol Label Switching), a WLAN-AP (Wireless Local Area Network - Access Point).

A network element according to an embodiment of the invention is a user terminal device and comprises/is at least one of the following: a mobile phone, a palmtop computer, a personal digital assistant, a personal computer, a lap-top computer.

A computer program according to an embodiment of the invention comprises a program code for controlling a programmable processor to identify an application generating a data traffic flow. The program code comprises computer executable instructions for controlling the programmable processor to:

The computer executable instructions can be e.g. subroutines and/or functions.

A computer program product according to an embodiment of the invention is stored in a computer readable medium. The computer readable medium can be e.g. a CD-ROM (Compact Disc Read Only Memory) or a RAM-device (Random Access Memory).

A computer program product according to an embodiment of the invention is carried in a signal that is receivable from a communication network.

A computer readable medium, e.g. a CD-ROM (Compact Disc Read Only Memory) or a RAM-device (Random Access Memory), according to an embodiment of the invention is encoded with a computer program according to an embodiment of the invention.

The specific examples provided in the description given above should not be construed as limiting. Therefore, the invention is not limited merely to the embodiments described above, many variants being possible.

Claims


1. A device for identifying an application generating a data traffic flow, the device comprising a processing system (501) arranged to make a first preliminary identification of the application on the basis of properties of one or more first data frames of the data traffic flow, the one or more first data frames being transferred at the beginning of the data traffic flow, characterized in that the processing system is further arranged to:

2. A device according to claim 1, wherein the one or more first data frames are data frames transferred during a negotiation phase related to establishing of the data traffic flow.

3. A device according to claim 1 or 2, wherein the processing system is arranged to make the first preliminary identification of the application on the basis of at least one of the following properties of the one or more first data frames: payload size, header size, uplink/downlink-transfer direction, a port number.

4. A device according to any of claims 1-3, wherein the processing system is arranged to make the first preliminary identification of the application with the aid of first classification data obtained with the K-means clustering algorithm.

5. A device according to any of claims 1-3, wherein the processing system is arranged to make the first preliminary identification of the application with the aid of first classification data obtained with one of the following: the Gaussian Mixture Model, the spectral clustering, the AutoClass clustering algorithm, the density based spatial clustering of applications with noise (DBSCAN).

6. A device according to any of the claims 1-5, wherein the portion of the data traffic flow used for the second preliminary identification of the application comprises a first pre-determined number of data frames transferred after a second pre-determined number of earlier transferred data frames.

7. A device according to any of the claims 1-6, wherein the processing system is arranged to make the second preliminary identification of the application on the basis of at least one of the following statistical properties related to the portion of the data traffic flow transferred after the transferring of the first data frames: average frame size, minimum frame size, maximum frame size, standard deviation of frame size, average inter-arrival time, standard deviation of inter-arrival time.

8. A device according to any of the claims 1-7, wherein the processing system is arranged to make the second preliminary identification of the application with the aid of second classification data obtained with the K-means clustering algorithm.

9. A device according to any of the claims 1-7, wherein the processing system is arranged to make the second preliminary identification of the application with the aid of second classification data obtained with one of the following: the Gaussian Mixture Model, the spectral clustering, the AutoClass clustering algorithm, the density based spatial clustering of applications with noise (DBSCAN).

10. A device according to any of the claims 1-9, wherein the processing system is arranged to calculate accuracy indicators for the first preliminary identification of the application and for the second preliminary identification of the application, and to select the application that has a better accuracy from among the one or two applications according to the first and second preliminary identifications, the selected application representing the identified application.

11. A device according to claim 10, wherein the processing system is arranged to use first classification data obtained with the K-means clustering algorithm for the first preliminary identification of the application and second classification data obtained with the K-means clustering algorithm for the second preliminary identification of the application, and to calculate each accuracy indicator as DCC/DDM, the DCC being a distance between a feature vector of the data traffic flow and a centroid of a selected cluster of one or more applications, and the DDM being a standard deviation of distances from applications within the selected cluster of one or more applications to the centroid of the selected cluster of one or more applications.

12. A device according to claim 10, wherein the processing system is arranged to use first classification data obtained with the K-means clustering algorithm for the first preliminary identification of the application and second classification data obtained with the K-means clustering algorithm for the second preliminary identification of the application, and to use an occurrence probability of the application according to the first preliminary identification as an accuracy indicator for the application according to the first preliminary identification and an occurrence probability of the application according to the second preliminary identification as an accuracy indicator for the application according to the second preliminary identification, an occurrence probability of an application being a probability of occurrence of the application within all applications of a corresponding cluster of one or more applications.

13. A device according to claim 1, wherein the processing system is arranged to use the K-means clustering algorithm and first pre-determined training data flows for obtaining first classification data for the first preliminary identification of the application and to use the K-means clustering algorithm and second pre-determined training data flows for obtaining second classification data for the second preliminary identification of the application.

14. A method for identifying an application generating a data traffic flow, the method comprising making (101, 201, 211, 301, 311) a first preliminary identification of the application on the basis of properties of one or more first data frames of the data traffic flow, the one or more first data frames being transferred at the beginning of the data traffic flow, characterized in that the method further comprises:

- making (102, 202, 212, 302, 312) a second preliminary identification of the application on the basis of statistical properties associated with a portion of the data traffic flow transferred after the transferring of the first data frames, and

- identifying (103, 203, 303) the application at least partly on the basis of the first preliminary identification and the second preliminary identification.

15. A method according to claim 14, wherein the one or more first data frames are data frames transferred during a negotiation phase related to establishing of the data traffic flow.

16. A method according to claim 14 or 15, wherein the first preliminary identification of the application is made on the basis of at least one of the following properties of the one or more first data frames: payload size, header size, uplink/downlink-transfer direction, a port number.

17. A method according to any of claims 14-16, wherein the first preliminary identification of the application is made (421-440) using first classification data obtained with the K-means clustering algorithm.

18. A method according to any of claims 14-16, wherein the first preliminary identification of the application is made using first classification data obtained with one of the following: the Gaussian Mixture Model, the spectral clustering, the AutoClass clustering algorithm, the density based spatial clustering of applications with noise (DBSCAN).

19. A method according to any of the claims 14-18, wherein the portion of the data traffic flow used for the second preliminary identification of the application comprises a first pre-determined number of data frames transferred after a second pre-determined number of earlier transferred data frames.

20. A method according to any of the claims 14-19, wherein the second preliminary identification of the application is made on the basis of at least one of the following statistical properties related to the portion of the data traffic flow transferred after the transferring of the first data frames: average frame size, minimum frame size, maximum frame size, standard deviation of frame size, average inter-arrival time, standard deviation of inter-arrival time.

21. A method according to any of the claims 14-20, wherein the second preliminary identification of the application is made (421-440) using second classification data obtained with the K-means clustering algorithm.

22. A method according to any of the claims 14-20, wherein the second preliminary identification of the application is made using second classification data obtained with one of the following: the Gaussian Mixture Model, the spectral clustering, the AutoClass clustering algorithm, the density based spatial clustering of applications with noise (DBSCAN).

23. A method according to any of the claims 14-22, wherein accuracy indicators are calculated for the first preliminary identification of the application and for the second preliminary identification of the application, and the application that has a better accuracy is selected from among the one or two applications according to the first and second preliminary identifications, the selected application representing the identified application.

24. A method according to claim 23, wherein the K-means clustering algorithm is used for obtaining first classification data for the first preliminary identification of the application and for obtaining second classification data for the second preliminary identification of the application, and each accuracy indicator is calculated as DCC/DDM, the DCC being a distance between a feature vector of the data traffic flow and a centroid of a selected cluster of one or more applications and the DDM being a standard deviation of distances from applications within the selected cluster of one or more applications to the centroid of the selected cluster of one or more applications.

25. A method according to claim 23, wherein the K-means clustering algorithm is used for obtaining first classification data for the first preliminary identification of the application and for obtaining second classification data for the second preliminary identification of the application, and an occurrence probability of the application according to the first preliminary identification is used as an accuracy indicator for the application according to the first preliminary identification and an occurrence probability of the application according to the second preliminary identification is used as an accuracy indicator for the application according to the second preliminary identification, an occurrence probability of an application being a probability of occurrence of the application within all applications of a corresponding cluster of one or more applications.

26. A method according to claim 14, wherein the method comprises using (304) the K-means clustering algorithm and first pre-determined training data flows for obtaining first classification data for the first preliminary identification of the application and using (305) the K-means clustering algorithm and second predetermined training data flows for obtaining second classification data for the second preliminary identification of the application.

27. A network element (600) arranged to receive a data traffic flow generated by an application, the network element comprising a processing system (601) arranged to make a first preliminary identification of the application on the basis of properties of one or more first data frames of the data traffic flow, the one or more first data frames being transferred at the beginning of the data traffic flow, characterized in that the processing system is further arranged to:

28. A network element according to claim 27, wherein the network element comprises at least one of the following: an IP-router (Internet Protocol), Ethernet switch, ATM-switch (Asynchronous Transfer Mode), base station of a mobile communications network, MPLS-switch (Multiprotocol Label Switching), a WLAN-AP (Wireless Local Area Network - Access Point).

29. A network element according to claim 27, wherein the network element is a user terminal device and comprises at least one of the following: a mobile phone, a palmtop computer, a personal digital assistant.

30. A computer program for identifying an application generating a data traffic flow, the computer program comprising computer executable instructions for controlling a programmable processor to make a first preliminary identification of the application on the basis of properties of one or more first data frames of the data traffic flow, the one or more first data frames being transferred at the beginning of the data traffic flow, characterized in that the computer program further comprises computer executable instructions for controlling the programmable processor to:

31. A computer readable medium, characterized in that the computer readable medium is encoded with a computer program according to claim 30.