CN108141377B - Early classification of network flows - Google Patents

Early classification of network flows

Info

Publication number
CN108141377B
CN108141377B CN201580083836.5A
Authority
CN
China
Prior art keywords
training
stream
truncated
network flow
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201580083836.5A
Other languages
Chinese (zh)
Other versions
CN108141377A (en)
Inventor
Alexander Alekseevich Serov
Valery Nikolaevich Glukhov
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN108141377A publication Critical patent/CN108141377A/en
Application granted granted Critical
Publication of CN108141377B publication Critical patent/CN108141377B/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/02Capturing of monitoring data
    • H04L43/026Capturing of monitoring data using flow identification
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a method for early classification of network flows. The method includes a training phase comprising: capturing full-length training network flows; assigning classes to the captured full-length training network flows; training a prediction model on the full-length training network flows; obtaining truncated training network flows from the full-length training network flows; applying the prediction model trained on the full-length training network flows to the truncated training network flows to obtain a plurality of training classes; comparing the training classes predicted by the prediction model on the truncated training network flows with the assigned classes; and training a correction model on the truncated training network flows by taking into account the comparison of the training classes and the assigned classes. The method also includes a prediction phase comprising: receiving the first few packets of an unclassified network flow, said unclassified network flow being the object of early classification; obtaining truncated unclassified network flows from the unclassified network flow; applying the prediction model to the truncated unclassified network flows and outputting prediction model classification results; applying the correction model to the truncated unclassified network flows by taking into account the prediction model classification results, and outputting correction model classification results; and merging the correction model classification results to make a final prediction for the unclassified network flow.

Description

Early classification of network flows
Technical Field
The present invention relates generally to the field of network flow classification. In particular, the present invention relates to machine learning techniques for early classification and identification of network traffic generated by various applications. Such classification relates to traffic management, which should be performed in real time.
Background
The prior art includes some known methods for classifying network traffic. The simplest way to perform network traffic classification relies on port numbers. This is a relatively fast classification method, since packet payloads are not inspected. However, it is only applicable when fixed ports are used, and it cannot detect masquerading traffic.
Flow-based analysis methods have demonstrated their effectiveness. A flow refers to a collection of packets that share the same transport layer protocol, the same source and destination IP addresses, and the same source and destination TCP or UDP port numbers.
Deep Packet Inspection (DPI) is another method for classifying network traffic, based on pattern matching techniques. DPI inspects packet payloads in addition to headers and is therefore a computationally more demanding method. Advantageously, it operates in a port-agnostic manner by comparing application signatures with the network flows to be classified. DPI, however, requires the maintenance of a large database of application signatures.
Other methods include behavioral analysis and statistical analysis. The former builds templates of network host behavior as features; the latter relies on machine learning classification algorithms trained on statistical features. Features may be built on packet-level statistics or flow-level statistics. Fig. 6 shows a comparison of packet-level and flow-level statistics according to the prior art. As shown in Fig. 6, for packet-level statistics, the flows belonging to a particular application are averaged for each of the first few packets of the flow. For flow-level statistics, all packets of a particular flow are averaged.
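The difference between the two statistic types compared in Fig. 6 can be illustrated with a short sketch. This is illustrative only; the packet sizes below are invented sample data, not taken from the patent.

```python
# Each flow is represented by the list of its packet sizes;
# all three flows are assumed to belong to the same application.
flows = [
    [60, 1500, 1500, 40],
    [60, 1400, 1500],
    [52, 1500, 1300, 40, 40],
]

# Packet-level statistics: average across flows at each of the
# first k packet positions of the flow.
k = 3
packet_level = [sum(f[i] for f in flows) / len(flows) for i in range(k)]

# Flow-level statistics: average across all packets of each flow.
flow_level = [sum(f) / len(f) for f in flows]

print(packet_level)  # one value per packet position
print(flow_level)    # one value per flow
```

The packet-level view yields one statistic per packet position, while the flow-level view yields one statistic per flow, which is the representation the proposed method builds on.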
The machine learning treatment of a classification problem involves a two-phase workflow. The first phase, referred to as the training phase, involves generalizing from labeled instances of features. The second phase of the workflow, referred to as the prediction phase, involves predicting unlabeled, i.e. unclassified, instances. Typically, the training phase ends with a generative or discriminative statistical model. Such a model can be built offline from packet captures gathered during network traffic analysis. Prior to the training phase, the packets should be combined into flows and the flows mapped to a feature space. In addition, a label should be assigned to each flow according to the application that generated the flow. Once the model is built on the labeled data set, it can be applied to unlabeled, i.e. unclassified, flows to infer the application of each unclassified flow during the prediction phase. The prediction phase does not require significant computational resources and can therefore be performed in real time.
As it is processed online, the packets of an unclassified flow arrive one by one, so that at any time the prediction phase need not process the whole unclassified flow, but only its first few packets, which can be called a truncated flow. The earlier the prediction phase can classify a flow using only the first few packets, the earlier a policy or rule can be applied to improve the overall performance of the network. Therefore, the earliness of the prediction, which is determined by the length of the truncated flow, is of crucial importance.
The prior art methods related to network traffic classification and its earliness are outlined next.
Document US 8095635 B2 proposes managing network traffic to increase network availability. This document relates to network traffic classification based on flow-level statistics that can be obtained using standard NetFlow records. NetFlow refers specifically to Cisco's implementation of packet trace statistics; equivalent methods include Huawei's NetStream. Some of the features constructed from the recorded data are independent of the application class, and should therefore be removed to avoid unnecessary computation. The following symmetric uncertainty measure is proposed in this document to rank the features:

SU(Ai, C) = 2 · [H(Ai) − H(Ai | C)] / [H(Ai) + H(C)]    (1)

where H denotes an entropy function, (A1, …, Am) denote the features, and C denotes the flow class, treated as a feature.
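The symmetric uncertainty of equation (1) can be sketched as follows. This is a minimal illustration on toy discrete samples; the function names and the data are assumptions, and the numerator is computed via the identity H(A) − H(A|C) = H(A) + H(C) − H(A, C).

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Shannon entropy H of a sample of discrete values."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def symmetric_uncertainty(a, c):
    """SU(A, C) = 2 * [H(A) - H(A|C)] / [H(A) + H(C)], equation (1)."""
    h_a, h_c = entropy(a), entropy(c)
    h_ac = entropy(list(zip(a, c)))  # joint entropy H(A, C)
    if h_a + h_c == 0:
        return 0.0
    return 2.0 * (h_a + h_c - h_ac) / (h_a + h_c)

# Toy example: a fully informative feature vs. a constant feature.
classes  = ['web', 'web', 'p2p', 'p2p']
feature1 = ['tcp', 'tcp', 'udp', 'udp']  # mirrors the class -> SU = 1
feature2 = ['x', 'x', 'x', 'x']          # carries no information -> SU = 0

print(symmetric_uncertainty(feature1, classes))
print(symmetric_uncertainty(feature2, classes))
```

A feature that determines the class gets SU = 1, while a class-independent feature gets SU = 0 and would be removed by the ranking described above.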
It is further proposed in this document to determine the goodness of any given subset S of features by:
Goodness(S) = [ Σ_{Ai ∈ S} SU(Ai, C) ] / sqrt( Σ_{Ai ∈ S} Σ_{Aj ∈ S} SU(Ai, Aj) )    (2)
the desired feature set may be selected by continuously adding features in descending order of symmetry uncertainty and monitoring the amount of increase in the goodness measure. The proposed method can be implemented with the NetFlow sampling feature. It is shown in this document that packet sampling does not significantly affect the accuracy of the classifier, since a more uniform large set of flows can be obtained in the sampling when the sampling rate is lower, and thus a more accurate classification can be obtained sensorially.
Disadvantageously, however, this approach does not address the problem of early classification of network flows.
Zhengzheng Xing, Jian Pei and Philip S. Yu propose an early classification on time series (ECTS) method in "Early Classification on Time Series", Knowledge and Information Systems, vol. 31, no. 1, 2012, p. 105 et seq. The method considers a time series s as a sequence of values:
s[i],1≤i≤L (3)
wherein L represents the full length of the time series.
Based on the 1-nearest neighbor (1NN) classification algorithm, the method proposed in this document relies on distance measurements between two time series according to the following equation:
dist(s, t) = sqrt( Σ_{i=1..L} (s[i] − t[i])^2 )    (4)
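The 1NN classification over the Euclidean distance (4) can be sketched as follows. The training series and labels below are toy values, not data from the cited document.

```python
from math import sqrt

def dist(s, t):
    # Euclidean distance between two equal-length series, equation (4).
    return sqrt(sum((s[i] - t[i]) ** 2 for i in range(len(s))))

def classify_1nn(train, labels, query):
    """Return the label of the training series closest to the query."""
    best = min(range(len(train)), key=lambda j: dist(train[j], query))
    return labels[best]

train = [[1, 1, 1, 1], [5, 5, 5, 5]]
labels = ['low', 'high']
print(classify_1nn(train, labels, [1, 2, 1, 2]))
print(classify_1nn(train, labels, [4, 5, 4, 5]))
```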
this document proposes to perfect the concept of Minimum Prediction Prefix (MPP). By construction, for time series t, any prefix longer than mpp (t) in 1NN classification is redundant and can be removed to achieve early classification.
However, the ECTS method described in this document is based on the 1NN classification algorithm and uses the following assumptions:
1. any time series is a sequence of (timestamp, value) pairs;
2. all time series have the same length L;
3. there is a metric value that defines the distance between the two time series; and
4. the training data set is a sufficiently large and uniform time series of samples.
However, these assumptions are difficult to satisfy when adapting the method to network flows. First, 1NN classification may not be the best classification algorithm: there are a number of other classifiers that exhibit superior performance, such as naive Bayes classifiers, Support Vector Machine (SVM) classifiers, or decision tree classifiers, among others. Second, even if the Euclidean metric (4) is redesigned to operate on packet features, which may be nominal (e.g., header flags) or numerical (e.g., payload size), it is not possible to guarantee the "uniformity" of the training data set. Third, assumptions 1 and 3 above require the storage of packet-level features, which consume more memory than flow-level features. Finally, the prefix length at which the ECTS method can reliably classify differs from one time series to another.
Disclosure of Invention
Recognizing the above disadvantages and problems, the present invention is directed to improving the prior art. In particular, it is an object of the invention to provide an improved early classification of network flows.
Unlike known classification methods, in which the optimization target is set to maximize classification accuracy, the present invention intends to improve early classification by optimizing earliness, in particular for as long as the classification accuracy remains satisfactory.
The above object of the present invention is achieved by the solution provided in the appended independent claims. Advantageous embodiments of the invention are further defined in the respective dependent claims.
According to a first aspect of the present invention, a method for early classification of a network flow is provided. The method includes a training phase and a prediction phase.
The training phase includes capturing a full-length training network stream. The training phase includes assigning a class to the captured full-length training network stream. The training phase includes training a predictive model over the full-length training network stream. The training phase includes obtaining a truncated training network stream from the full-length training network stream. The training phase includes applying the predictive model trained on the full-length training network flow to the truncated training network flow to obtain a plurality of training classes. The training phase includes comparing a training class predicted using the predictive model on the truncated training network stream to the assigned class. The training phase includes training a correction model on the truncated training network flow by considering a comparison of the training class and the assigned class.
The prediction phase includes receiving the first few packets of an unclassified network flow that is the object of an earlier classification. The prediction phase includes obtaining a truncated unclassified network flow from the unclassified network flows. The prediction phase includes applying the prediction model to the truncated unclassified network flow and outputting a prediction model classification result. The prediction phase includes applying the correction model to the truncated unclassified network flow by taking into account the prediction model classification result and outputting a correction model classification result. The prediction phase includes merging the corrected model classification results to make a final prediction of the unclassified network flow.
Thus, the proposed method solves the problem of early classification of network flows by machine learning classification techniques. Advantageously, the method balances the trade-off between accuracy and earliness of the prediction. Further, the method is based on stream-level features rather than packet-level features. Further, the method is generic enough to avoid the limitations imposed by any particular machine learning classification algorithm.
The proposed method therefore combines a network flow classification algorithm based on reference features, called the prediction model, with a correction model that captures the sensitivity of that algorithm to flow truncation. This technique facilitates earlier flow classification with greater accuracy. Once trained in an offline mode, the proposed method can classify truncated unclassified flows in real time. First, the original algorithm, i.e. the prediction model, is applied to features computed on the truncated flow; then the features used by the correction model are computed; finally, the correction model is applied to improve the prediction made by the original algorithm.
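The two-phase workflow can be sketched end to end as follows. This is a deliberately tiny illustration, not the patent's actual classifiers: the "prediction model" is a nearest-centroid rule on the average packet size, the "correction model" is a lookup table, and all flow data are invented stand-ins.

```python
from collections import Counter

def mean_size(flow):
    # Simplified flow-level feature: average packet size.
    return sum(flow) / len(flow)

# --- Training phase (offline) ---
full_flows = [[1400, 1400, 60, 60], [1380, 1420, 80, 40],
              [200, 200, 1400, 1400], [220, 180, 1380, 1420]]
assigned = ['web', 'web', 'p2p', 'p2p']

# 1. Train the prediction model on full-length flows (class centroids).
by_class = {}
for f, c in zip(full_flows, assigned):
    by_class.setdefault(c, []).append(mean_size(f))
centroids = {c: sum(v) / len(v) for c, v in by_class.items()}

def prediction_model(flow):
    return min(centroids, key=lambda c: abs(centroids[c] - mean_size(flow)))

# 2. Apply it to truncated training flows, compare with the assigned classes,
#    and train a correction model on the mismatch pattern.
K = 2  # truncation length
votes = {}
for f, c in zip(full_flows, assigned):
    training_class = prediction_model(f[:K])       # class on the truncation
    votes.setdefault(training_class, Counter())[c] += 1
correction_model = {p: cnt.most_common(1)[0][0] for p, cnt in votes.items()}

# --- Prediction phase (online) ---
first_packets = [1390, 1410]                 # first packets of an unclassified flow
pred = prediction_model(first_packets[:K])   # prediction model result
final = correction_model.get(pred, pred)     # correction model refines it
print(pred, final)
```

In this toy data the flows start with packets that are atypical of the whole flow, so the prediction model systematically misclassifies truncations; the correction model learns that pattern during training and reverses it at prediction time.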
In a first possible implementation form of the method according to the first aspect, training the predictive model over the full-length training network flow comprises: extracting a first feature set from the captured full-length training network flow to obtain a first training feature vector; and training the predictive model on the first training feature vector and the respective classes assigned to the captured full-length training network streams.
In a second possible further implementation form of the method according to the first aspect as such or the first implementation form of the first aspect, the applying the predictive model trained on the full-length training network flow to the truncated training network flow to obtain a plurality of training classes comprises: extracting a first feature set from the truncated training network stream to obtain a second training feature vector; extracting a second feature set from the truncated training network stream to obtain a third training feature vector; and applying the predictive model to the second training feature vector of the truncated training network stream to obtain the training class. Training the correction model on the truncated training network flow by taking into account the comparison of the training class and the assigned class comprises: training the correction model on the third training feature vector, the assigned class, and the training class.
In a third possible further implementation form of the method according to the first aspect as such or any of the preceding implementation forms of the first aspect, the applying the predictive model to the truncated unclassified network flow comprises: extracting a first set of features from the truncated unclassified network stream; computing a first predicted feature vector from the first set of features from the truncated unclassified network flow; and providing the calculated first prediction characteristic vector to the prediction model to obtain a corresponding prediction model classification result.
In a fourth possible further implementation form of the method according to the third implementation form of the first aspect, the applying the correction model to the truncated unclassified network flow comprises: extracting a second set of features from the truncated unclassified network stream; computing a second predicted feature vector by a second set of features from the truncated unclassified network flow; and providing the calculated second prediction feature vector and the obtained prediction model classification result to the correction model to obtain the classification of the truncated unclassified network flow.
In a fifth possible further implementation form of the method according to any of the first to fourth implementation forms of the first aspect, the first set of features extracted from the network flow comprises: a transport layer protocol of the stream; an average size of packets in the stream; a standard deviation of sizes of packets in the stream; inter-arrival times of packets in the stream; a standard deviation of inter-arrival times of packets in the stream; entropy values of application layer loads of packets in the stream; or any combination thereof.
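The first feature set listed above can be computed from a flow as sketched below. The sample sizes, timestamps, and payload are invented, and the population standard deviation is assumed.

```python
from collections import Counter
from math import log2, sqrt

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def payload_entropy(payload: bytes):
    # Byte-level Shannon entropy of an application layer payload.
    n = len(payload)
    return -sum((c / n) * log2(c / n) for c in Counter(payload).values())

sizes = [60, 1500, 1500, 40, 980]        # packet sizes in the flow
times = [0.00, 0.01, 0.03, 0.06, 0.10]   # packet arrival timestamps (seconds)
inter = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
payload = b"GET / HTTP/1.1\r\nHost: example.com\r\n"

features = {
    'protocol': 'TCP',
    'mean_size': mean(sizes),
    'std_size': std(sizes),
    'mean_iat': mean(inter),
    'std_iat': std(inter),
    'payload_entropy': payload_entropy(payload),
}
print(features)
```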
In a sixth possible further implementation form of the method according to any of the second to fifth implementation forms of the first aspect, the second set of features extracted from a network flow comprises: a transport layer protocol of the flow; a conditional probability of the prediction model classification result given that the second prediction feature vector is calculated over the first L packets of the flow; a conditional probability of the prediction model classification result given that the second prediction feature vector is calculated over the first k packets of the flow, where k < L; a relative position k/L of the k-th packet in the truncation; a relative difference between the average size of the first k packets of the truncation and the average size of all packets in the truncation; a relative difference between the inter-arrival times of the first k packets of the truncation and the inter-arrival times of all packets in the truncation; a relative difference between the entropy of the application layer payloads of the first k packets of the truncation and the entropy of the application layer payloads of all packets in the truncation; or any combination thereof.
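The relative-difference features of this second set can be sketched as follows. This is a reduced illustration covering only the relative position k/L and the mean-size difference; the helper names and numbers are assumptions.

```python
def mean(xs):
    return sum(xs) / len(xs)

def relative_difference(a, b):
    # Relative difference of a statistic on the first k packets (a)
    # with respect to the same statistic on all L packets (b).
    return 0.0 if b == 0 else (a - b) / b

sizes = [60, 1500, 1500, 40, 980, 1200]  # sizes of the L packets of a truncation
L, k = len(sizes), 3                      # k < L: an earlier, shorter prefix

correction_features = {
    'protocol': 'TCP',
    'relative_position': k / L,  # k/L
    'rel_mean_size': relative_difference(mean(sizes[:k]), mean(sizes)),
}
print(correction_features)
```

Features of this shape let the correction model judge how representative the first k packets are of the truncation as a whole.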
In a seventh possible further implementation form of the method according to the first aspect as such or any of the preceding implementation forms of the first aspect, the obtaining the truncated unclassified network flow from among the unclassified network flows comprises: obtaining a truncated unclassified network flow by discarding the last one or more packets of said first few received packets.
In an eighth possible further implementation form of the method according to the first aspect as such or any of the preceding implementation forms of the first aspect, at least one of the truncated unclassified network flows is obtained by truncating the unclassified network flow by a truncation length.
In a ninth possible further implementation form of the method according to the eighth implementation form of the first aspect, the truncation length is fixed.
In a tenth possible further implementation form of the method according to the eighth or ninth implementation form of the first aspect, the unclassified network flows are additionally truncated by a length smaller than the truncation length to obtain shorter unclassified network flows; and additionally applying the prediction phase to the shorter unclassified network flows to improve the accuracy of the classification.
According to a second aspect of the present invention, there is provided a computer program having a program code for performing the method according to the first aspect of the present invention, when the computer program runs on a computing device.
According to a third aspect of the invention, a network flow early classification device is provided. The apparatus is for performing a training phase and a prediction phase. In the training phase, the device is configured to: capturing a full-length training network flow and allocating a category to the captured full-length training network flow; training a predictive model on the full-length training network stream; acquiring a truncated training network flow from the full-length training network flow; applying the predictive model trained on the full-length training network stream to the truncated training network stream to obtain a plurality of training classes; comparing a training class predicted using the predictive model on the truncated training network stream to the assigned class; and training a correction model on the truncated training network flow by taking into account the comparison of the training classes and the assigned classes. In the prediction phase, the apparatus is to: receiving first few packets of an unclassified network flow, said unclassified network flow being an object of an early classification; obtaining a truncated unclassified network flow from said unclassified network flow by discarding the last received packet or packets; applying the prediction model to the truncated unclassified network flow and outputting a prediction model classification result; applying the correction model to the truncated unclassified network flow by taking into account the prediction model classification result and outputting a correction model classification result; and merging the corrected model classification results to make a final prediction of the unclassified network flow.
It has to be further noted that all devices, elements, units and means described in the present application may be implemented in software or hardware elements or any kind of combination thereof. All steps performed by the respective entities described in the present application and the functionalities described to be performed by the respective entities mean that the respective entities are used for performing the respective steps and functionalities. Even if in the following description of specific embodiments specific functions or steps performed by external entities are not reflected in the description of specific detailed elements of the entities performing said specific steps or functions, it should be clear to a person skilled in the art that these methods and functions can be implemented by corresponding software or hardware elements or any kind of combination thereof.
Drawings
The foregoing aspects and many of the attendant aspects of this invention will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein:
FIG. 1 is a schematic diagram of a method for early classification of network flows according to an embodiment of the invention;
FIG. 2 is a diagram illustrating a method for early classification of network flows according to another embodiment of the present invention;
FIG. 3 is a diagram illustrating a method for early classification of network flows according to another embodiment of the present invention;
FIG. 4 is a diagram illustrating a method for early classification of network flows according to another embodiment of the present invention;
FIG. 5 is a diagram illustrating a method for early classification of network flows according to another embodiment of the invention;
FIG. 6 illustrates a comparison of packet-level statistics and stream-level statistics;
FIG. 7 illustrates a sort workflow according to the prior art;
FIG. 8 illustrates a sort workflow according to an embodiment of the invention; and
FIG. 9 illustrates an improvement in advancement and accuracy according to an embodiment of the invention.
Detailed Description
Fig. 1 is a schematic diagram of a network flow early classification method according to an embodiment of the present invention.
The method comprises a training phase 101 and a prediction phase 111. The training phase 101 may be performed offline, i.e. before classification, starting from network traffic captured in the form of training network streams. This phase can be performed offline because it is not time critical. In turn, the prediction phase 111 may be performed online, i.e. classification may be done in real time as the network traffic to be classified is received. The prediction phase 111 may be performed online so that early classification can be achieved by optimizing earliness and classification accuracy.
The method shown schematically comprises the following steps during the training phase 101.
In a first step 102, a full-length training network stream is captured. The training network flow is a network flow captured during a training phase. The network flow may be a Transmission Control Protocol (TCP) or User Datagram Protocol (UDP) network flow that passes an observation point in the network for a certain time interval. The network flow may be characterized as a five-tuple consisting of:
-a source IP address;
-a source port;
-a destination IP address;
-a destination port; and
-a transport layer protocol.
The transport layer protocol is a layer 4 protocol according to the Open Systems Interconnection (OSI) model. For example, the transport layer protocol may be TCP or UDP. Each network flow may be transmitted through a pair of sockets created by the communicating application.
A network flow, particularly with respect to a packet-switched network, may correspond to a sequence of packets from a source to a destination. The source may be a network entity defined by a source IP address and a source port and the destination may be another network entity defined by a destination IP address and a destination port. Subsequently, capturing the network flow may include capturing the sequence of packets and combining the sequence of packets in the network flow.
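Combining a captured packet sequence into flows keyed by the five-tuple above can be sketched as follows. The packet records are invented sample data and the field names are assumptions.

```python
from collections import defaultdict

packets = [
    {'src': '10.0.0.1', 'sport': 40001, 'dst': '93.184.216.34',
     'dport': 443, 'proto': 'TCP', 'size': 60},
    {'src': '10.0.0.2', 'sport': 53124, 'dst': '8.8.8.8',
     'dport': 53, 'proto': 'UDP', 'size': 72},
    {'src': '10.0.0.1', 'sport': 40001, 'dst': '93.184.216.34',
     'dport': 443, 'proto': 'TCP', 'size': 1500},
]

# Group packets into flows by the five-tuple.
flows = defaultdict(list)
for p in packets:
    key = (p['src'], p['sport'], p['dst'], p['dport'], p['proto'])
    flows[key].append(p)

for key, pkts in flows.items():
    print(key, len(pkts))
```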
According to RFC 3697, "IPv6 Flow Label Specification", Rajahalme et al., IETF, 2004, a network flow may be a sequence of packets sent from a particular source to a particular unicast, anycast, or multicast destination that the source desires to label as a flow; such a network flow may consist of all packets in a specific transport connection or media stream. According to RFC 3917, "Requirements for IP Flow Information Export (IPFIX)", Quittek et al., IETF, 2004, a network flow may be a set of IP packets passing an observation point in the network during a certain time interval, where all packets belonging to a particular flow have a set of common properties.
A full-length training network flow consists of all packets passing through a pair of sockets over its lifetime. Alternatively, and in particular in practice, a full-length training network flow may not consist of all packets passing through a pair of sockets over its lifetime, but rather of the packets captured over a sufficiently long period of time, i.e., a period of time longer than a threshold.
In a next step 103, a class c is assigned to the captured full-length training network stream. Depending on the application that generated it, a class c, or label, may be assigned to each captured full-length training network stream. The full-length training network streams generated by an application may be grouped and assigned the corresponding class c. Alternatively, the class c may be assigned according to, for example, the platform or service that generated the network flow.
In a next step 104, the predictive model is trained on the full-length training network stream.
In a next step 105, a truncated training network stream is obtained from the full-length training network stream.
In a next step 106, the predictive model trained on the full-length training network stream is applied to the truncated training network stream to obtain a plurality of training classes ĉ.
In a next step 107, the training classes ĉ predicted using the predictive model on the truncated training network stream are compared to the assigned class c.
In a next step 108, the correction model is trained on the truncated training network stream by taking into account the comparison of the training classes ĉ with the assigned class c.
In the training phase 101, the predictive model is trained 104 on the full-length training network stream and applied 106 to the truncated training network stream to obtain the training classes. The correction model is trained 108 on the truncated training network stream, taking into account the results of the prediction model and the results of the comparison 107 with the classes assigned 103.
In particular, during the training phase 101, the predictive model is first trained 104 on the full-length training network stream, and the predictive model is then applied 106 to make predictions for the truncated training network stream, in order to observe whether the predictive model misclassifies the truncated training network stream. This provides the input on which the correction model is trained 108. In this respect, the correction model is a meta-model or, in terms of ensemble learning, a form of stacked generalization.
The method shown schematically comprises the following steps during the prediction phase 111.
In step 112, the first few packets of an unclassified network flow, which is the object of the early classification, are received. In the schematic diagram of fig. 1, step 112 is performed after step 108 of the training phase 101. In this regard, preferably, the training phase 101 is performed offline and the prediction phase 111 is performed online.
In a next step 113, a truncated unclassified network flow is obtained from the unclassified network flow. The truncated unclassified network flow may be obtained online by combining a given number of packets, once that number of packets has been received. Preferably, the packets combined into the truncated unclassified network flow are a fraction of the total number of packets belonging to the unclassified network flow. A truncated unclassified network flow can also be obtained by discarding the last one or more received packets.
In a next step 114, the prediction model is applied to the truncated unclassified network flow, and a prediction model classification result ĉ is output.
In a next step 115, the correction model is applied to the truncated unclassified network flow by taking into account the prediction model classification result ĉ, and a correction model classification result is output.
In a next step 116, the correction model classification results are merged to make a final prediction of the unclassified network flow.
In the prediction phase 111, the class of the flow is predicted by applying 114 the prediction model and applying 115 the correction model to the truncated unclassified network flow. In fact, noise arises when the prediction model is applied to a truncated unclassified network flow instead of the full-length unclassified network flow; by applying the correction model, this noise can be reduced.
In fact, the prediction phase 111 may start without waiting for all packets of the network flow to be classified. The correction model may be applied to truncated unclassified network flows that are shorter than the unclassified network flow provided to the prediction model. Thus, during the prediction phase 111, the prediction model, having been trained on full-length training network flows during the training phase 101, may be applied 114 to all received packets. The correction model is then applied 115 multiple times, to the truncated unclassified network flows obtained from the received packets provided to the prediction model, and the multiple predictions obtained from the correction model are then merged 116 together.
In step 115, the correction model generates a plurality of predictions, i.e., a plurality of correction model classification results, made over all truncated unclassified network flows obtained 113 from the received 112 packets provided to the prediction model in step 114. The predictions may be merged 116, for example, by scoring, to produce a final prediction of the class of the unclassified network flow.
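The prediction-phase steps 112 to 116 can be sketched as below. Both `predict_model` and `correct_model` are assumed stand-ins invented for the example (the patent does not prescribe their internals); the point is the control flow: one prediction over all received packets, one correction per truncation, and a merge by majority scoring.

```python
from collections import Counter


def predict_model(packets):
    """Assumed stand-in for step 114: large average packet size means video."""
    return "video" if sum(packets) / len(packets) > 500 else "chat"


def correct_model(prefix, guess):
    """Assumed stand-in for step 115: flip very uniform prefixes to chat."""
    return "chat" if max(prefix) - min(prefix) < 20 else guess


def classify_early(packets, min_len=2):
    guess = predict_model(packets)                      # step 114: all received packets
    votes = Counter(correct_model(packets[:k], guess)   # step 115: each truncation
                    for k in range(min_len, len(packets) + 1))
    return votes.most_common(1)[0][0]                   # step 116: merge by scoring


print(classify_early([100, 110, 1400, 1390, 1410]))    # -> video
```

One uniform-looking prefix votes "chat", but the longer truncations confirm "video", so the merged final prediction is "video".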
Fig. 2 is a schematic diagram of a network flow early classification method according to another embodiment of the present invention. According to this embodiment, step 104 of training the predictive model on the full-length training network stream in the embodiment of fig. 1 includes steps 201 and 202.
In step 201, a first feature set is extracted from the captured full-length training network stream to obtain a first training feature vector x.
In step 202, a predictive model is trained on a first training feature vector x and respective classes c assigned to the captured full-length training network streams.
Fig. 3 is a diagram illustrating a network flow early classification method according to another embodiment of the invention. According to this embodiment, the step 106 of the embodiment of Fig. 1, in which the prediction model trained on the full-length training network flow is applied to the truncated training network flow to obtain a plurality of training classes ĉ, comprises steps 301, 302 and 303. Further, the step 108 of the embodiment of Fig. 1, in which the correction model is trained on the truncated training network flow by comparing the training classes ĉ with the assigned classes c, comprises step 304.
In step 301, the first feature set is extracted from the truncated training network flow to obtain a second training feature vector x̂.
In step 302, a second feature set is extracted from the truncated training network flow to obtain a third training feature vector x̃.
In step 303, the prediction model is applied to the second training feature vector x̂ of the truncated training network flow to obtain the training class ĉ.
In step 304, the correction model is trained on the third training feature vector x̃, the assigned class c and the training class ĉ.
Fig. 4 is a diagram illustrating a network flow early classification method according to another embodiment of the invention. According to this embodiment, the step 114 of applying the prediction model to the truncated unclassified network flow in the embodiment of Fig. 1 comprises steps 401 to 403.
In step 401, the first feature set is extracted from the truncated unclassified network flow.
In step 402, a first predicted feature vector x̂ is computed from the first feature set extracted from the truncated unclassified network flow.
In step 403, the computed first predicted feature vector x̂ is provided to the prediction model to obtain the corresponding prediction model classification result ĉ.
Fig. 5 is a diagram illustrating a network flow early classification method according to another embodiment of the invention. According to this embodiment, the step 115 of applying the correction model to the truncated unclassified network flow in the embodiment of Fig. 1 comprises steps 501 to 503.
In step 501, the second feature set is extracted from the truncated unclassified network flow.
In step 502, a second predicted feature vector x̃ is computed from the second feature set extracted from the truncated unclassified network flow.
In step 503, the computed second predicted feature vector x̃ and the obtained prediction model classification result ĉ are provided to the correction model to obtain a classification of the truncated unclassified network flow.
According to another embodiment of the invention, traffic captures may comprise training network flows, which may be grouped by, for example, the application generating them, where c represents the class or label of an application group. The present invention proposes a classification algorithm based on reference features that learns on training network flows and predicts the application of an unclassified or unlabeled network flow from its features.
In the proposed learning method, it is proposed to extract 201 features from the full-length training network flows in the training phase 101 and to train 202 the prediction model accordingly. Once the prediction model is fitted to the data, it can be used for prediction. For example, it is proposed to obtain the conditional posterior probability p(c|x) from the prediction model. If x is the first training feature vector calculated for a full-length training network flow, the value p(c|x) represents the probability that the full-length training network flow belongs to class c. The application of the flow can then be derived by the following equation:
class(x) ← argmax_c p(c|x) (5)
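Equation (5) is a plain argmax over a class posterior; as a one-function sketch (the probability table below is illustrative only, not from the patent):

```python
def classify(posterior):
    """Equation (5): argmax over classes c of the conditional posterior p(c|x)."""
    return max(posterior, key=posterior.get)


# Illustrative posterior p(c|x) for one feature vector x.
p_c_given_x = {"skype": 0.7, "p2p": 0.2, "web": 0.1}
print(classify(p_c_given_x))   # -> skype
```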
in the prior art, it is assumed that features have been computed over the full-length training network flow, which means that averaging is performed over all packets of each flow. Fig. 7 shows a corresponding sorting workflow according to the prior art. In the offline training phase, the model, i.e., the predictive model, is trained using a training data set corresponding to the full-length training network flow. In the online prediction phase, the trained prediction model is applied to the received test stream to make predictions in real time.
Now, according to the invention, it is proposed to compute the same feature set on truncated versions of the full-length training network flows. If the averages are computed only over the first few packets of a full-length training network flow, a second training feature vector x̂, computed on the truncated training network flow, is obtained. Because the first training feature vector x and the second training feature vector x̂ are built from the same feature set, called the first feature set, it is possible to apply equation (5) above to the second training feature vector x̂. Early classification of truncated flows is performed according to the following equation:
ĉ = class(x̂) ← argmax_c p(c|x̂) (6)
the earliness of the prediction is determined by the truncation length, or by the time interval between the first and last packet of each truncated network flow to be classified. Although both types of restriction are eligible, the first one is considered in particular here. That is, the present invention seeks the most accurate prediction that can be achieved on truncated unclassified network flows with a fixed truncation length, while utilizing full-length training network flows during the training phase.
Because the second feature vector x̂ calculated on the truncated flow is noisy, the class ĉ predicted according to equation (6) may differ from the class c calculated on the corresponding full-length flow according to equation (5). The core concept of the invention is to construct a complementary model, called the correction model, which captures the sensitivity of the prediction model to flow truncation. Here, "sensitivity" denotes the difference between the prediction made by the prediction model on a truncated network flow and its prediction for the corresponding full-length training network flow. The features x̃ used by the correction model may differ from the features of the prediction model. In addition, as a secondary model, the correction model may utilize features derived from the predictions made by the prediction model.
Under the condition that the first few packets of a flow form a truncated flow with features x̃, and that this truncated flow has been classified by the prediction model as belonging to class ĉ, the value p(c|x̃, ĉ) can be regarded as the posterior probability that the flow belongs to class c. The class of the flow can then be calculated according to the following equation:
class ← argmax_c p(c|x̃, ĉ) (7)
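Equation (7) differs from equation (5) only in what is conditioned on: the posterior depends on both the second feature vector x̃ and the prediction model's guess ĉ. A sketch with an illustrative (invented) posterior table:

```python
def correct(posterior_table, x_tilde, c_hat):
    """Equation (7): argmax_c p(c | x~, c^), conditioning on both the second
    feature vector and the prediction model's classification result."""
    table = posterior_table[(x_tilde, c_hat)]
    return max(table, key=table.get)


# Illustrative posteriors: when the prefix has uniform packet sizes, the
# prediction model's "p2p" guess is usually wrong and is corrected to "skype".
posterior = {
    ("uniform_sizes", "p2p"): {"skype": 0.8, "p2p": 0.2},
    ("bursty_sizes", "p2p"): {"skype": 0.1, "p2p": 0.9},
}
print(correct(posterior, "uniform_sizes", "p2p"))   # -> skype
```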
FIG. 8 illustrates a classification workflow for early classification of network flows according to an embodiment of the invention.
Both the prediction and the correction model are trained in an offline mode. First, it is proposed to extract 201 the features x from the full-length training network flows, which serve as the training examples for building the prediction model, and to train 202 the prediction model on these features and the associated class labels.
Second, it is proposed to pick packet samples from the full-length training network flows and to form 105 truncated training network flows. By construction, any truncated training network flow consists of all packets of the sampled flow starting with the first packet and ending at the sampled packet. To build the correction model, it is proposed to extract 301, 302 the features x̂ and x̃ from the truncated training network flows and to apply 303 the constructed prediction model to x̂ to obtain the training classes ĉ. A training class ĉ represents the prediction model's guess, computed on the truncated training network flow, for the class of the corresponding full-length training network flow.
Subsequently, it is proposed to compare 107 the training classes ĉ with the classes c assigned to the corresponding training network flows, and to train 304 the correction model on the features x̃ and ĉ.
After training, the models are used for online classification during the prediction phase 111. For a network flow to be classified, the features x̂ are first calculated 401, 402 and provided 403 to the prediction model to obtain ĉ according to equation (6). Subsequently, the features x̃ are computed 502 and provided 503, together with ĉ, to the correction model to obtain the class c according to equation (7).
If the truncation length L is fixed and resources are available, the above classification process can be repeated several times on truncated flows of various lengths shorter than L, where L corresponds to the flow to be classified. The several values of c thus obtained can then be combined, for example by scoring, to further improve accuracy.
An embodiment of the method has been tested on traffic captures containing flows generated by 4 consecutive Skype VoIP calls with a P2P client running in the background. For this test, the flows have been labeled with a Deep Packet Inspection (DPI) utility that extracts Skype traffic. The following table shows the statistics of these traffic captures:
Capture | Duration, s | Size, bytes | Packets | TCP packets | UDP packets | Flows | Skype flows
1       | 167.45      | 6116576     | 12414   | 6755        | 5590        | 480   | 35
2       | 108.45      | 130444      | 6734    | 1330        | 5331        | 444   | 21
3       | 189.14      | 5634602     | 11389   | 5752        | 5620        | 508   | 25
4       | 98.17       | 2840772     | 9122    | 3163        | 5937        | 755   | 24
A Support Vector Machine (SVM) algorithm has been applied to these captures four times in a round-robin fashion: during each experiment, Skype flows were detected in one capture while the other three captures were used for learning. After each experiment, performance metrics were evaluated, namely precision, recall, and the F1 score obtained by combining the first two metrics.
The following features (the elements of x and x̂) have been extracted from the examples provided to the prediction model:
- the transport layer protocol (TCP or UDP) of the flow;
- the average size of the packets in the flow;
- the standard deviation of the size of the packets in the flow;
- the inter-arrival times of the packets in the flow;
- the standard deviation of the inter-arrival times of the packets in the flow; and
- the entropy value of the application layer payload of the packets in the flow.
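The feature list above can be computed as sketched below. The flow representation, a protocol string plus a list of (timestamp, payload_bytes) records, is an assumption made for the example; the patent does not specify a record layout.

```python
import math
from statistics import mean, pstdev


def first_feature_set(protocol, packets):
    """Compute the six listed features for a flow given as a protocol string
    and a list of (timestamp, payload_bytes) records (assumed layout)."""
    sizes = [len(payload) for _, payload in packets]
    # Inter-arrival times between consecutive packets.
    gaps = [t2 - t1 for (t1, _), (t2, _) in zip(packets, packets[1:])] or [0.0]
    # Byte-level Shannon entropy of the concatenated application payloads.
    data = b"".join(payload for _, payload in packets)
    probs = [data.count(bytes([b])) / len(data) for b in range(256)]
    entropy = -sum(p * math.log2(p) for p in probs if p)
    return [1.0 if protocol == "TCP" else 0.0,
            mean(sizes), pstdev(sizes),
            mean(gaps), pstdev(gaps),
            entropy]


vec = first_feature_set("UDP", [(0.00, b"\x00\x01"),
                                (0.01, b"\x02\x03"),
                                (0.03, b"\x00\x00")])
print([round(v, 3) for v in vec])
```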
The following features (the elements of x̃) have been extracted from the examples provided to the correction model:
- the transport layer protocol (TCP or UDP) of the flow;
- the application class ĉ calculated on the first L packets of the flow, and the conditional probability p(ĉ|x̂);
- the application class ĉ calculated on the first k (k < L) packets of the flow, and the corresponding conditional probability;
- the relative position of the k-th packet in the truncated flow, k/L;
- the relative difference between the average size of the first k packets of the truncated flow and the average size of all packets of the truncated flow;
- the relative difference between the inter-arrival times of the first k packets of the truncated flow and the inter-arrival times of all packets of the truncated flow; and
- the relative difference between the entropy value of the application layer payloads of the first k packets of the truncated flow and that of all packets of the truncated flow.
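The relative-position and relative-difference features in the list above can be computed as sketched here; the function name and the dict keys are illustrative, and the packet sizes and gaps are invented example data.

```python
from statistics import mean


def truncation_features(sizes, gaps, k, L):
    """Truncation features for the first k of the L packets of a truncated
    flow: relative position k/L and relative differences of prefix statistics
    against the statistics of the whole truncated flow."""
    def rel(part, whole):
        return (mean(part) - mean(whole)) / mean(whole)

    return {
        "k_over_L": k / L,
        "rel_size_diff": rel(sizes[:k], sizes[:L]),
        # k packets have k-1 inter-arrival gaps; L packets have L-1.
        "rel_gap_diff": rel(gaps[:k - 1], gaps[:L - 1]),
    }


feats = truncation_features(sizes=[100, 100, 1400, 1400, 1400],
                            gaps=[0.01, 0.01, 0.02, 0.02], k=2, L=5)
print({name: round(v, 3) for name, v in feats.items()})
```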
Fig. 9 illustrates the improvement in earliness and accuracy according to embodiments of the invention, and in particular the accuracy improvement that can be achieved for various truncation lengths measured on classified network flows.
Fig. 9 shows the overall F1 score measured as the truncation length L varies from 1 to 10. The dark color indicates the accuracy of the baseline SVM algorithm and the light color indicates the improvement of the invention, ranging from 13% to 30%. As shown in Fig. 9, an improvement of up to 30% is achieved for the values L = 3, 4, 5.
The horizontal dashed line corresponds to the accuracy of the baseline algorithm measured on full-length flows. The inventive method achieves the same accuracy using only the first 5 packets of each flow, which means that classification can be completed faster.
The invention has been described in connection with various embodiments and implementations as examples. Other variations will be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the independent claims. In the claims and the description the term "comprising" does not exclude other elements or steps and the "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (13)

1. A method for early classification of network flows,
the method comprises a training phase (101) and a prediction phase (111),
the training phase (101) comprises:
capturing (102) a full-length training network stream;
assigning (103) a class to the captured full-length training network flow;
training (104) a predictive model on the full-length training network stream;
-obtaining (105) a truncated training network stream from the full-length training network stream;
applying (106) the predictive model trained on the full-length training network stream to the truncated training network stream to obtain a plurality of training classes;
comparing (107) the training classes predicted using the predictive model on the truncated training network stream with the assigned classes; and
training (108) a correction model on the truncated training network flow by taking into account the comparison of the training classes and the assignment classes; and
the prediction phase (111) comprises:
receiving (112) the first few packets of an unclassified network flow, said unclassified network flow being the object of an early classification;
-obtaining (113) a truncated unclassified network flow from said unclassified network flows;
applying (114) the predictive model to the truncated unclassified network flow and outputting a predictive model classification result;
applying (115) the correction model to the truncated unclassified network flow by taking into account the prediction model classification result and outputting a correction model classification result; and
merging (116) the corrected model classification results to make a final prediction of the unclassified network flow.
2. The method of claim 1,
training (104) the predictive model over the full-length training network stream comprises:
extracting (201) a first feature set from the captured full-length training network stream to obtain a first training feature vector; and
training (202) the predictive model on the first training feature vector and the respective classes assigned to the captured full-length training network streams.
3. The method of claim 2,
applying (106) the predictive model trained on the full-length training network stream to the truncated training network stream to obtain a plurality of training classes comprises:
extracting (301) the first set of features from the truncated training network stream to obtain a second training feature vector;
extracting (302) a second feature set from the truncated training network stream to obtain a third training feature vector; and
applying (303) the predictive model to the second training feature vector of the truncated training network stream to obtain the training class; and
training (108) the correction model on the truncated training network flow by taking into account the comparison of the training class and the assignment class comprises:
training (304) the correction model on the third training feature vector, the assignment class and the training class.
4. The method according to any of the preceding claims,
applying (114) the predictive model to the truncated unclassified network flow comprises:
extracting (401) a first set of features from the truncated unclassified network flow;
computing (402) a first predicted feature vector from the first set of features from the truncated unclassified network flow; and
-providing (403) said calculated first prediction feature vector to said prediction model to obtain a corresponding prediction model classification result.
5. The method of claim 4,
applying (115) the correction model to the truncated unclassified network flow comprises:
extracting (501) a second set of features from the truncated unclassified network stream;
computing (502) a second predicted feature vector by a second set of features from the truncated unclassified network flow; and
providing (503) the calculated second predicted feature vector and the obtained prediction model classification result to the correction model to obtain a classification of the truncated unclassified network flow.
6. The method according to claim 2 or 3,
the first set of features extracted from the network flow comprises:
a transport layer protocol of the stream;
an average size of packets in the stream;
a standard deviation of sizes of packets in the stream;
inter-arrival times of packets in the stream;
a standard deviation of inter-arrival times of packets in the stream;
entropy values of application layer loads of packets in the stream; or
Any combination thereof.
7. The method of claim 3,
the second set of features extracted from the network flow comprises:
a transport layer protocol of the stream;
assume that a second predicted feature vector is calculated over the first L packets of the stream
Figure FDA0002452428290000022
The conditional probability of the second prediction model classification result in the case of (a);
assume that a second predicted feature vector is computed over the first k packets of the stream
Figure FDA0002452428290000021
The conditional probability of the second prediction model classification result in case of (a), where k<L;
The relative position of the kth packet in the shut-off stream is k/L;
the relative difference between the average size of the first k packets of the intercepted flow and the average size of all packets in the intercepted flow;
the relative difference in inter-arrival times of the first k packets of the intercepted stream and all packets in the intercepted stream;
the entropy values of the application layer loads of the first k packets of said intercepted flow and the relative differences of the entropy values of the application layer loads of all packets in said intercepted flow; or
Any combination thereof.
8. The method according to any one of claims 1 to 3,
obtaining (113) the truncated unclassified network flow from the unclassified network flow comprises:
the truncated unclassified network flow is obtained by discarding the last one or more of the packets of the received unclassified network flow.
9. The method according to any one of claims 1 to 3,
at least one of the truncated unclassified network flows is obtained by truncating the unclassified network flow by a truncation length (L).
10. The method of claim 9,
the cut length (L) is fixed.
11. The method of claim 9,
additionally truncating the unclassified network flow by a length less than the truncation length (L) to obtain a shorter unclassified network flow, and
additionally applying the prediction phase to the shorter unclassified network flow to improve the accuracy of the classification.
12. A computer-readable storage medium comprising instructions that, when executed, cause a computer to perform the method of any of claims 1-11.
13. A computing device, characterized in that the computing device comprises program code which, when run on the computing device, performs the method according to any of claims 1-11.
CN201580083836.5A 2015-10-12 2015-10-12 Early classification of network flows Active CN108141377B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2015/000662 WO2017065627A1 (en) 2015-10-12 2015-10-12 Early classification of network flows

Publications (2)

Publication Number Publication Date
CN108141377A CN108141377A (en) 2018-06-08
CN108141377B true CN108141377B (en) 2020-08-07

Family

ID=55969445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201580083836.5A Active CN108141377B (en) 2015-10-12 2015-10-12 Early classification of network flows

Country Status (2)

Country Link
CN (1) CN108141377B (en)
WO (1) WO2017065627A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10972358B2 (en) * 2017-08-30 2021-04-06 Citrix Systems, Inc. Inferring congestion and signal quality
US10772016B2 (en) 2018-12-05 2020-09-08 At&T Intellectual Property I, L.P. Real-time user traffic classification in wireless networks
CN111211948B (en) * 2020-01-15 2022-05-27 太原理工大学 Shodan flow identification method based on load characteristics and statistical characteristics

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645806A (en) * 2009-09-04 2010-02-10 东南大学 Network flow classifying system and network flow classifying method combining DPI and DFI
CN101741744A (en) * 2009-12-17 2010-06-16 东南大学 Network flow identification method
CN102739522A (en) * 2012-06-04 2012-10-17 华为技术有限公司 Method and device for classifying Internet data streams
EP2521312A2 (en) * 2011-05-02 2012-11-07 Telefonaktiebolaget L M Ericsson (publ) Creating and using multiple packet traffic profiling models to profile packet flows
CN102957579A (en) * 2012-09-29 2013-03-06 北京邮电大学 Network anomaly traffic monitoring method and device
CN103593470A (en) * 2013-11-29 2014-02-19 河南大学 Double-degree integrated unbalanced data stream classification algorithm

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8583416B2 (en) * 2007-12-27 2013-11-12 Fluential, Llc Robust information extraction from utterances
US8095635B2 (en) 2008-07-21 2012-01-10 At&T Intellectual Property I, Lp Managing network traffic for improved availability of network services
US9652362B2 (en) * 2013-12-06 2017-05-16 Qualcomm Incorporated Methods and systems of using application-specific and application-type-specific models for the efficient classification of mobile device behaviors


Also Published As

Publication number Publication date
WO2017065627A1 (en) 2017-04-20
CN108141377A (en) 2018-06-08

Similar Documents

Publication Publication Date Title
Shafiq et al. A machine learning approach for feature selection traffic classification using security analysis
US8311956B2 (en) Scalable traffic classifier and classifier training system
CN109726744B (en) Network traffic classification method
Bujlow et al. A method for classification of network traffic based on C5. 0 Machine Learning Algorithm
Bar-Yanai et al. Realtime classification for encrypted traffic
EP2521312B1 (en) Creating and using multiple packet traffic profiling models to profile packet flows
KR101295708B1 (en) Apparatus for capturing traffic and apparatus, system and method for analyzing traffic
WO2016107180A1 (en) Method and device for detecting type of network data flow
CN111953552B (en) Data flow classification method and message forwarding equipment
CN107967488B (en) Server classification method and classification system
WO2011130957A1 (en) Method and apparatus for online distinguishing transmission control protocol traffic by using data flow head characteristics
EP2485432B1 (en) A method and apparatus for communications analysis
CN113206860B (en) DRDoS attack detection method based on machine learning and feature selection
CN108629183A (en) Multi-model malicious code detecting method based on Credibility probability section
CN108141377B (en) Early classification of network flows
CN111711545A (en) Intelligent encrypted flow identification method based on deep packet inspection technology in software defined network
Dixit et al. Internet traffic detection using naïve bayes and K-Nearest neighbors (KNN) algorithm
CN108566340A (en) Network flow fining sorting technique based on dynamic time warping algorithm and device
US20210158217A1 (en) Method and Apparatus for Generating Application Identification Model
Khatouni et al. How much training data is enough to move a ML-based classifier to a different network?
Oudah et al. A novel features set for internet traffic classification using burstiness
KR20210070597A (en) Method for classifying traffic and apparatus thereof
Zhang et al. Network traffic clustering with QoS-awareness
KR20130126830A (en) System and method for creating real-time application signiture
CN110061869B (en) Network track classification method and device based on keywords

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant