Disclosure of Invention
The invention aims to provide a method and a device for classifying network data streams with high accuracy and high recall, using a lower sample-labeling cost and more comprehensive network data stream features.
To achieve the above object, a first aspect of the present specification provides a method for training a classifier for classifying network data streams, comprising the following steps: respectively extracting a stream load feature U1, a stream statistical feature U2, and a stream entropy feature U3 from the data stream; classifying the data stream by using a classifier Fi corresponding to a feature Ui to obtain a first classification result, where Ui is any one of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3, and i is 1, 2, or 3; classifying the data stream by using a classifier Fj corresponding to a feature Uj to obtain a second classification result, where Uj is any one of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3 that is not equal to Ui, and j is 1, 2, or 3; and, in the case that the first classification result is the same as the second classification result, using the data stream and the first classification result as training data for training a classifier Fk corresponding to a feature Uk, where Uk is the one of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3 other than Ui and Uj, and k is 1, 2, or 3.
In one embodiment, before the classifying of the data stream by using the classifier Fi corresponding to the feature Ui, the method further includes: training F1 corresponding to the stream load feature U1, F2 corresponding to the stream statistical feature U2, and F3 corresponding to the stream entropy feature U3 with training sets E1, E2, and E3, respectively, derived from the calibrated data stream set.
In one embodiment, using the data stream and the first classification result as training data comprises adding the data stream and the first classification result to a current training set of the classifier Fk to obtain a new training set Ek 'of the classifier Fk, the method further comprising retraining the classifier Fk with the new training set Ek'.
In one embodiment, the method for training classifiers for classifying network data streams further includes, after retraining all of classifiers F1, F2, and F3, if any of classifiers F1, F2, and F3 changes, repeating the method until all of classifiers F1, F2, and F3 do not change.
In one embodiment, the method for training a classifier for network data flow classification further includes deriving an integrated classifier by using a majority voting principle when none of F1, F2, and F3 changes.
In one embodiment, the extracting of the stream load features includes extracting the tf × idf values of feature words included in the semantic information of the data stream, where tf is the word frequency and idf is the inverse document frequency, that is, the logarithm of the ratio of the number of streams in the training set to the number of streams containing the feature word.
In one embodiment, the extracting of the stream load features comprises removing feature words with tf below a word frequency threshold and feature words with idf above an inverse document frequency threshold.
In one embodiment, the extracting the flow statistics comprises extracting at least one of a flow interval, an intra-flow packet-to-packet interval, a packet size, a number of packets, a number of TCP identification bits, and an activation state.
In one embodiment, any of F1, F2, and F3 is based on at least one algorithm among decision trees, naive Bayes, support vector machines, association rule learning, neural networks, and genetic algorithms.
In one embodiment, F1-F3 are based on the same algorithm.
A second aspect of the present specification provides an apparatus for training a classifier for classification of network data flows, comprising: a feature extraction unit configured to: respectively extract a stream load feature U1, a stream statistical feature U2, and a stream entropy feature U3 from the data stream; a first classification unit configured to: classify the data stream by using a classifier Fi corresponding to a feature Ui to obtain a first classification result, where Ui is any one of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3, and i is 1, 2, or 3; a second classification unit configured to: classify the data stream by using a classifier Fj corresponding to a feature Uj to obtain a second classification result, where Uj is any one of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3 that is not equal to Ui, and j is 1, 2, or 3; and a training data acquisition unit configured to: in the case that the first classification result is the same as the second classification result, use the data stream and the first classification result as training data for training a classifier Fk corresponding to a feature Uk, where Uk is the one of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3 other than Ui and Uj, and k is 1, 2, or 3.
In one embodiment, the apparatus for training a classifier for classifying network data streams further includes an initial training unit configured to: before the data stream is classified by the classifier Fi corresponding to the characteristic Ui, F1 corresponding to the stream load characteristic U1, F2 corresponding to the stream statistical characteristic U2, and F3 corresponding to the stream entropy characteristic U3 are trained by training sets E1, E2, and E3 based on the calibrated data stream set, respectively.
In an embodiment, the using the data stream and the first classification result as training data comprises adding the data stream and the first classification result to a current training set of the classifier Fk, thereby obtaining a new training set Ek 'of the classifier Fk, the apparatus further comprising a retraining unit configured to retrain the classifier Fk with the new training set Ek'.
In an embodiment, the apparatus for training a classifier for classifying a network data flow further includes an iteration unit configured to: after retraining all classifiers F1, F2, and F3, if any of classifiers F1, F2, and F3 change, the operations performed by the apparatus are repeated until no more changes occur to any of classifiers F1, F2, and F3.
In one embodiment, the apparatus for training a classifier for network data stream classification further comprises an integration unit configured to derive an integrated classifier by using a majority voting principle when none of F1-F3 is changed.
A third aspect of the present specification provides a computer-readable storage medium having stored thereon instruction code which, when executed in a computer, causes the computer to perform the above-described method of training a classifier for network data flow classification.
A fourth aspect of the present specification provides a method of classifying a network data flow, comprising: extracting the stream characteristics Vi for the data stream, wherein Vi is any one of stream load characteristics V1, stream statistical characteristics V2 and stream entropy value characteristics V3; and inputting the stream characteristics Vi into a classifier Fi corresponding to the characteristics Vi, which is obtained by the method for training the classifier for classifying the network data stream, so as to obtain the type Ci of the data stream.
A fifth aspect of the present specification provides an apparatus for classifying a network data stream, comprising: a feature extraction unit configured to extract a stream feature Vi from the data stream, Vi being any one of a stream load feature V1, a stream statistical feature V2, and a stream entropy feature V3; and a classification unit configured to input the stream feature Vi into a classifier Fi corresponding to the feature Vi, obtained by the above method for training a classifier for classifying network data streams, so as to obtain the type Ci of the data stream.
A sixth aspect of the present specification provides a computer-readable storage medium having stored thereon instruction codes, which, when executed in a computer, cause the computer to perform the above-mentioned method of classifying a network data stream.
The embodiments of the specification comprehensively and deeply mine the data characteristics and behavior of network traffic by combining the stream statistical features, the stream load features, and the stream entropy features. A collaborative learning algorithm reasonably introduces a small number of calibrated samples to expand the training samples, enhancing the accuracy of the classifiers. In addition, drawing on the idea of ensemble learning, the majority voting principle is used to aggregate the classification results of the single classifiers, further improving the accuracy and recall of the classifier.
Detailed Description
Specific embodiments of the present specification are described below with reference to the accompanying drawings.
Fig. 1 shows a general schematic diagram of modules included in the technical solution of the embodiment of the present specification. The technical scheme of the embodiment of the specification comprises four modules: a data acquisition module 11, a feature extraction module 12, a model training module 13, and a classification implementation module 14.
FIG. 2 shows a general schematic of the steps of an embodiment of the present description implemented in the various modules shown in FIG. 1.
As shown in fig. 2, in the data acquisition module 11, MAC packets are captured and TCP stream restoration is performed to obtain a set of network data streams; the set is divided into a calibration set L and an uncalibrated set U, and the data streams in the calibration set L are calibrated. In the feature extraction module 12, the stream load feature, the stream statistical feature, and the stream entropy feature of each data stream in the calibration set L and the uncalibrated set U are extracted and vectorized for input to the classifiers in the following steps. In the model training module 13, a single classifier is trained for each of the stream load feature, the stream statistical feature, and the stream entropy feature; three strong classifiers are obtained through collaborative learning among the three single classifiers; and a strong integrated classifier is obtained through the majority voting principle. In the classification implementation module 14, the network data stream may be classified using a single classifier or the integrated classifier obtained in the model training module.
The stream statistical, stream load, and stream entropy features all pertain to network data stream characteristics. The stream statistical features measure the behavior of the data flow, the stream load features capture the semantics of the message content, and the stream entropy features measure the purity of the data flow. By combining the three kinds of features, the data characteristics and behavior of network traffic can be comprehensively and deeply mined. Training the classifiers with a collaborative learning algorithm reduces the amount of data that needs to be calibrated, lowering the data calibration cost, while uncalibrated data samples are reasonably selected to enhance classification accuracy. In addition, in an ensemble learning manner, the final classification result is obtained by majority voting over multiple single classifiers, improving classification accuracy and recall.
Hereinafter, examples of the present specification will be described in more detail.
Fig. 3 is a flow diagram illustrating a method of training a classifier for network data flow classification according to an embodiment of the present description.
The classification accuracy of a network data flow classifier depends to a great extent on the quality of the training set samples. Network data flows are complex and numerous, and calibrating data samples is time-consuming and labor-intensive, so a large number of samples cannot be calibrated; a classifier trained on only a small number of samples performs poorly. Therefore, how to obtain an accurate classification model using a small number of calibrated samples is a technical problem. In the embodiment of the present specification shown in fig. 3, a Tri-training method based on the idea of collaborative learning is used to achieve this purpose.
As shown in fig. 3, in step 31, a stream load signature U1, a stream statistics signature U2, and a stream entropy signature U3 are extracted from the data stream, respectively. In step 32, the data stream is classified by using a classifier Fi corresponding to a feature Ui, which is any one of the above-mentioned stream load feature U1, stream statistical feature U2, and stream entropy feature U3, where i is 1, 2, or 3, to obtain a first classification result. In step 33, the data stream is classified by using a classifier Fj corresponding to a feature Uj, where Uj is any one of the above-mentioned stream load feature U1, stream statistical feature U2, and stream entropy feature U3 that is not equal to Ui, and j is 1, 2, or 3, to obtain a second classification result. In step 34, in the case that the first classification result is the same as the second classification result, the data stream and the first classification result are used as training data for training the classifier Fk corresponding to the feature Uk, where Uk is the above-mentioned stream load feature U1, the stream statistical feature U2, and one of the stream entropy feature U3 other than Ui and Uj, where k is 1, 2, or 3.
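The agreement test of steps 31-34 can be sketched as follows. This is a minimal illustration, not part of the specification: `pseudo_label` and the `predict_*` callables are hypothetical stand-ins for the trained single classifiers Fi and Fj.

```python
def pseudo_label(views, predict_i, predict_j, i, j, k):
    """views: the per-feature representations (U1, U2, U3) of one data stream.

    Runs the classifiers for features Ui and Uj (supplied as callables);
    if their results agree, returns (Uk view, agreed label) as a new
    training sample for Fk, otherwise returns None.
    """
    first = predict_i(views[i])   # first classification result (step 32)
    second = predict_j(views[j])  # second classification result (step 33)
    if first == second:           # agreement: emit training data (step 34)
        return views[k], first
    return None
```

When the two classifiers disagree, the stream contributes no training data and remains in the uncalibrated set.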
The stream load feature is the data load feature value of the network stream excluding the protocol header, and contains rich semantic information of the communication data. After the flows are calibrated, a feature word set t = {t1, t2, … tn} is extracted, and each stream data message can be represented as a vector over the feature words: V(d) = {(t1, w1), (t2, w2), … (tn, wn)}, where wi is the weight coefficient of the feature word ti, taken as its tf × idf value. Here tf is the word frequency, i.e., the ratio of the number of times a feature word appears in a given data stream to the number of effective words in that stream; idf is the inverse document frequency, i.e., the logarithm of the ratio of the number of streams in the training set to the number of streams containing the feature word; and the tf × idf value is the product of tf and idf. Feature words with tf below the word frequency threshold and words with idf above the inverse document frequency threshold are removed. In the embodiment of the present specification, the flow data are used as row vectors and the tf × idf values of the feature words as column vectors to construct a stream load feature matrix. It should be understood that this calculation of the stream load feature vector is merely exemplary; the stream load feature vector may also be calculated in other ways known to those skilled in the art.
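The tf × idf weighting described above can be sketched with standard-library Python; the function name and input shape are illustrative assumptions (each stream is assumed to have been tokenized into a list of feature words).

```python
import math
from collections import Counter

def tfidf_matrix(streams, vocab):
    """streams: list of token lists, one per flow. Returns one row of
    tf * idf weights per flow, with idf = log(n_streams / n_containing)."""
    n = len(streams)
    # document frequency: number of streams containing each feature word
    df = {t: sum(1 for s in streams if t in s) for t in vocab}
    rows = []
    for s in streams:
        counts = Counter(s)
        total = len(s)  # effective words in this stream
        row = []
        for t in vocab:
            tf = counts[t] / total
            idf = math.log(n / df[t]) if df[t] else 0.0
            row.append(tf * idf)
        rows.append(row)
    return rows
```

Words appearing in every stream receive idf = log(1) = 0 and thus zero weight, which is why low-idf (overly common) words are filtered out in practice.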
Table 1 shows an example of feature words included in a partial data stream.
TABLE 1
In one embodiment, the feature words are stored in a feature word database for use in calculating flow load features.
The stream statistical features are a set of measures calculated from the network behavior of the data flow. Common stream statistical features include the flow time interval, the inter-packet time interval within the flow, the packet size, the number of packets, the number of TCP flag bits, and the activation state. Statistics can be further derived from these characteristics; taking the packet size as an example, the maximum, minimum, mean, and variance of the packet byte counts of a flow can be computed. Meanwhile, according to the direction of the data flow, the features can be further divided into forward flow features and backward flow features. Typical flow statistics are shown in table 2. In the embodiment of the present specification, the flow data are used as row vectors and the statistical feature values as column vectors to construct a stream statistical feature matrix.
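The derived statistics for a single flow, taking packet size as the example above, can be computed with the standard library; the function name and dictionary keys are illustrative assumptions.

```python
import statistics

def packet_size_stats(packet_sizes):
    """Max, min, mean, and (population) variance of the packet byte
    counts of one flow, as described for the stream statistical features."""
    return {
        "max": max(packet_sizes),
        "min": min(packet_sizes),
        "mean": statistics.mean(packet_sizes),
        "variance": statistics.pvariance(packet_sizes),
    }
```

The same reduction can be applied per direction to obtain forward-flow and backward-flow variants of each statistic.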
TABLE 2
The entropy value of the flow represents the degree of disorder of the flow data. The standard calculation formula is well known to those skilled in the art. Specifically, let F denote a data flow message, let Fk denote the set of all consecutive k-character substrings of the data flow message, and let hk denote the entropy corresponding to Fk, calculated as follows: hk = −Σ p(x) log p(x), where the sum runs over the distinct substrings x in Fk and p(x) is the relative frequency of x among the k-character substrings of the message.
according to the formula, for the flow F containing m byte messages, the entropy value feature set H can be obtainedm={h1,h2,…hnAnd in the embodiment of the description, flow data is used as a row vector, and entropy values of different m are used as column vectors to construct a flow entropy value feature matrix.
The classifier Fi corresponding to the feature Ui shown in fig. 3 refers to a single classifier obtained by training on one of the stream statistical, stream entropy, and stream load features of the calibrated sample set. In one embodiment, the collected data streams are divided into a calibration set L and an uncalibrated set U, and the data streams in the calibration set L are calibrated. In one embodiment, network data flows may be classified into ten types: FTP, HTTP, SMTP, IMAP, SSH, POP3, BitTorrent, DNS, KuGou, and PPLive. In one embodiment, the number of data streams in the calibration set is on the order of hundreds, while the number of data streams in the uncalibrated set is on the order of hundreds of thousands; the technical scheme of the embodiment thus greatly reduces the calibration cost. Stream load features are extracted from the data streams in the calibration set L and vectorized together with the calibration types to obtain a training set E1; stream statistical features are extracted and vectorized together with the calibration types to obtain a training set E2; and stream entropy features are extracted and vectorized together with the calibration types to obtain a training set E3. The training sets E1-E3 are then used to train Fi (i = 1:3), respectively, yielding initial classifiers F1-F3 corresponding to the stream load, stream statistical, and stream entropy features, respectively.
In one embodiment, Fi (i = 1:3) is a classification model based on at least one algorithm among decision trees, naive Bayes, support vector machines, association rule learning, neural networks, and genetic algorithms. In another embodiment, F1-F3 are based on the same algorithm. In another embodiment, F1-F3 are based on different algorithms.
Fig. 4 shows a simple schematic diagram of the Tri-training method for training the classifiers shown in fig. 3. F1-F3 can each in turn serve as the main classifier, with the other two serving as collaborative classifiers that enhance the main classifier's training set. Taking F3 as an example, the collaborative classifiers F1 and F2 classify and calibrate each sample in the uncalibrated traffic set; if their calibration results are the same, the sample and the calibration result are added to the training set E3 of F3. After the uncalibrated set U has been classified by classifiers F1 and F2, a new training set E3' of F3 is obtained for subsequently retraining F3. New training sets E1' and E2' for F1 and F2 can be obtained in the same way. Classifiers F1-F3 are then retrained using the new training sets E1', E2', and E3', respectively, to obtain enhanced classifiers F1-F3.
In one embodiment, it is determined whether the enhanced classifiers F1-F3 have changed compared with the classifiers before the retraining. For example, the enhanced classifiers Fi (i = 1:3) and Fj (j = 1:3, j ≠ i) are used again in the Tri-training algorithm to calibrate the uncalibrated set U, and it is determined whether any sample u can be added to the training set Ek (k = 1:3, k ≠ i, k ≠ j), that is, whether a new training set Ek' can be obtained. If no sample u can be added, that is, no new training set Ek' can be obtained, Fk has not changed compared with the classifier before the retraining. If any of the classifiers F1-F3 has changed, the above method is repeated until none of the classifiers F1, F2, and F3 changes after the method is performed, and the algorithm then ends. Three strong classifiers F1-F3 are thus obtained through multiple rounds of training and sample-update iteration.
Fig. 5 shows the iterative algorithm of the Tri-training method described above. As shown in fig. 5, in step 51, for each Fi (i = 1:3), the uncalibrated set U is classified using the other two classifiers Fj (j = 1:3, j ≠ i) and Fk (k = 1:3, k ≠ i, k ≠ j), respectively.
In step 52, if the calibration results of the classifiers Fj (j = 1:3, j ≠ i) and Fk (k = 1:3, k ≠ i, k ≠ j) are the same for a sample u in the uncalibrated set U, the sample u and the calibration result are added to the training set Ei (i = 1:3) of Fi and the sample u is removed from the uncalibrated set U, so as to obtain a new training set Ei' (i = 1:3) of Fi (i = 1:3).
In step 53, the classifiers Fi (i = 1:3) are retrained with the new training sets Ei' (i = 1:3), respectively.
In step 54, it is determined whether any of F1, F2, and F3 has changed; if so, steps 51-53 are repeated until none of F1-F3 changes, thereby obtaining strong classifiers Fi (i = 1:3).
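As a minimal sketch of the iteration in steps 51-54, and only under the assumption that each single classifier is exposed through user-supplied `fit`/`predict` callables (all names below are illustrative, not part of the specification):

```python
def tri_train(classifiers, train_sets, unlabeled, fit, predict):
    """classifiers: list of 3 models; train_sets: list of 3 [X, y] pairs;
    unlabeled: list of 3-view samples (one view per feature U1-U3).
    Repeats pseudo-labeling and retraining until no training set grows."""
    changed = True
    while changed:                       # step 54: loop until nothing changes
        changed = False
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            added = []
            for views in list(unlabeled):
                lj = predict(classifiers[j], views[j])  # step 51
                lk = predict(classifiers[k], views[k])
                if lj == lk:             # step 52: agreement -> add to Ei
                    train_sets[i][0].append(views[i])
                    train_sets[i][1].append(lj)
                    added.append(views)
            for v in added:              # remove pseudo-labeled samples from U
                unlabeled.remove(v)
            if added:                    # step 53: retrain Fi on Ei'
                classifiers[i] = fit(train_sets[i][0], train_sets[i][1])
                changed = True
    return classifiers
```

The loop terminates because each pass either consumes samples from the finite uncalibrated set or leaves every classifier unchanged.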
However, the classification accuracy of a single classifier varies greatly across different classification sets, and overfitting may also occur on a single classification set. Ensemble learning trains different classifiers through sample-set sampling, feature-set selection, classification-algorithm selection, and the like, and then aggregates their results using principles such as majority voting; this not only improves classification accuracy but also effectively avoids the overfitting of a single classifier. In the embodiment of the present specification, three differentiated single classifiers are obtained by applying different classification algorithms or classification features to the same training set, and the final classification result of a sample is then obtained using the majority voting principle.
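The majority voting aggregation over the three single-classifier outputs can be sketched as follows; the tie-breaking rule (falling back to the first classifier's result when all three disagree) is an assumption for illustration, as the specification does not fix one.

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate the types C1-C3 predicted by the three single classifiers.
    Returns the label predicted by at least two classifiers; if all three
    disagree, falls back to the first classifier's result (an assumption)."""
    top, n = Counter(labels).most_common(1)[0]
    return top if n > 1 else labels[0]
```

With three voters and more than two classes, a strict majority is any label predicted at least twice.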
Fig. 6 illustrates an apparatus 600 for training a classifier for network data flow classification according to an embodiment of the present description, including: a feature extraction unit 61 configured to: respectively extract a stream load feature U1, a stream statistical feature U2, and a stream entropy feature U3 from the data stream; a first classification unit 62 configured to: classify the data stream by using a classifier Fi corresponding to a feature Ui to obtain a first classification result, where Ui is any one of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3, and i is 1, 2, or 3; a second classification unit 63 configured to: classify the data stream by using a classifier Fj corresponding to a feature Uj to obtain a second classification result, where Uj is any one of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3 that is not equal to Ui, and j is 1, 2, or 3; and a training data acquisition unit 64 configured to: in the case that the first classification result is the same as the second classification result, use the data stream and the first classification result as training data for training a classifier Fk corresponding to a feature Uk, where Uk is the one of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3 other than Ui and Uj, and k is 1, 2, or 3.
In one embodiment, the apparatus 600 for training a classifier for network data flow classification according to an embodiment of the present specification further includes an initial training unit 65 configured to: before the data stream is classified by the classifier Fi corresponding to the characteristic Ui, F1 corresponding to the stream load characteristic U1, F2 corresponding to the stream statistical characteristic U2, and F3 corresponding to the stream entropy characteristic U3 are trained by training sets E1, E2, and E3 based on the calibrated data stream set, respectively.
In an embodiment, wherein using the data stream and the first classification result as training data includes adding the data stream and the first classification result to a current training set of a classifier Fk to obtain a new training set Ek 'of the classifier Fk, the apparatus 600 for training a classifier for network data stream classification according to an embodiment of the present specification further includes a retraining unit 66 configured to retrain the classifier Fk with the new training set Ek'.
In one embodiment, the apparatus 600 for training a classifier for network data flow classification according to the present description further includes an iteration unit 67 configured to: after training classifiers F1, F2, and F3, if any of classifiers F1, F2, and F3 is changed, the operations performed by the above-described apparatus are repeated until no more changes occur in classifiers F1, F2, and F3.
In one embodiment, the apparatus 600 for training classifiers for network data stream classification according to the present description further comprises an integrating unit 68 configured to derive an integrated classifier by using majority voting principles when none of F1-F3 changes anymore.
The dashed boxes in fig. 6 indicate that the elements are optional elements in the embodiment, but not essential elements. For example, the apparatus 600 in the embodiments of the present specification may not include the initial training unit 65, i.e., instead of obtaining the initial single classifier Fi by training the calibration set, the initial single classifier Fi may be obtained in other ways known to those skilled in the art. Likewise, the retraining unit 66, the iterating unit 67 and the integrating unit 68 are also only optional units and are not necessary in this embodiment. The dotted boxes in fig. 7 and 8 hereinafter also indicate the same meaning.
Fig. 7 illustrates a method for classifying network data flows according to an embodiment of the present description, including the following steps: step 71, extracting flow characteristics Vi from the data flow, wherein the Vi is any one of flow load characteristics V1, flow statistical characteristics V2 and flow entropy value characteristics V3; and step 72, inputting the stream characteristics Vi into a classifier Fi corresponding to the characteristics Vi, obtained by training the classifier for classifying the network data stream according to the above, so as to obtain the type Ci of the data stream.
In one embodiment, the method for classifying a network data stream according to the embodiment of the present disclosure shown in fig. 7 further includes a step 73 of obtaining the final type of the data stream by applying the majority voting principle to the obtained types C1-C3 of the data stream.
Fig. 8 illustrates an apparatus 800 for classifying network data flows according to an embodiment of the present description, including: a feature extraction unit 81 configured to extract a stream feature Vi, Vi being any one of a stream load feature V1, a stream statistical feature V2, and a stream entropy feature V3, for the data stream; and a classification unit 82 configured to input the stream feature Vi into a classifier Fi corresponding to the feature Vi obtained by the method for training the classifier for classifying the network data stream described above, so as to obtain the type Ci of the data stream.
In one embodiment, the apparatus 800 for classifying network data streams according to the embodiment of the present disclosure shown in fig. 8 further includes an integrating unit 83 configured to derive a final type of a data stream by a majority voting principle for the obtained types C1-C3 of the data stream.
In another aspect, embodiments of the present specification also provide a computer-readable storage medium having instruction codes stored thereon, which when executed in a computer, cause the computer to perform the above-mentioned method of training a classifier for network data flow classification.
In yet another aspect, embodiments of the present specification also provide a computer-readable storage medium having computer instruction code stored thereon, which, when executed in a computer, causes the computer to perform the above-mentioned method for classifying network data streams.
The method and the device in the embodiment of the present disclosure may be deployed in any network environment, and classify and analyze traffic of the network environment.
The embodiment of the specification comprehensively and deeply mines the data characteristics and the behavior of the network traffic by combining the traffic statistic characteristics, the traffic load characteristics and the traffic entropy characteristics. The embodiment of the specification also uses a collaborative learning algorithm, uses a small amount of calibration samples, reasonably introduces the calibration samples to expand the training samples, and enhances the accuracy of the classifier. In addition, the embodiment of the specification further improves the accuracy and the recall rate of the classifier by using an integrated learning algorithm and using a majority voting principle to collect the classification result of the single classifier.
It will be further appreciated by those of ordinary skill in the art that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both; to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.