CN107967311B - Method and device for classifying network data streams - Google Patents

Info

Publication number
CN107967311B
CN107967311B (granted from application CN201711158988.4A)
Authority
CN
China
Prior art keywords
stream
classifier
flow
training
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711158988.4A
Other languages
Chinese (zh)
Other versions
CN107967311A (en)
Inventor
Xu Tao (续涛)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced New Technologies Co Ltd
Priority to CN201711158988.4A
Publication of CN107967311A
Application granted
Publication of CN107967311B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design

Abstract

The embodiments of this specification disclose a method and device for training a classifier for network data flow classification. The method comprises the following steps: extracting a stream load feature U1, a stream statistical feature U2, and a stream entropy feature U3 from the data stream, respectively; classifying the data stream with a classifier Fi corresponding to the feature Ui to obtain a first classification result; classifying the data stream with a classifier Fj corresponding to the feature Uj to obtain a second classification result; and, in the case that the first classification result is the same as the second classification result, using the data stream and the first classification result as training data for training a classifier Fk corresponding to a feature Uk, where Uk is whichever of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3 is neither Ui nor Uj.

Description

Method and device for classifying network data streams
Technical Field
The present invention relates to the field of network data flow classification, and more particularly, to a method and apparatus for training a classifier for network data flow classification, and a method and apparatus for classifying network data flow.
Background
A network data flow is the data embodiment of an interaction between communicating network parties. It is the cornerstone of cyberspace security work, providing data support and technical backing for network attack detection, network topology optimization, network billing management, and network service improvement. With the rapid development of the Internet, network data flows are growing quickly and becoming more diverse and complex, carrying ever more intricate data and instructions, and they exhibit different data characteristics and network behaviors under different application modes and scenarios. Taking network attack detection as an example, a network data flow may contain malicious code, covert channels, protocol exploits, stolen information, and the like; if such flows can be effectively classified and deeply analyzed, attack traffic can be detected in time and defended against.
Existing network data flow classification methods mainly comprise matching on traffic load features and machine-learning classification based on flow statistical features, and both achieve good classification results. In classification based on traffic load features, the feature words contained in the semantic information of the data flow and their attributes are extracted to train a classifier. In classification based on flow statistical features, at least one of the flow time interval, the packet time intervals within the flow, packet size, packet count, TCP flag count, and activation state is extracted to train the classifier. However, a single type of feature cannot fully and effectively describe the network behavior and data characteristics of traffic. Network data flows are complex, calibrating data samples is time-consuming and labor-intensive, and a classifier trained on only a small number of samples performs poorly. In addition, owing to the influence of the data samples, a single classifier easily develops a classification bias, which harms classification accuracy.
Therefore, there is a need for a more effective scheme that can classify network data streams with high accuracy and high recall, at a lower sample-calibration cost and with more comprehensive network data stream features.
Disclosure of Invention
The object of the invention is to provide a method and device for classifying network data streams with high accuracy and high recall, using a lower sample-calibration cost and more comprehensive network data stream features.
To achieve the above object, a first aspect of the present specification provides a method for training a classifier for classifying network data streams, comprising the following steps: extracting a stream load feature U1, a stream statistical feature U2, and a stream entropy feature U3 from the data stream, respectively; classifying the data stream with a classifier Fi corresponding to a feature Ui to obtain a first classification result, where Ui is any one of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3, and i is 1, 2, or 3; classifying the data stream with a classifier Fj corresponding to a feature Uj to obtain a second classification result, where Uj is any one of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3 not equal to Ui, and j is 1, 2, or 3; and, in the case that the first classification result is the same as the second classification result, using the data stream and the first classification result as training data for training a classifier Fk corresponding to a feature Uk, where Uk is whichever of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3 is neither Ui nor Uj, and k is 1, 2, or 3.
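The agreement condition in this first aspect can be sketched as follows (a hypothetical illustration in Python; the sample format and the toy classifiers f1 and f2 are assumptions for illustration, not part of the patent):

```python
def co_label(sample, f_i, f_j):
    """If the two auxiliary classifiers agree on an uncalibrated sample,
    return (sample, label) as a training example for the third
    classifier Fk; otherwise return None."""
    first = f_i(sample)   # first classification result
    second = f_j(sample)  # second classification result
    if first == second:
        return (sample, first)
    return None

# Toy stand-ins for two single-feature classifiers (hypothetical):
f1 = lambda s: "HTTP" if s["port"] == 80 else "FTP"
f2 = lambda s: "HTTP" if s["payload"].startswith(b"GET") else "FTP"

agreed = co_label({"port": 80, "payload": b"GET / HTTP/1.1"}, f1, f2)
rejected = co_label({"port": 80, "payload": b"USER alice"}, f1, f2)  # the two disagree
```

Only samples on which the two auxiliary classifiers agree become training data, which is what lets uncalibrated flows enlarge the third classifier's training set.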
In one embodiment, before the classifying the data stream by using the classifier Fi corresponding to the feature Ui, the method further includes: f1 corresponding to the flow load characteristic U1, F2 corresponding to the flow statistic characteristic U2 and F3 corresponding to the flow entropy characteristic U3 are respectively trained by training sets E1, E2 and E3 based on the calibrated data flow set.
In one embodiment, using the data stream and the first classification result as training data comprises adding the data stream and the first classification result to a current training set of the classifier Fk to obtain a new training set Ek 'of the classifier Fk, the method further comprising retraining the classifier Fk with the new training set Ek'.
In one embodiment, the method for training classifiers for classifying network data streams further includes: after retraining all of classifiers F1, F2, and F3, if any of them has changed, repeating the method until none of F1, F2, and F3 changes.
In one embodiment, the method for training a classifier for network data flow classification further includes deriving an integrated classifier by using a majority voting principle when none of F1, F2, and F3 changes.
In one embodiment, extracting the stream load feature includes extracting the tf×idf value of each feature word contained in the semantic information of the data stream, where tf is the word frequency and idf is the inverse file frequency, i.e., the logarithm of the ratio of the number of streams in the training set to the number of streams containing the feature word.
In one embodiment, extracting the stream load feature includes removing feature words whose tf is below a word-frequency threshold and feature words whose idf is above an inverse-file-frequency threshold.
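A sketch of this tf×idf extraction with the two pruning thresholds might look as follows (the function name, the token-list input format, and the default threshold values are assumptions for illustration):

```python
import math
from collections import Counter

def tfidf_features(streams, tf_min=0.01, idf_max=None):
    """Per-stream tf*idf weights; feature words with tf below tf_min
    or idf above idf_max (the two thresholds in the text) are dropped.
    `streams` is a list of token lists, one list per data stream."""
    n = len(streams)
    df = Counter()  # number of streams containing each feature word
    for s in streams:
        df.update(set(s))
    vectors = []
    for s in streams:
        counts = Counter(s)
        vec = {}
        for word, c in counts.items():
            tf = c / len(s)               # word frequency within the stream
            idf = math.log(n / df[word])  # inverse file frequency
            if tf < tf_min or (idf_max is not None and idf > idf_max):
                continue
            vec[word] = tf * idf
        vectors.append(vec)
    return vectors

docs = [["GET", "HTTP", "Host"], ["USER", "PASS", "FTP"], ["GET", "HTTP", "Cookie"]]
vecs = tfidf_features(docs)
```

Stacking these per-stream dictionaries row by row gives the stream load feature matrix described later in the embodiments.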
In one embodiment, the extracting the flow statistics comprises extracting at least one of a flow interval, an intra-flow packet-to-packet interval, a packet size, a number of packets, a number of TCP identification bits, and an activation state.
In one embodiment, any of F1, F2, and F3 is based on at least one of the following algorithms: decision trees, naïve Bayes, support vector machines, association rule learning, neural networks, and genetic algorithms.
In one embodiment, F1-F3 are based on the same algorithm.
A second aspect of the present specification provides an apparatus for training a classifier for classification of network data flows, comprising: a feature extraction unit configured to extract a stream load feature U1, a stream statistical feature U2, and a stream entropy feature U3 from the data stream, respectively; a first classification unit configured to classify the data stream with a classifier Fi corresponding to a feature Ui to obtain a first classification result, where Ui is any one of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3, and i is 1, 2, or 3; a second classification unit configured to classify the data stream with a classifier Fj corresponding to a feature Uj to obtain a second classification result, where Uj is any one of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3 not equal to Ui, and j is 1, 2, or 3; and a training data acquisition unit configured to, in the case that the first classification result is the same as the second classification result, use the data stream and the first classification result as training data for training a classifier Fk corresponding to a feature Uk, where Uk is whichever of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3 is neither Ui nor Uj, and k is 1, 2, or 3.
In one embodiment, the apparatus for training a classifier for classifying network data streams further includes an initial training unit configured to: before the data stream is classified by the classifier Fi corresponding to the characteristic Ui, F1 corresponding to the stream load characteristic U1, F2 corresponding to the stream statistical characteristic U2, and F3 corresponding to the stream entropy characteristic U3 are trained by training sets E1, E2, and E3 based on the calibrated data stream set, respectively.
In an embodiment, the using the data stream and the first classification result as training data comprises adding the data stream and the first classification result to a current training set of the classifier Fk, thereby obtaining a new training set Ek 'of the classifier Fk, the apparatus further comprising a retraining unit configured to retrain the classifier Fk with the new training set Ek'.
In an embodiment, the apparatus for training a classifier for classifying a network data flow further includes an iteration unit configured to: after retraining all of classifiers F1, F2, and F3, if any of them has changed, repeat the operations performed by the apparatus until none of F1, F2, and F3 changes.
In one embodiment, the apparatus for training a classifier for network data stream classification further comprises an integration unit configured to derive an integrated classifier by using a majority voting principle when none of F1-F3 is changed.
A third aspect of the present specification provides a computer-readable storage medium having stored thereon instruction code which, when executed in a computer, causes the computer to perform the above-described method of training a classifier for network data flow classification.
A fourth aspect of the present specification provides a method of classifying a network data flow, comprising: extracting the stream characteristics Vi for the data stream, wherein Vi is any one of stream load characteristics V1, stream statistical characteristics V2 and stream entropy value characteristics V3; and inputting the stream characteristics Vi into a classifier Fi corresponding to the characteristics Vi, which is obtained by the method for training the classifier for classifying the network data stream, so as to obtain the type Ci of the data stream.
A fifth aspect of the present specification provides an apparatus for classifying a network data stream, comprising: a feature extraction unit configured to extract a stream feature Vi from the data stream, Vi being any one of a stream load feature V1, a stream statistical feature V2, and a stream entropy feature V3; and a classification unit configured to input the stream feature Vi into the classifier Fi corresponding to the feature Vi, obtained by the above method for training a classifier for network data stream classification, so as to obtain the type Ci of the data stream.
A sixth aspect of the present specification provides a computer-readable storage medium having stored thereon instruction codes, which, when executed in a computer, cause the computer to perform the above-mentioned method of classifying a network data stream.
By combining stream statistical features, stream load features, and stream entropy features, the embodiments of this specification comprehensively and deeply mine the data characteristics and behavior of network traffic. Using a collaborative learning algorithm, a small number of calibrated samples are reasonably leveraged to expand the training samples, enhancing the accuracy of the classifiers; and, borrowing the idea of ensemble learning, the majority-voting principle is used to aggregate the results of the single classifiers, further improving the accuracy and recall of the classification.
Drawings
The embodiments of the present specification may be made clearer by describing them with reference to the accompanying drawings:
FIG. 1 shows a general schematic of modules included in embodiments of the present description;
FIG. 2 shows a general schematic of the steps of an embodiment of the present description implemented in the various modules shown in FIG. 1;
FIG. 3 is a flow diagram illustrating a method of training a classifier for network data flow classification in accordance with an embodiment of the present description;
FIG. 4 shows a simple schematic diagram of the Tri-training method of training the classifier shown in FIG. 3;
FIG. 5 shows an iterative algorithm of the Tri-training method;
FIG. 6 illustrates an apparatus for training classifiers for network data flow classification in accordance with an embodiment of the present description;
FIG. 7 illustrates a method of classifying network data flows in accordance with an embodiment of the present description; and
fig. 8 illustrates an apparatus for classifying network data flows according to an embodiment of the present description.
Detailed Description
Specific embodiments of the present specification are described below with reference to the accompanying drawings.
Fig. 1 shows a general schematic diagram of modules included in the technical solution of the embodiment of the present specification. The technical scheme of the embodiment of the specification comprises four modules: a data acquisition module 11, a feature extraction module 12, a model training module 13, and a classification implementation module 14.
FIG. 2 shows a general schematic of the steps of an embodiment of the present description implemented in the various modules shown in FIG. 1.
As shown in fig. 2, in the data acquisition module 11, packets are captured at the MAC layer and TCP streams are reassembled to obtain a set of network data streams; the set is divided into a calibrated set L and an uncalibrated set U, and the data streams in L are calibrated. In the feature extraction module 12, the stream load feature, stream statistical feature, and stream entropy feature of each data stream in the calibrated set L and the uncalibrated set U are extracted and vectorized for input to the classifiers in the following steps. The model training module 13 comprises: training a single classifier for each of the stream load, stream statistical, and stream entropy features; obtaining three strong classifiers through collaborative learning among the three single classifiers; and obtaining a strong integrated classifier through the majority-voting principle. In the classification implementation module 14, the network data stream may be classified using a single classifier or the integrated classifier obtained in the model training module.
The stream statistical, stream load, and stream entropy features are all network data stream features: the statistical features measure the behavior of the data stream, the load features capture the semantics of the message content, and the entropy features measure the purity of the data stream. By combining the three, the data characteristics and behavior of network traffic can be mined comprehensively and deeply. Training the classifiers with a collaborative learning algorithm reduces the amount of data that must be calibrated, lowering the calibration cost, while reasonably selected uncalibrated samples enhance the classification accuracy. In addition, through ensemble learning, the final classification result is obtained by majority voting over several single classifiers, improving both classification accuracy and recall.
Hereinafter, examples of the present specification will be described in more detail.
Fig. 3 is a flow diagram illustrating a method of training a classifier for network data flow classification according to an embodiment of the present description.
The classification accuracy of a network data flow classifier depends to a great extent on the quality of the training-set samples. Network data flows are complex and numerous, calibrating data samples is time-consuming and labor-intensive, and a large number of samples cannot be calibrated; yet a classifier trained on only a few samples performs poorly. Therefore, how to obtain an accurate classification model using a small number of calibrated samples is a technical problem. In the embodiment of the present specification shown in fig. 3, the Tri-training method, which applies the idea of collaborative learning, is used for this purpose.
As shown in fig. 3, in step 31, a stream load feature U1, a stream statistical feature U2, and a stream entropy feature U3 are extracted from the data stream, respectively. In step 32, the data stream is classified using a classifier Fi corresponding to a feature Ui, where Ui is any one of the stream load feature U1, the stream statistical feature U2, and the stream entropy feature U3 and i is 1, 2, or 3, to obtain a first classification result. In step 33, the data stream is classified using a classifier Fj corresponding to a feature Uj, where Uj is any one of those three features not equal to Ui and j is 1, 2, or 3, to obtain a second classification result. In step 34, in the case that the first classification result is the same as the second classification result, the data stream and the first classification result are used as training data for training the classifier Fk corresponding to the feature Uk, where Uk is whichever of the three features is neither Ui nor Uj and k is 1, 2, or 3.
The stream load feature is the data payload of the network stream excluding the protocol header, and contains rich semantic information about the communicated data. After the streams are calibrated, a feature word set t = {t1, t2, … tn} is extracted, and each stream's message can be represented as a vector over the feature words: V(d) = {(t1, w1), (t2, w2), … (tn, wn)}, where wi is the weight coefficient of feature word ti, taken as its tf·idf value. Here tf is the word frequency, i.e., the ratio of the number of times a feature word appears in a given data stream to the number of valid words in that stream; idf is the inverse file frequency, i.e., the logarithm of the ratio of the number of streams in the training set to the number of streams containing the feature word; and the tf·idf value is the product of tf and idf. Feature words whose tf is below the word-frequency threshold and words whose idf is above the inverse-file-frequency threshold are removed. In the embodiment of the present specification, a stream load feature matrix is constructed with the stream data as row vectors and the tf·idf values of the feature words as column vectors. It should be understood that this calculation of the stream load feature vector is only exemplary; the stream load feature vector may also be computed in other ways known to those skilled in the art.
Table 1 shows an example of feature words included in a partial data stream.
TABLE 1 (presented as an image in the original; the feature-word examples are not reproduced here)
In one embodiment, the feature words are stored in a feature word database for use in calculating flow load features.
The flow statistical features are a set of measures computed from the network behavior of the data flow. Common flow statistics include the flow time interval, the packet time intervals within the flow, packet size, packet count, TCP flag count, and activation state. Further statistics can be derived from these features; taking packet size as an example, the maximum, minimum, mean, and variance of the byte counts of the flow's packets can be computed. According to the direction of the data flow, the features can further be divided into forward-flow and backward-flow features. Typical flow statistics are shown in Table 2. In the embodiment of the present description, a flow statistical feature matrix is constructed with the flow data as row vectors and the statistical feature values as column vectors.
TABLE 2 (presented as an image in the original; the typical flow statistical features are not reproduced here)
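The per-direction packet-size statistics described above can be sketched as follows (the packet tuple format and the chosen field names are illustrative assumptions, not the patent's exact feature set):

```python
import statistics

def flow_statistics(packets):
    """Per-flow statistics over packet sizes, split into forward and
    backward directions as in Table 2. Each packet is a tuple
    (timestamp, size_in_bytes, direction) with direction 'fwd' or 'bwd'."""
    times = [t for t, _, _ in packets]
    gaps = [b - a for a, b in zip(times, times[1:])]  # packet time intervals
    feats = {"duration": times[-1] - times[0], "n_packets": len(packets)}
    for d in ("fwd", "bwd"):
        sizes = [s for _, s, dd in packets if dd == d]
        if sizes:
            feats[f"{d}_max"] = max(sizes)
            feats[f"{d}_min"] = min(sizes)
            feats[f"{d}_mean"] = statistics.mean(sizes)
            feats[f"{d}_var"] = statistics.pvariance(sizes)
    if gaps:
        feats["mean_gap"] = statistics.mean(gaps)
    return feats

pkts = [(0.0, 60, "fwd"), (0.1, 1500, "bwd"), (0.3, 1500, "bwd"), (0.4, 52, "fwd")]
feats = flow_statistics(pkts)
```

Collecting one such dictionary per flow, with the values as columns, yields the flow statistical feature matrix.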
The flow entropy value represents the degree of disorder of the flow data. The standard calculation formula is well known to those skilled in the art. Specifically, let F denote a data flow message, let F_k denote the set of all consecutive k-character substrings of the message, and let h_k denote the entropy corresponding to F_k, computed as

h_k = −Σ_{f ∈ F_k} p(f) · log p(f),

where p(f) is the relative frequency of substring f (the formula appears only as an image in the original; this is the standard Shannon-entropy reading). According to this formula, for a flow F containing an m-byte message, an entropy feature set H_m = {h_1, h_2, … h_n} can be obtained. In the embodiment of the present description, a flow entropy feature matrix is constructed with the flow data as row vectors and the entropy values for different k as column vectors.
The classifier Fi corresponding to the feature Ui shown in fig. 3 refers to a single classifier obtained by training on one of the stream load, stream statistical, and stream entropy features of the calibrated sample set. In one embodiment, the collected data streams are divided into a calibrated set L and an uncalibrated set U, and the data streams in L are calibrated. In one embodiment, network data streams may be calibrated into ten types: FTP, HTTP, SMTP, IMAP, SSH, POP3, BitTorrent, DNS, KuGou, and PPLive. In one embodiment, the number of data streams in the calibrated set is on the order of hundreds, while that of the uncalibrated set is on the order of hundreds of thousands, so the technical scheme of this embodiment clearly reduces the calibration cost greatly. Stream load features are extracted from the data streams in the calibrated set L and vectorized together with the calibrated types to obtain a training set E1; stream statistical features are likewise extracted and vectorized to obtain a training set E2; and stream entropy features are likewise extracted and vectorized to obtain a training set E3. The training sets E1-E3 are then used to train Fi (i = 1:3), yielding initial classifiers F1-F3 corresponding to the stream load, stream statistical, and stream entropy features, respectively.
In one embodiment, each Fi (i = 1:3) is a classification model based on at least one of the following algorithms: decision trees, naïve Bayes, support vector machines, association rule learning, neural networks, and genetic algorithms. In another embodiment, F1-F3 are based on the same algorithm. In another embodiment, F1-F3 are based on different algorithms.
Fig. 4 shows a simple schematic diagram of the Tri-training method of training the classifier shown in fig. 3. F1-F3 take turns serving as the main classifier, with the other two acting as collaborative classifiers that enhance the main classifier's training set. Taking F3 as an example, the collaborative classifiers F1 and F2 classify and calibrate each sample in the uncalibrated traffic set; if their calibration results are the same, the sample and the calibration result are added to the training set E3 of F3. After the uncalibrated set U has been classified by F1 and F2, a new training set E3' of F3 is obtained, to be used later for retraining F3. New training sets E1' and E2' for F1 and F2 are obtained in the same way. Classifiers F1-F3 are then retrained with the new training sets E1', E2', and E3', respectively, to obtain enhanced classifiers F1-F3.
In one embodiment, whether each enhanced classifier has changed relative to the classifier before retraining is determined as follows: the enhanced classifiers Fi (i = 1:3) and Fj (j = 1:3, j ≠ i) are used again in the Tri-training algorithm to calibrate the uncalibrated set U, and it is checked whether any sample u can be added to the training set Ek (k = 1:3, k ≠ i, k ≠ j), i.e., whether a new training set Ek' can be obtained. If no such sample exists and no new training set Ek' can be obtained, Fk has not changed relative to the classifier before retraining. If any of the classifiers F1-F3 has changed, the above method is repeated until none of F1, F2, and F3 changes after it is performed, at which point the algorithm ends. Through multiple rounds of training and sample-update iterations, three strong classifiers F1-F3 are obtained.
Fig. 5 shows the iterative algorithm of the Tri-training method described above. As shown in fig. 5, in step 51, for each Fi (i = 1:3), the uncalibrated set U is classified using the other two classifiers Fj (j = 1:3, j ≠ i) and Fk (k = 1:3, k ≠ i, k ≠ j).
In step 52, if the calibration results of classifiers Fj (j = 1:3, j ≠ i) and Fk (k = 1:3, k ≠ i, k ≠ j) are the same for a sample u in the uncalibrated set U, the sample u and the calibration result are added to the training set Ei (i = 1:3) of Fi and u is removed from the uncalibrated set U, yielding a new training set Ei' (i = 1:3) for each Fi.
In step 53, each classifier Fi (i = 1:3) is retrained with its new training set Ei' (i = 1:3).
In step 54, it is determined whether any of F1, F2, and F3 has changed; if so, steps 51-53 are repeated until none of F1-F3 changes, yielding the strong classifiers Fi (i = 1:3).
However, the classification accuracy of a single classifier varies greatly across different classification sets, and overfitting may also occur on a single classification set. Ensemble learning trains different classifiers by means of sample-set sampling, feature-set selection, classification-algorithm selection, and the like, and then aggregates their results by majority voting and similar principles; this not only improves classification accuracy but also effectively avoids the overfitting of a single classifier. In the embodiment of the specification, three differentiated single classifiers are obtained by applying different classification algorithms or classification features to the same training set, and the final classification result for a sample is then obtained by the majority-voting principle.
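A minimal sketch of the majority-voting aggregation (the tie-breaking rule, falling back to the first classifier when all three disagree, is an assumption; the text only names the majority-voting principle):

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate single-classifier outputs by majority vote; on a
    three-way tie the first classifier's label wins (assumed tie-break)."""
    label, n = Counter(labels).most_common(1)[0]
    return labels[0] if n == 1 else label

def ensemble_classify(sample, classifiers):
    """Final classification of a sample by the integrated classifier."""
    return majority_vote([f(sample) for f in classifiers])

vote = majority_vote(["HTTP", "HTTP", "FTP"])
```

With three classifiers, any label shared by at least two of them wins, which is what smooths out the bias of any single classifier.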
Fig. 6 illustrates an apparatus 600 for training a classifier for network data flow classification according to an embodiment of the present specification, including: a feature extraction unit 61 configured to: respectively extract a stream load feature U1, a stream statistical feature U2 and a stream entropy feature U3 from the data stream; a first classification unit 62 configured to: classify the data stream using a classifier Fi corresponding to a feature Ui to obtain a first classification result, wherein Ui is any one of the stream load feature U1, the stream statistical feature U2 and the stream entropy feature U3, and i is 1, 2 or 3; a second classification unit 63 configured to: classify the data stream using a classifier Fj corresponding to a feature Uj to obtain a second classification result, wherein Uj is any one of the stream load feature U1, the stream statistical feature U2 and the stream entropy feature U3 other than Ui, and j is 1, 2 or 3; and a training data acquisition unit 64 configured to: in the case that the first classification result is the same as the second classification result, take the data stream and the first classification result as training data for training a classifier Fk corresponding to a feature Uk, wherein Uk is the one of the stream load feature U1, the stream statistical feature U2 and the stream entropy feature U3 other than Ui and Uj, and k is 1, 2 or 3.
In one embodiment, the apparatus 600 for training a classifier for network data flow classification according to an embodiment of the present specification further includes an initial training unit 65 configured to: before the data stream is classified by the classifier Fi corresponding to the feature Ui, train F1 corresponding to the stream load feature U1, F2 corresponding to the stream statistical feature U2 and F3 corresponding to the stream entropy feature U3 with training sets E1, E2 and E3, respectively, based on the calibrated data stream set.
In an embodiment, taking the data stream and the first classification result as training data includes adding the data stream and the first classification result to a current training set of the classifier Fk to obtain a new training set Ek' of the classifier Fk; the apparatus 600 for training a classifier for network data stream classification according to an embodiment of the present specification further includes a retraining unit 66 configured to retrain the classifier Fk with the new training set Ek'.
In one embodiment, the apparatus 600 for training a classifier for network data flow classification according to the present specification further includes an iteration unit 67 configured to: after the classifiers F1, F2 and F3 are trained, if any of the classifiers F1, F2 and F3 has changed, repeat the operations performed by the above apparatus until none of the classifiers F1, F2 and F3 changes.
In one embodiment, the apparatus 600 for training classifiers for network data stream classification according to the present specification further comprises an integrating unit 68 configured to derive an integrated classifier by the majority voting principle when none of F1-F3 changes any more.
The dashed boxes in fig. 6 indicate that the corresponding units are optional rather than essential in this embodiment. For example, the apparatus 600 in the embodiments of the present specification may omit the initial training unit 65; instead of obtaining the initial single classifiers Fi by training on the calibration set, they may be obtained in other ways known to those skilled in the art. Likewise, the retraining unit 66, the iteration unit 67 and the integrating unit 68 are optional and not necessary in this embodiment. The dashed boxes in figs. 7 and 8 below have the same meaning.
Fig. 7 illustrates a method for classifying network data flows according to an embodiment of the present specification, including the following steps: step 71, extracting a stream feature Vi from the data stream, wherein Vi is any one of a stream load feature V1, a stream statistical feature V2 and a stream entropy feature V3; and step 72, inputting the stream feature Vi into the classifier Fi corresponding to the feature Vi obtained by the above method of training a classifier for network data flow classification, so as to obtain the type Ci of the data stream.
In one embodiment, the method for classifying a network data stream shown in fig. 7 further includes a step 73 of obtaining the final type of the data stream by the majority voting principle from the obtained types C1-C3 of the data stream.
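Steps 71-73 can be wired together roughly as follows; the `extractors`, `classifiers` and stub names are assumptions for illustration, not from the patent:

```python
from collections import Counter

class ConstClassifier:
    """Stub classifier always predicting one type; stands in for a trained Fi."""
    def __init__(self, label):
        self.label = label
    def predict(self, X):
        return [self.label for _ in X]

def classify_flow(flow, extractors, classifiers):
    """Step 71: extract each view Vi of the flow; step 72: Ci = Fi(Vi);
    step 73: majority vote over C1-C3 (ties fall back to C1 -- a
    tie-breaking choice made here, not specified in the text)."""
    types = [clf.predict([extract(flow)])[0]
             for extract, clf in zip(extractors, classifiers)]
    label, votes = Counter(types).most_common(1)[0]
    return label if votes > 1 else types[0]
```

In the embodiment, the three extractors would produce the stream load, stream statistical, and stream entropy feature vectors V1-V3, and the classifiers would be the strong classifiers F1-F3 produced by the training method above.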
Fig. 8 illustrates an apparatus 800 for classifying network data flows according to an embodiment of the present specification, including: a feature extraction unit 81 configured to extract, from the data stream, a stream feature Vi, Vi being any one of a stream load feature V1, a stream statistical feature V2 and a stream entropy feature V3; and a classification unit 82 configured to input the stream feature Vi into the classifier Fi corresponding to the feature Vi obtained by the above method of training a classifier for network data flow classification, so as to obtain the type Ci of the data stream.
In one embodiment, the apparatus 800 for classifying network data streams shown in fig. 8 further includes an integrating unit 83 configured to derive the final type of the data stream by the majority voting principle from the obtained types C1-C3 of the data stream.
In another aspect, embodiments of the present specification also provide a computer-readable storage medium having instruction codes stored thereon, which when executed in a computer, cause the computer to perform the above-mentioned method of training a classifier for network data flow classification.
In yet another aspect, embodiments of the present specification also provide a computer-readable storage medium having computer instruction code stored thereon, which, when executed in a computer, causes the computer to perform the above-mentioned method for classifying network data streams.
The method and the device in the embodiment of the present disclosure may be deployed in any network environment, and classify and analyze traffic of the network environment.
The embodiments of the present specification comprehensively and deeply mine the data characteristics and behavior of network traffic by combining traffic statistical features, traffic load features and traffic entropy features. The embodiments also use a co-training algorithm which, starting from a small number of calibrated samples, reasonably introduces additional samples to expand the training set, enhancing the accuracy of the classifiers. In addition, the embodiments use an ensemble learning algorithm that aggregates the classification results of the single classifiers by the majority voting principle, further improving the precision and recall of the classifier.
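As one plausible concrete form of the traffic entropy feature mentioned above, the Shannon entropy of a flow's payload bytes can be computed as below; treating the payload as a byte string is an assumption made here for illustration:

```python
import math
from collections import Counter

def payload_entropy(payload: bytes) -> float:
    """Shannon entropy of the payload byte distribution, in bits per byte
    (0 for a constant payload, up to 8 for a uniform one). Encrypted or
    compressed flows tend toward the maximum; plain text sits much lower."""
    if not payload:
        return 0.0
    n = len(payload)
    counts = Counter(payload)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Such a per-flow entropy value could serve as one component of the entropy feature vector U3/V3 alongside the load and statistical features.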
It will be further appreciated by those of ordinary skill in the art that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two; the components and steps of the examples have been described above in functional terms in order to clearly illustrate the interchangeability of hardware and software. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments describe the objects, technical solutions and advantages of the present invention in further detail. It should be understood that the above are merely exemplary embodiments of the present invention and are not intended to limit the scope of the present invention; any modification, equivalent substitution, improvement and the like made within the spirit and principle of the present invention shall be included in the scope of the present invention.

Claims (26)

1. A method of training a classifier for classification of network data flows, comprising the steps of:
respectively extracting a flow load characteristic U1, a flow statistical characteristic U2 and a flow entropy value characteristic U3 from the data flow;
classifying the data stream by using a classifier Fi corresponding to a characteristic Ui to obtain a first classification result, wherein Ui is any one of the flow load characteristic U1, the flow statistic characteristic U2 and the flow entropy characteristic U3, and i is 1, 2 or 3;
classifying the data stream by adopting a classifier Fj corresponding to a characteristic Uj to obtain a second classification result, wherein Uj is any one of the stream load characteristic U1, the stream statistical characteristic U2 and the stream entropy characteristic U3 which is not equal to Ui, and j is 1, 2 or 3;
in the case that the first classification result is the same as the second classification result, using the data stream and the first classification result as training data for training a classifier Fk corresponding to a feature Uk, wherein Uk is the one of the stream load feature U1, the stream statistical feature U2 and the stream entropy feature U3 other than Ui and Uj, and k is 1, 2 or 3.
2. The method of training a classifier for classification of a network data flow according to claim 1, further comprising, prior to said classifying the data flow with a classifier Fi corresponding to a feature Ui: f1 corresponding to the flow load characteristic U1, F2 corresponding to the flow statistic characteristic U2 and F3 corresponding to the flow entropy characteristic U3 are respectively trained by training sets E1, E2 and E3 based on the calibrated data flow set.
3. The method of training a classifier for classification of a network data stream according to claim 1 or 2, wherein taking the data stream and the first classification result as training data comprises adding the data stream and the first classification result to a current training set of the classifier Fk, thereby obtaining a new training set Ek' of the classifier Fk, the method further comprising retraining the classifier Fk with the new training set Ek'.
4. The method of training classifiers for network data flow classification as recited in claim 3 further comprising, after retraining all classifiers F1, F2, and F3, if any of classifiers F1, F2, and F3 changes, repeating the method until none of classifiers F1, F2, and F3 changes.
5. The method of training a classifier for network data flow classification as claimed in claim 4 further comprising deriving an integrated classifier by using majority voting rules when none of F1, F2, and F3 changes anymore.
6. The method for training a classifier for classifying network data streams according to any one of claims 1-2 and 4-5, wherein the extracting stream load features includes extracting tf-idf values of feature words included in semantic information of the data streams, wherein tf is the word frequency and idf is the inverse document frequency, i.e. the logarithm of the ratio of the number of streams in the training set to the number of streams containing the feature word.
7. The method of training a classifier for network data flow classification of claim 6, wherein the extracting stream load features includes removing feature words whose tf is below a word frequency threshold and feature words whose idf is above an inverse document frequency threshold.
8. The method of training a classifier for classification of network data flows according to any of claims 1-2, 4-5, wherein the extracting flow statistical features includes extracting at least one of a flow time interval, an intra-flow packet time interval, a packet size, a number of packets, a number of TCP flag bits, and an activation state.
9. The method of training a classifier for network data flow classification of any of claims 1-2, 4-5, 7, wherein any one of the classifiers F1, F2 and F3 is based on at least one of the following algorithms: decision tree, naive Bayes, support vector machine, association rule learning, neural network, and genetic algorithm.
10. The method of training a classifier for network data flow classification of claim 9, wherein F1, F2 and F3 are based on the same algorithm.
11. An apparatus to train a classifier for network data flow classification, comprising:
a feature extraction unit configured to: respectively extracting a flow load characteristic U1, a flow statistical characteristic U2 and a flow entropy value characteristic U3 from the data flow;
a first classification unit configured to: classifying the data stream by using a classifier Fi corresponding to a characteristic Ui to obtain a first classification result, wherein Ui is any one of the flow load characteristic U1, the flow statistic characteristic U2 and the flow entropy characteristic U3, and i is 1, 2 or 3;
a second classification unit configured to: classifying the data stream by adopting a classifier Fj corresponding to a characteristic Uj to obtain a second classification result, wherein Uj is any one of the stream load characteristic U1, the stream statistical characteristic U2 and the stream entropy characteristic U3 which is not equal to Ui, and j is 1, 2 or 3;
a training data acquisition unit configured to: in the case that the first classification result is the same as the second classification result, use the data stream and the first classification result as training data for training a classifier Fk corresponding to a feature Uk, wherein Uk is the one of the stream load feature U1, the stream statistical feature U2 and the stream entropy feature U3 other than Ui and Uj, and k is 1, 2 or 3.
12. The apparatus for training a classifier for network data flow classification as claimed in claim 11 further comprising an initial training unit configured to: before the data stream is classified by the classifier Fi corresponding to the characteristic Ui, F1 corresponding to the stream load characteristic U1, F2 corresponding to the stream statistical characteristic U2, and F3 corresponding to the stream entropy characteristic U3 are trained by training sets E1, E2, and E3 based on the calibrated data stream set, respectively.
13. The apparatus for training a classifier for classification of a network data stream according to claim 11 or 12, wherein taking the data stream and the first classification result as training data comprises adding the data stream and the first classification result to a current training set of a classifier Fk, thereby obtaining a new training set Ek 'of the classifier Fk, the apparatus further comprising a retraining unit configured to retrain the classifier Fk with the new training set Ek'.
14. The apparatus for training a classifier for network data flow classification of claim 13, further comprising an iteration unit configured to: after retraining all classifiers F1, F2, and F3, if any of classifiers F1, F2, and F3 change, the operations performed by the apparatus are repeated until no more changes occur to any of classifiers F1, F2, and F3.
15. The apparatus for training classifiers for network data flow classification as claimed in claim 14 further comprising an integration unit configured to derive an integrated classifier by using majority voting discipline when none of F1, F2 and F3 changes anymore.
16. The apparatus for training a classifier for classifying network data streams according to any one of claims 11-12 and 14-15, wherein the extracting stream load features includes extracting tf-idf values of feature words included in semantic information of the data stream, wherein tf is the word frequency and idf is the inverse document frequency, i.e. the logarithm of the ratio of the number of streams in the training set to the number of streams containing the feature word.
17. The apparatus for training a classifier for network data flow classification of claim 16, wherein the extracting stream load features includes removing feature words whose tf is below a word frequency threshold and feature words whose idf is above an inverse document frequency threshold.
18. The apparatus for training a classifier for classification of network data flows according to any of claims 11-12, 14-15, wherein the extracting flow statistical features includes extracting at least one of a flow time interval, an intra-flow packet time interval, a packet size, a number of packets, a number of TCP flag bits, and an activation state.
19. The apparatus for training a classifier for network data flow classification as claimed in any one of claims 11-12, 14-15, 17, wherein any one of the classifiers F1, F2 and F3 is based on at least one of the following algorithms: decision tree, naive Bayes, support vector machine, association rule learning, neural network, and genetic algorithm.
20. The apparatus for training a classifier for network data flow classification of claim 19, wherein F1, F2 and F3 are based on the same algorithm.
21. A computer-readable storage medium having stored thereon instruction code, which, when executed in a computer, causes the computer to perform the method of any one of claims 1-10.
22. A method of classifying a network data stream, comprising:
Extracting the stream characteristics Vi for the data stream, wherein Vi is any one of stream load characteristics V1, stream statistical characteristics V2 and stream entropy value characteristics V3; and
inputting said stream characteristics Vi into a classifier Fi corresponding to characteristics Vi obtained by the method according to any one of claims 1-10, to obtain the type Ci of said data stream.
23. The method of classifying a network data stream as recited in claim 22, further comprising deriving the final type of the data stream by the majority voting principle from the obtained types C1-C3 of the data stream.
24. An apparatus for classifying a network data stream, comprising:
A feature extraction unit configured to extract, for the data stream, a stream feature Vi, wherein Vi is any one of a stream load feature V1, a stream statistical feature V2 and a stream entropy value feature V3; and
a classification unit configured to input the stream characteristics Vi into a classifier Fi corresponding to the characteristics Vi obtained by the method according to any one of claims 1 to 10 to obtain the type Ci of the data stream.
25. The apparatus for classifying a network data stream according to claim 24, further comprising an integrating unit configured to derive the final type of the data stream by the majority voting principle from the obtained types C1-C3 of the data stream.
26. A computer-readable storage medium having stored thereon instruction code, which, when executed in a computer, causes the computer to perform the method of claim 22 or 23.
CN201711158988.4A 2017-11-20 2017-11-20 Method and device for classifying network data streams Active CN107967311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711158988.4A CN107967311B (en) 2017-11-20 2017-11-20 Method and device for classifying network data streams


Publications (2)

Publication Number Publication Date
CN107967311A CN107967311A (en) 2018-04-27
CN107967311B true CN107967311B (en) 2021-06-29

Family

ID=62001312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711158988.4A Active CN107967311B (en) 2017-11-20 2017-11-20 Method and device for classifying network data streams

Country Status (1)

Country Link
CN (1) CN107967311B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359109B (en) * 2018-08-23 2022-05-27 创新先进技术有限公司 Data processing method and system based on distributed stream computing
CN109309630B (en) * 2018-09-25 2021-09-21 深圳先进技术研究院 Network traffic classification method and system and electronic equipment
CN110059726A (en) * 2019-03-22 2019-07-26 中国科学院信息工程研究所 The threat detection method and device of industrial control system
CN112560878A (en) * 2019-09-10 2021-03-26 华为技术有限公司 Service classification method and device and Internet system
CN110781950B (en) * 2019-10-23 2023-06-30 新华三信息安全技术有限公司 Message processing method and device
CN112836214A (en) * 2019-11-22 2021-05-25 南京聚铭网络科技有限公司 Communication protocol hidden channel detection method
CN112380406B (en) * 2020-11-15 2022-11-18 杭州光芯科技有限公司 Real-time network traffic classification method based on crawler technology
CN112423324B (en) * 2021-01-22 2021-04-30 深圳市科思科技股份有限公司 Wireless intelligent decision communication method, device and system

Citations (6)

Publication number Priority date Publication date Assignee Title
CN102315974A (en) * 2011-10-17 2012-01-11 北京邮电大学 Stratification characteristic analysis-based method and apparatus thereof for on-line identification for TCP, UDP flows
US8311956B2 (en) * 2009-08-11 2012-11-13 At&T Intellectual Property I, L.P. Scalable traffic classifier and classifier training system
CN103870751A (en) * 2012-12-18 2014-06-18 中国移动通信集团山东有限公司 Method and system for intrusion detection
CN104270392A (en) * 2014-10-24 2015-01-07 中国科学院信息工程研究所 Method and system for network protocol recognition based on tri-classifier cooperative training learning
CN106559261A (en) * 2016-11-03 2017-04-05 国网江西省电力公司电力科学研究院 A kind of substation network intrusion detection of feature based fingerprint and analysis method
CN106657141A (en) * 2017-01-19 2017-05-10 西安电子科技大学 Android malware real-time detection method based on network flow analysis

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20100034102A1 (en) * 2008-08-05 2010-02-11 At&T Intellectual Property I, Lp Measurement-Based Validation of a Simple Model for Panoramic Profiling of Subnet-Level Network Data Traffic


Non-Patent Citations (2)

Title
Traffic classification using clustering algorithms; Jeffrey Erman et al.; Proceedings of the 2006 SIGCOMM workshop on Mining network data; 20060930; full text *
Research on Network Traffic Classification Based on Multiple Classifiers; Zhang Wei; China Masters' Theses Full-text Database, Information Science and Technology; 20160630; full text *

Also Published As

Publication number Publication date
CN107967311A (en) 2018-04-27

Similar Documents

Publication Publication Date Title
CN107967311B (en) Method and device for classifying network data streams
CN110597734B (en) Fuzzy test case generation method suitable for industrial control private protocol
CN108900432B (en) Content perception method based on network flow behavior
CN109218223B (en) Robust network traffic classification method and system based on active learning
CN113158390B (en) Network attack traffic generation method for generating countermeasure network based on auxiliary classification
JP7082533B2 (en) Anomaly detection method and anomaly detection device
CN108632278A (en) A kind of network inbreak detection method being combined with Bayes based on PCA
CN110611640A (en) DNS protocol hidden channel detection method based on random forest
CN109753797B (en) Dense subgraph detection method and system for stream graph
CN111224946A (en) TLS encrypted malicious traffic detection method and device based on supervised learning
CN114124482A (en) Access flow abnormity detection method and device based on LOF and isolated forest
Xiao et al. Novel dynamic multiple classification system for network traffic
Li et al. Improving attack detection performance in NIDS using GAN
CN113821793A (en) Multi-stage attack scene construction method and system based on graph convolution neural network
CN114172688A (en) Encrypted traffic network threat key node automatic extraction method based on GCN-DL
CN107832611B (en) Zombie program detection and classification method combining dynamic and static characteristics
Perona et al. Service-independent payload analysis to improve intrusion detection in network traffic
CN110311870B (en) SSL VPN flow identification method based on density data description
CN117318980A (en) Small sample scene-oriented self-supervision learning malicious traffic detection method
CN115334005B (en) Encryption flow identification method based on pruning convolutional neural network and machine learning
Huizinga Using machine learning in network traffic analysis for penetration testing auditability
Bienvenu et al. The Moran forest
CN112839051A (en) Encryption flow real-time classification method and device based on convolutional neural network
CN114118255B (en) Unknown protocol cluster analysis method, device and medium based on spectral clustering
CN117527446B (en) Network abnormal flow refined detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1253991

Country of ref document: HK

TA01 Transfer of patent application right

Effective date of registration: 20201020

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201020

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant