Summary of the Invention
The present invention aims to provide a method and an apparatus that classify network data flows with high accuracy and high recall, at a lower sample-labeling cost and using more comprehensive network data flow features.
To achieve the above object, a first aspect of this specification provides a method for training a classifier for network data flow classification, comprising the following steps: extracting a flow payload feature U1, a flow statistical feature U2, and a flow entropy feature U3 from a data flow; classifying the data flow with a classifier Fi corresponding to feature Ui to obtain a first classification result, where Ui is any one of the flow payload feature U1, the flow statistical feature U2, and the flow entropy feature U3, and i = 1, 2, or 3; classifying the data flow with a classifier Fj corresponding to feature Uj to obtain a second classification result, where Uj is any one of U1, U2, and U3 that is not Ui, and j = 1, 2, or 3; and, when the first classification result is identical to the second classification result, taking the data flow together with the first classification result as training data for training a classifier Fk corresponding to feature Uk, where Uk is the one of U1, U2, and U3 other than Ui and Uj, and k = 1, 2, or 3.
In one embodiment, before the data flow is classified with the classifier Fi corresponding to feature Ui, the method further includes: training, with training sets E1, E2, and E3 based on a labeled set of data flows, the classifier F1 corresponding to the flow payload feature U1, the classifier F2 corresponding to the flow statistical feature U2, and the classifier F3 corresponding to the flow entropy feature U3, respectively.
In one embodiment, taking the data flow and the first classification result as training data includes adding the data flow and the first classification result to the current training set of the classifier Fk, so as to obtain a new training set Ek' of the classifier Fk, and the method further includes retraining the classifier Fk with the new training set Ek'.
In one embodiment, above-mentioned training is further included for the method for the grader of network data flow classification, is further included,
After training is all re-started to grader F1, F2 and F3, if any of grader F1, F2 and F3 change, repeat
The method is carried out, until grader F1, F2 and F3 no longer change.
In one embodiment, above-mentioned training is further included for the method for the grader of network data flow classification, as F1, F2
When all no longer changing with F3, integrated classifier is drawn by using most Voting principles.
In one embodiment, extracting the flow payload feature includes extracting the tf*idf values of the feature words contained in the semantic information of the data flow, where tf is the term frequency and idf is the inverse document frequency, i.e., the logarithm of the ratio of the number of flows in the training set to the number of flows containing the feature word.
In one embodiment, extracting the flow payload feature includes removing feature words whose tf is below a term-frequency threshold and feature words whose idf is above an inverse-document-frequency threshold.
In one embodiment, extracting the flow statistical feature includes extracting at least one of the flow time interval, the inter-packet time interval within a flow, the packet size, the number of packets, the number of TCP flags, and the activation state.
In one embodiment, each of F1, F2, and F3 is based on at least one of the following algorithms: decision tree, naive Bayes, support vector machine, association rule learning, neural network, and genetic algorithm. In one embodiment, F1-F3 are based on the same algorithm.
A second aspect of this specification provides an apparatus for training a classifier for network data flow classification, including: a feature extraction unit configured to extract a flow payload feature U1, a flow statistical feature U2, and a flow entropy feature U3 from a data flow; a first classification unit configured to classify the data flow with a classifier Fi corresponding to feature Ui to obtain a first classification result, where Ui is any one of the flow payload feature U1, the flow statistical feature U2, and the flow entropy feature U3, and i = 1, 2, or 3; a second classification unit configured to classify the data flow with a classifier Fj corresponding to feature Uj to obtain a second classification result, where Uj is any one of U1, U2, and U3 that is not Ui, and j = 1, 2, or 3; and a training data acquisition unit configured to, when the first classification result is identical to the second classification result, take the data flow and the first classification result as training data for training a classifier Fk corresponding to feature Uk, where Uk is the one of U1, U2, and U3 other than Ui and Uj, and k = 1, 2, or 3.
In one embodiment, the above apparatus for training a classifier for network data flow classification further includes an initial training unit configured to, before the data flow is classified with the classifier Fi corresponding to feature Ui, train, with training sets E1, E2, and E3 based on a labeled set of data flows, the classifier F1 corresponding to the flow payload feature U1, the classifier F2 corresponding to the flow statistical feature U2, and the classifier F3 corresponding to the flow entropy feature U3, respectively.
In one embodiment, taking the data flow and the first classification result as training data includes adding the data flow and the first classification result to the current training set of the classifier Fk, so as to obtain a new training set Ek' of the classifier Fk, and the apparatus further includes a retraining unit configured to retrain the classifier Fk with the new training set Ek'.
In one embodiment, the above apparatus for training a classifier for network data flow classification further includes an iteration unit configured to, after the classifiers F1, F2, and F3 have all been retrained, repeat the operations performed by the apparatus if any of F1, F2, and F3 has changed, until F1, F2, and F3 no longer change.
In one embodiment, the above apparatus for training a classifier for network data flow classification further includes an ensemble unit configured to obtain an ensemble classifier by applying a majority-voting rule when F1-F3 no longer change.
A third aspect of this specification provides a computer-readable storage medium having instruction code stored thereon which, when executed in a computer, causes the computer to perform the above method for training a classifier for network data flow classification.
A fourth aspect of this specification provides a method for classifying a network data flow, including: extracting a flow feature Vi from a data flow, where Vi is any one of a flow payload feature V1, a flow statistical feature V2, and a flow entropy feature V3; and inputting the flow feature Vi into the classifier Fi corresponding to feature Vi obtained by the above method for training a classifier for network data flow classification, to obtain a type Ci of the data flow.
A fifth aspect of this specification provides an apparatus for classifying a network data flow, including: a feature extraction unit configured to extract a flow feature Vi from a data flow, where Vi is any one of a flow payload feature V1, a flow statistical feature V2, and a flow entropy feature V3; and a classification unit configured to input the flow feature Vi into the classifier Fi corresponding to feature Vi obtained by the above method for training a classifier for network data flow classification, to obtain a type Ci of the data flow.
A sixth aspect of this specification provides a computer-readable storage medium having instruction code stored thereon which, when executed in a computer, causes the computer to perform the above method for classifying a network data flow.
By combining flow statistical features, flow payload features, and flow entropy features, the embodiments of this specification mine the data characteristics and behavior of network traffic comprehensively and in depth. By using a co-training algorithm, only a small number of labeled samples are required: unlabeled samples that are reliably labeled by the co-training process are introduced to expand the training set, which improves classifier accuracy. In addition, drawing on the idea of ensemble learning, the classification results of the individual classifiers are aggregated by majority voting, further improving classifier accuracy and recall.
Detailed Description of Embodiments
Specific embodiments of this specification are described below with reference to the accompanying drawings.
Fig. 1 shows a general schematic diagram of the modules included in the technical solution of an embodiment of this specification. The technical solution of this embodiment includes four modules: a data collection module 11, a feature extraction module 12, a model training module 13, and a classification implementation module 14.
Fig. 2 shows a general schematic diagram of the steps performed in the modules shown in Fig. 1 according to an embodiment of this specification. As shown in Fig. 2, in the data collection module 11, MAC data packets are captured and TCP flows are reassembled, so as to obtain a set of network data flows; the set is divided into a labeled set L and an unlabeled set U, and the data flows in the labeled set L are labeled. In the feature extraction module 12, the flow payload feature, flow statistical feature, and flow entropy feature of each data flow in the labeled set L and the unlabeled set U are extracted and vectorized, to be input into the classifiers in the following steps. The model training module 13 includes: training an individual classifier corresponding to one of the flow payload feature, the flow statistical feature, and the flow entropy feature; obtaining three strengthened classifiers by co-training among the three individual classifiers; and obtaining a strong ensemble classifier by majority voting. In the classification implementation module 14, the individual classifiers or the ensemble classifier obtained in the model training module can be used to classify network data flows.
The flow statistical feature, the flow payload feature, and the flow entropy feature are all network data flow features. The flow statistical feature measures the behavior of a data flow, the flow payload feature captures the semantics of the message content of a data flow, and the flow entropy feature characterizes the purity of a data flow. Combining the three kinds of features allows the data characteristics and behavior of network traffic to be mined more comprehensively and more deeply. Training the classifiers with a co-training algorithm reduces the amount of data that must be labeled, thereby reducing the labeling cost, while judicious selection of unlabeled data samples strengthens classification accuracy. In addition, by means of ensemble learning, the final classification result is obtained by majority voting over multiple individual classifiers, improving classification accuracy and recall.
Embodiments of this specification are described in more detail below.
Fig. 3 is a flow chart showing a method for training a classifier for network data flow classification according to an embodiment of this specification.
The classification accuracy of a network data flow classifier depends to a large extent on the quality of the training set samples. Network data flows are numerous and complex, labeling data samples is time-consuming and labor-intensive, and a large number of samples cannot be labeled. A classifier trained on only a small number of samples does not perform well. How to obtain an accurate classification model with a small number of labeled samples is therefore a technical challenge. The embodiment of this specification shown in Fig. 3 draws on the idea of co-training and achieves this goal with the tri-training (co-training of three classifiers) method.
As shown in Fig. 3, in step 31, a flow payload feature U1, a flow statistical feature U2, and a flow entropy feature U3 are extracted from a data flow. In step 32, the data flow is classified with a classifier Fi corresponding to feature Ui to obtain a first classification result, where Ui is any one of the flow payload feature U1, the flow statistical feature U2, and the flow entropy feature U3, and i = 1, 2, or 3. In step 33, the data flow is classified with a classifier Fj corresponding to feature Uj to obtain a second classification result, where Uj is any one of U1, U2, and U3 that is not Ui, and j = 1, 2, or 3. In step 34, when the first classification result is identical to the second classification result, the data flow and the first classification result are taken as training data for training a classifier Fk corresponding to feature Uk, where Uk is the one of U1, U2, and U3 other than Ui and Uj, and k = 1, 2, or 3.
The flow payload feature characterizes the data payload of a network flow excluding the protocol headers, and contains rich semantic information about the communicated data. After a flow is labeled, its feature word set t = {t1, t2, ..., tn} is extracted, and each flow data message can be represented as a vector over the feature words: V(d) = {(t1, w1), (t2, w2), ..., (tn, wn)}, where wi is the weight coefficient of feature word ti. In this scheme the weight is represented by the tf*idf value, where tf is the term frequency, i.e., the ratio of the number of occurrences of the feature word in a data flow to the number of valid words in that flow, and idf is the inverse document frequency, i.e., the logarithm of the ratio of the number of flows in the training set to the number of flows containing the feature word. The tf*idf value is the product of tf and idf. Feature words whose tf is below the term-frequency threshold and words whose idf is above the inverse-document-frequency threshold can be removed. In this embodiment, the flow payload feature matrix is constructed with flows as row vectors and the tf*idf values of the feature words as column vectors. It should be understood that the method of computing the flow payload feature vector in this embodiment is only exemplary; the flow payload feature vector can also be computed by other methods well known to those skilled in the art.
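Purely by way of illustration, the tf*idf weighting described above could be computed along the following lines. This is a minimal Python sketch; the tokenization, threshold values, and function names are assumptions and not part of the claimed method.

```python
import math
from collections import Counter

def payload_tfidf(flows, tf_threshold=0.0, idf_threshold=float("inf")):
    """flows: list of lists of feature words extracted from each flow's payload.
    Returns one {word: tf*idf} dict per flow, following the weighting above."""
    n_flows = len(flows)
    # number of flows containing each feature word
    doc_freq = Counter(word for flow in flows for word in set(flow))
    # idf: log of (#flows in training set / #flows containing the word)
    idf = {w: math.log(n_flows / df) for w, df in doc_freq.items()}
    vectors = []
    for flow in flows:
        counts = Counter(flow)
        total = sum(counts.values())      # number of valid words in this flow
        vec = {}
        for w, c in counts.items():
            tf = c / total                # term frequency within this flow
            if tf < tf_threshold or idf[w] > idf_threshold:
                continue                  # clean low-tf / high-idf feature words
            vec[w] = tf * idf[w]
        vectors.append(vec)
    return vectors
```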
Table 1 shows examples of the feature words contained in some data flows.
Table 1
In one embodiment, the feature words are stored in a feature word database for use in computing the flow payload feature.
The flow statistical feature is a set of metrics computed by counting the network behavior of a data flow. Common flow statistical features include the flow time interval, the inter-packet time interval within a flow, the packet size, the number of packets, the number of TCP flags, and the activation state. Mathematical statistics of the above features can also be computed; taking packet size as an example, statistics such as the maximum, minimum, mean, and variance of the packet byte counts of the flow can be calculated. Moreover, according to the direction of the data flow, the features can be further divided into forward-flow features and backward-flow features. Common flow statistical features are listed in Table 2. In this embodiment, the flow statistical feature matrix is constructed with flows as row vectors and the statistical feature values as column vectors.
Table 2
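For illustration only, per-flow statistics of the kind listed above (e.g., maximum, minimum, mean, and variance of packet sizes, packet count, inter-packet gaps) could be computed as in the following sketch; the packet record layout and names are assumptions.

```python
import statistics

def flow_statistics(packets):
    """packets: list of (timestamp, size_bytes) tuples for one flow, ordered by time.
    Returns a dict of simple statistical flow features."""
    sizes = [size for _, size in packets]
    times = [ts for ts, _ in packets]
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])] or [0.0]
    return {
        "pkt_count": len(packets),
        "pkt_size_max": max(sizes),
        "pkt_size_min": min(sizes),
        "pkt_size_mean": statistics.mean(sizes),
        "pkt_size_var": statistics.pvariance(sizes),
        "duration": times[-1] - times[0],
        "inter_pkt_gap_mean": statistics.mean(gaps),
    }
```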
The entropy of a flow represents the degree of disorder of the flow data. Its standard calculation formula is well known to those skilled in the art. Specifically, let F denote a data flow message, let f_k denote the set of all k-byte substrings (k-grams) of the message, and let h_k denote the entropy of f_k; the calculation formula is then

h_k = - Σ_{x ∈ f_k} p(x) log2 p(x),

where p(x) is the relative frequency of the k-gram x in f_k. According to this formula, for a flow F containing an m-byte message, its entropy feature set Hm = {h1, h2, ..., hn} can be obtained. In this embodiment, the flow entropy feature matrix is constructed with flows as row vectors and the entropy values for the different values of k as column vectors.
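As a sketch only (the byte-level k-gram granularity and the function names are assumptions), the flow entropy feature could be computed as follows:

```python
import math
from collections import Counter

def kgram_entropy(payload: bytes, k: int) -> float:
    """Shannon entropy h_k of the k-byte substrings (k-grams) of one flow message."""
    if len(payload) < k:
        return 0.0
    grams = [payload[i:i + k] for i in range(len(payload) - k + 1)]
    counts = Counter(grams)
    total = len(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_feature_vector(payload: bytes, ks=(1, 2, 3, 4)) -> list:
    """Entropy values for several k, used as the flow entropy feature vector."""
    return [kgram_entropy(payload, k) for k in ks]
```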
The classifier Fi corresponding to feature Ui shown in Fig. 3 is an individual classifier obtained by training with one kind of feature of the labeled sample set, i.e., the flow statistical feature, the flow entropy feature, or the flow payload feature. In one embodiment, the collected data flows are divided into a labeled set L and an unlabeled set U, and the data flows in the labeled set L are labeled. In one embodiment, network data flows can be labeled with one of ten types: FTP, HTTP, SMTP, IMAP, SSH, POP3, BitTorrent, DNS, KuGoo, and PPLive. In one embodiment, the number of data flows in the labeled set is on the order of hundreds, while the number of data flows in the unlabeled set is on the order of hundreds of thousands; the technical solution of this embodiment therefore greatly reduces the labeling cost. For each data flow in the labeled set L, the flow payload feature is extracted and vectorized together with the labeled type to obtain a training set E1; the flow statistical feature is extracted and vectorized together with the labeled type to obtain a training set E2; and the flow entropy feature is extracted and vectorized together with the labeled type to obtain a training set E3. The training sets E1-E3 are input into Fi (i = 1, 2, 3) respectively, thereby obtaining preliminary classifiers F1, F2, and F3 corresponding to the flow payload feature, the flow statistical feature, and the flow entropy feature, respectively.
In one embodiment, each Fi (i = 1, 2, 3) is a classification model based on at least one of the following algorithms: decision tree, naive Bayes, support vector machine, association rule learning, neural network, and genetic algorithm. In another embodiment, F1-F3 are based on the same algorithm. In yet another embodiment, F1-F3 are based on different algorithms.
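A minimal sketch of this initial training step is given below, using scikit-learn decision trees as one of the algorithms named above; the variable names and the choice of scikit-learn are illustrative assumptions only.

```python
from sklearn.tree import DecisionTreeClassifier

def train_initial_classifiers(E1, E2, E3):
    """E1, E2, E3: (feature_matrix, labels) pairs built from the labeled set L for
    the payload, statistical, and entropy features respectively.
    Returns preliminary classifiers F1, F2, F3."""
    classifiers = []
    for X, y in (E1, E2, E3):
        clf = DecisionTreeClassifier()   # any of the algorithms above could be used
        clf.fit(X, y)
        classifiers.append(clf)
    return classifiers  # F1 (payload), F2 (statistical), F3 (entropy)
```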
Fig. 4 shows a simplified schematic diagram of the tri-training method for training the classifiers shown in Fig. 3. F1-F3 take turns acting as the main classifier, while the other two act as collaborating classifiers that strengthen the training set of the main classifier. Taking F3 as an example, the collaborating classifiers F1 and F2 label each sample in the unlabeled set by classification; if their labeling results are identical, the sample and its label are added to the training set E3 of F3. After the unlabeled set U has been classified with the classifiers F1 and F2, a new training set E3' of F3 is obtained and is then used to retrain F3. New training sets E1' and E2' of F1 and F2 are obtained in the same way. The classifiers F1-F3 are retrained with the new training sets E1', E2', and E3', respectively, thereby obtaining strengthened classifiers F1-F3.
In one embodiment, it is determined whether each strengthened classifier F1-F3 has changed compared with the classifier before this round of training. For example, the strengthened classifiers Fi (i = 1, 2, 3) and Fj (j = 1, 2, 3 and j ≠ i) are used again in the tri-training algorithm, and it is determined whether, after the unlabeled set U has been classified, there still exists a sample u that can be added to the training set Ek (k = 1, 2, 3, k ≠ i and k ≠ j), or whether a new training set Ek' can be obtained. If no such sample u exists, or no new training set Ek' can be obtained, the classifier Fk has not changed compared with the classifier before this round of training. If any of the classifiers F1-F3 has changed, the above method is repeated until F1, F2, and F3 no longer change, at which point the algorithm terminates. After multiple rounds of training and sample updating, three strong classifiers F1-F3 are obtained.
Fig. 5 shows the iterative algorithm of the above tri-training method. As shown in Fig. 5, in step 51, for each Fi (i = 1, 2, 3), the unlabeled set U is classified with the other two classifiers Fj (j = 1, 2, 3 and j ≠ i) and Fk (k = 1, 2, 3, k ≠ i and k ≠ j).
In step 52, if the classifiers Fj (j = 1, 2, 3 and j ≠ i) and Fk (k = 1, 2, 3, k ≠ i and k ≠ j) give identical labels to a sample u in the unlabeled set U, the sample u and its label are added to the training set Ei (i = 1, 2, 3) of Fi and the sample u is removed from the unlabeled set U, thereby obtaining a new training set Ei' (i = 1, 2, 3) for each Fi (i = 1, 2, 3).
In step 53, each classifier Fi (i = 1, 2, 3) is retrained with its new training set Ei' (i = 1, 2, 3).
In step 54, it is determined whether any of F1, F2, and F3 has changed. If any of F1-F3 has changed, steps 51 to 53 are repeated until F1-F3 no longer change, thereby obtaining strong classifiers Fi (i = 1, 2, 3).
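The following sketch shows one way steps 51-54 could be realized in Python. It assumes scikit-learn-style classifiers and three parallel feature views U_views[i] of the unlabeled set, none of which are mandated by the method itself; for brevity it also does not remove the agreed samples from U as step 52 does.

```python
import numpy as np

def tri_training_round(clfs, train_sets, U_views):
    """One round of steps 51-53.
    clfs: [F1, F2, F3]; train_sets: [(X1, y1), (X2, y2), (X3, y3)];
    U_views: [U1, U2, U3], the unlabeled flows in each feature representation.
    Returns (clfs, train_sets, changed), where changed says whether any set grew."""
    changed = False
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        pred_j = clfs[j].predict(U_views[j])
        pred_k = clfs[k].predict(U_views[k])
        agree = pred_j == pred_k                 # samples labeled identically by Fj and Fk
        if agree.any():
            X_i, y_i = train_sets[i]
            X_new = np.vstack([X_i, U_views[i][agree]])
            y_new = np.concatenate([y_i, pred_j[agree]])
            train_sets[i] = (X_new, y_new)
            clfs[i].fit(X_new, y_new)            # step 53: retrain Fi on Ei'
            changed = True
    return clfs, train_sets, changed

# Step 54: repeat rounds until no classifier's training set changes
# while True:
#     clfs, train_sets, changed = tri_training_round(clfs, train_sets, U_views)
#     if not changed:
#         break
```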
However, the classification accuracy of an individual classifier can deviate considerably across different class sets, and overfitting may also occur on a single class set. Ensemble learning trains different classifiers by means such as sample-set sampling, feature-set selection, and classification-algorithm selection, and then aggregates their results by a rule such as majority voting; this not only improves classification accuracy but also effectively avoids overfitting of an individual classifier. In the embodiment of this specification, three distinct individual classifiers are obtained from the same training data by using different classification algorithms or different classification features, and the final classification result of a sample is then obtained by majority voting.
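For illustration, the majority-voting aggregation described above could look like the following sketch; how ties are broken is not specified in the text above, so the fallback here is an assumption.

```python
from collections import Counter

def majority_vote(c1, c2, c3):
    """Aggregate the per-classifier types C1, C2, C3 of one flow by majority voting."""
    votes = Counter([c1, c2, c3])
    label, count = votes.most_common(1)[0]
    return label if count >= 2 else c1   # no majority: fall back (tie-break unspecified)
```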
Fig. 6 shows an apparatus 600 for training a classifier for network data flow classification according to an embodiment of this specification, including: a feature extraction unit 61 configured to extract a flow payload feature U1, a flow statistical feature U2, and a flow entropy feature U3 from a data flow; a first classification unit 62 configured to classify the data flow with a classifier Fi corresponding to feature Ui to obtain a first classification result, where Ui is any one of the flow payload feature U1, the flow statistical feature U2, and the flow entropy feature U3, and i = 1, 2, or 3; a second classification unit 63 configured to classify the data flow with a classifier Fj corresponding to feature Uj to obtain a second classification result, where Uj is any one of U1, U2, and U3 that is not Ui, and j = 1, 2, or 3; and a training data acquisition unit 64 configured to, when the first classification result is identical to the second classification result, take the data flow and the first classification result as training data for training a classifier Fk corresponding to feature Uk, where Uk is the one of U1, U2, and U3 other than Ui and Uj, and k = 1, 2, or 3.
In one embodiment, the apparatus 600 for training a classifier for network data flow classification according to this embodiment further includes an initial training unit 65 configured to, before the data flow is classified with the classifier Fi corresponding to feature Ui, train, with training sets E1, E2, and E3 based on a labeled set of data flows, the classifier F1 corresponding to the flow payload feature U1, the classifier F2 corresponding to the flow statistical feature U2, and the classifier F3 corresponding to the flow entropy feature U3, respectively.
In one embodiment, taking the data flow and the first classification result as training data includes adding the data flow and the first classification result to the current training set of the classifier Fk, so as to obtain a new training set Ek' of the classifier Fk, and the apparatus 600 for training a classifier for network data flow classification according to this embodiment further includes a retraining unit 66 configured to retrain the classifier Fk with the new training set Ek'.
In one embodiment, the apparatus 600 for training a classifier for network data flow classification according to this specification further includes an iteration unit 67 configured to, after the classifiers F1, F2, and F3 have been trained, repeat the operations performed by the above apparatus if any of F1, F2, and F3 has changed, until F1, F2, and F3 no longer change.
In one embodiment, the apparatus 600 for training a classifier for network data flow classification according to this specification further includes an ensemble unit 68 configured to obtain an ensemble classifier by majority voting when F1-F3 no longer change.
The dashed boxes in Fig. 6 indicate that the corresponding units are optional, rather than essential, in this embodiment. For example, the apparatus 600 of this embodiment may omit the initial training unit 65, i.e., the initial individual classifiers Fi need not be obtained by training on the labeled set but can instead be obtained in other ways well known to those skilled in the art. Likewise, the retraining unit 66, the iteration unit 67, and the ensemble unit 68 are optional units rather than essential units of this embodiment. The dashed boxes in Figs. 7 and 8 below have the same meaning.
Fig. 7 shows a method for classifying a network data flow according to an embodiment of this specification, comprising the following steps: step 71, extracting a flow feature Vi from a data flow, where Vi is any one of a flow payload feature V1, a flow statistical feature V2, and a flow entropy feature V3; and step 72, inputting the flow feature Vi into the classifier Fi corresponding to feature Vi obtained by the above method for training a classifier for network data flow classification, to obtain a type Ci of the data flow.
In one embodiment, the method for classifying a network data flow shown in Fig. 7 further includes step 73: for the obtained types C1-C3 of the data flow, determining the final type of the data flow by majority voting.
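As a usage sketch only (the argument names and the tie-breaking behavior are assumptions), steps 71-73 could be chained as follows once the trained classifiers F1-F3 and the three feature vectors of a flow are available:

```python
def classify_flow(v1, v2, v3, F1, F2, F3):
    """Steps 71-72: v1/v2/v3 are the payload, statistical, and entropy feature
    vectors of one flow; each is classified by its corresponding classifier."""
    c1 = F1.predict([v1])[0]
    c2 = F2.predict([v2])[0]
    c3 = F3.predict([v3])[0]
    # Step 73 (optional): final type by majority voting over C1-C3
    final = max({c1, c2, c3}, key=[c1, c2, c3].count)
    return c1, c2, c3, final
```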
Fig. 8 shows an apparatus 800 for classifying a network data flow according to an embodiment of this specification, including: a feature extraction unit 81 configured to extract a flow feature Vi from a data flow, where Vi is any one of a flow payload feature V1, a flow statistical feature V2, and a flow entropy feature V3; and a classification unit 82 configured to input the flow feature Vi into the classifier Fi corresponding to feature Vi obtained by the above method for training a classifier for network data flow classification, to obtain a type Ci of the data flow.
In one embodiment, the apparatus 800 for classifying a network data flow shown in Fig. 8 further includes an ensemble unit 83 configured to, for the obtained types C1-C3 of the data flow, determine the final type of the data flow by majority voting.
In another aspect, an embodiment of this specification further provides a computer-readable storage medium having instruction code stored thereon which, when executed in a computer, causes the computer to perform the above method for training a classifier for network data flow classification.
In yet another aspect, an embodiment of this specification further provides a computer-readable storage medium having computer instruction code stored thereon which, when executed in a computer, causes the computer to perform the above method for classifying a network data flow.
The above methods and apparatuses of the embodiments of this specification can be deployed in any network environment to classify and analyze the traffic of that network environment.
By combining flow statistical features, flow payload features, and flow entropy features, the embodiments of this specification mine the data characteristics and behavior of network traffic comprehensively and in depth. The embodiments of this specification also use a co-training algorithm: with only a small number of labeled samples, suitably labeled samples are introduced to expand the training set, which improves classifier accuracy. In addition, the embodiments of this specification use ensemble learning to aggregate the classification results of the individual classifiers by majority voting, further improving classifier accuracy and recall.
Those of ordinary skill in the art should further appreciate that the exemplary units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Those of ordinary skill in the art may implement the described functions in different ways for each specific application, but such implementations should not be considered as going beyond the scope of this application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium well known in the technical field.
The above specific embodiments further describe in detail the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing is merely specific embodiments of the present invention and is not intended to limit the protection scope of the present invention. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.