CN107967311A - Method and apparatus for classifying network data flows - Google Patents

Method and apparatus for classifying network data flows

Info

Publication number: CN107967311A (application CN201711158988.4A)
Authority: CN (China)
Prior art keywords: feature, classifier, data flow, training, classification
Legal status: Granted; Active
Other versions: CN107967311B (granted)
Other languages: Chinese (zh)
Inventor: 续涛
Original Assignee: Alibaba Group Holding Ltd
Current Assignee: Advanced New Technologies Co Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201711158988.4A
Publication of CN107967311A
Application granted; publication of CN107967311B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2458 - Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 - Approximate or statistical queries
    • G06F16/28 - Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 - Relational databases
    • G06F16/285 - Clustering or classification
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 - Network analysis or design

Abstract

Embodiments of this specification disclose a method and apparatus for training a classifier for network data flow classification. The method comprises the following steps: extracting a flow payload feature U1, a flow statistical feature U2, and a flow entropy feature U3 from a data flow; classifying the data flow using a classifier Fi corresponding to feature Ui to obtain a first classification result; classifying the data flow using a classifier Fj corresponding to feature Uj to obtain a second classification result; and, when the first classification result is identical to the second classification result, using the data flow and the first classification result as training data for training a classifier Fk corresponding to feature Uk, where Uk is the one of the flow payload feature U1, the flow statistical feature U2, and the flow entropy feature U3 other than Ui and Uj.

Description

Method and apparatus for classifying network data flows
Technical field
The present invention relates to the field of network data flow classification, and more specifically to a method and apparatus for training a classifier for network data flow classification, and to a method and apparatus for classifying network data flows.
Background
Network data flows embody the data exchanged between communicating parties and are the cornerstone of cyberspace security work. In-depth study and sound classification of network data flows provide data support and technical backing for network attack detection, topology optimization, network billing management, and the improvement of network services. With the rapid development of the Internet, network data flows have grown quickly and become more diverse and more complex; communication data and instructions have become more complicated, exhibiting different data characteristics and network behaviors under different application models and scenarios. Taking network attack detection as an example, network data flows may contain malicious code, covert channels, protocol vulnerabilities, stolen confidential information, and the like. If such flows can be effectively classified and deeply analyzed, this kind of attack traffic can be detected in time and defended against.
Existing network data flow classification techniques mainly include methods based on matching flow payload features and machine learning methods based on flow statistical features, and they have achieved reasonable classification results. In classification based on flow payload features, the feature words contained in the semantic information of a data flow and the attributes of those feature words are extracted to train a classifier. In classification based on flow statistical features, at least one of the inter-flow time interval, packet-pair time intervals within a flow, packet size, packet count, number of TCP flags, and activation state is extracted as the flow statistical feature for training a classifier. However, a single type of feature cannot fully and effectively describe the network behavior and data characteristics of traffic. Network data flows are numerous and complex, data sample labeling is time-consuming and laborious, and a small number of samples gives poor results in classifier training. In addition, a single classifier is easily biased by its data samples, which affects classification accuracy.
Therefore, a more effective scheme is needed, one that can classify network data flows with high accuracy and high recall at a lower sample-labeling cost and with more comprehensive network data flow features.
Summary of the invention
The present invention is intended to provide it is a kind of with less sample labeling cost and more fully network data flow feature to network Data flow carry out high-accuracy, high recall rate classification method and apparatus.
To achieve the above object, a first aspect of this specification provides a method of training a classifier for network data flow classification, comprising the following steps: extracting a flow payload feature U1, a flow statistical feature U2, and a flow entropy feature U3 from a data flow; classifying the data flow using a classifier Fi corresponding to feature Ui to obtain a first classification result, where Ui is any one of the flow payload feature U1, the flow statistical feature U2, and the flow entropy feature U3, with i = 1, 2, or 3; classifying the data flow using a classifier Fj corresponding to feature Uj to obtain a second classification result, where Uj is any one of the flow payload feature U1, the flow statistical feature U2, and the flow entropy feature U3 other than Ui, with j = 1, 2, or 3; and, when the first classification result is identical to the second classification result, using the data flow and the first classification result as training data for training a classifier Fk corresponding to feature Uk, where Uk is the one of the flow payload feature U1, the flow statistical feature U2, and the flow entropy feature U3 other than Ui and Uj, with k = 1, 2, or 3.
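As a sketch, the agreement rule at the heart of the first aspect can be written as follows. The function name and the stand-in predictors are illustrative, not taken from the patent; real classifiers Fi and Fj would operate on the extracted feature vectors.

```python
def select_training_data(flows, predict_i, predict_j):
    """Collect (flow, label) pairs on which the two view classifiers agree.

    predict_i / predict_j classify a flow using features Ui / Uj; agreed
    samples become training data for the third classifier Fk.
    """
    selected = []
    for flow in flows:
        label_i = predict_i(flow)  # first classification result
        label_j = predict_j(flow)  # second classification result
        if label_i == label_j:     # results identical: keep as training data
            selected.append((flow, label_i))
    return selected

# Toy stand-in predictors over integer flow IDs:
flows = [1, 2, 3, 4]
agreed = select_training_data(flows,
                              lambda f: f % 2,
                              lambda f: 0 if f < 3 else f % 2)
print(agreed)  # [(2, 0), (3, 1), (4, 0)]
```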
In one embodiment, before classifying the data flow using the classifier Fi corresponding to feature Ui, the method further comprises: using training sets E1, E2, and E3 based on a labeled set of data flows, respectively training F1 corresponding to the flow payload feature U1, F2 corresponding to the flow statistical feature U2, and F3 corresponding to the flow entropy feature U3.
In one embodiment, using the data flow and the first classification result as training data comprises adding the data flow and the first classification result to the current training set of classifier Fk, thereby obtaining a new training set Ek' of classifier Fk, and the method further comprises retraining the classifier Fk with the new training set Ek'.
In one embodiment, the above method of training a classifier for network data flow classification further comprises: after classifiers F1, F2, and F3 have all been retrained, if any of F1, F2, and F3 has changed, repeating the method until F1, F2, and F3 no longer change.
In one embodiment, the above method of training a classifier for network data flow classification further comprises: when F1, F2, and F3 all no longer change, deriving an ensemble classifier by majority voting.
In one embodiment, extracting the flow payload feature comprises extracting the tf*idf values of the feature words contained in the semantic information of the data flow, where tf is the term frequency and idf is the inverse document frequency, i.e., the logarithm of the ratio of the number of flows in the training set to the number of flows containing the feature word.
In one embodiment, extracting the flow payload feature comprises removing feature words whose tf is below a term-frequency threshold and feature words whose idf is above an inverse-document-frequency threshold.
In one embodiment, extracting the flow statistical feature comprises extracting at least one of the inter-flow time interval, packet-pair time intervals within a flow, packet size, packet count, number of TCP flags, and activation state.
In one embodiment, each of F1, F2, and F3 is based on at least one of the following algorithms: decision tree, naive Bayes, support vector machine, association rule learning, neural network, and genetic algorithm.
In one embodiment, F1-F3 are based on the same algorithm.
A second aspect of this specification provides an apparatus for training a classifier for network data flow classification, comprising: a feature extraction unit configured to extract a flow payload feature U1, a flow statistical feature U2, and a flow entropy feature U3 from a data flow; a first classification unit configured to classify the data flow using a classifier Fi corresponding to feature Ui to obtain a first classification result, where Ui is any one of the flow payload feature U1, the flow statistical feature U2, and the flow entropy feature U3, with i = 1, 2, or 3; a second classification unit configured to classify the data flow using a classifier Fj corresponding to feature Uj to obtain a second classification result, where Uj is any one of the flow payload feature U1, the flow statistical feature U2, and the flow entropy feature U3 other than Ui, with j = 1, 2, or 3; and a training data acquisition unit configured to, when the first classification result is identical to the second classification result, use the data flow and the first classification result as training data for training a classifier Fk corresponding to feature Uk, where Uk is the one of the flow payload feature U1, the flow statistical feature U2, and the flow entropy feature U3 other than Ui and Uj, with k = 1, 2, or 3.
In one embodiment, the above apparatus further comprises an initial training unit configured to, before the data flow is classified using the classifier Fi corresponding to feature Ui, use training sets E1, E2, and E3 based on a labeled set of data flows to respectively train F1 corresponding to the flow payload feature U1, F2 corresponding to the flow statistical feature U2, and F3 corresponding to the flow entropy feature U3.
In one embodiment, using the data flow and the first classification result as training data comprises adding the data flow and the first classification result to the current training set of classifier Fk, thereby obtaining a new training set Ek' of classifier Fk, and the apparatus further comprises a retraining unit configured to retrain the classifier Fk with the new training set Ek'.
In one embodiment, the above apparatus further comprises an iteration unit configured to, after classifiers F1, F2, and F3 have all been retrained, if any of F1, F2, and F3 has changed, repeat the operations performed by the apparatus until F1, F2, and F3 no longer change.
In one embodiment, the above apparatus further comprises an ensemble unit configured to, when F1-F3 no longer change, derive an ensemble classifier by majority voting.
A third aspect of this specification provides a computer-readable storage medium having instruction code stored thereon which, when executed in a computer, causes the computer to perform the above method of training a classifier for network data flow classification.
A fourth aspect of this specification provides a method of classifying a network data flow, comprising: extracting a flow feature Vi of the data flow, where Vi is any one of a flow payload feature V1, a flow statistical feature V2, and a flow entropy feature V3; and inputting the flow feature Vi into the classifier Fi corresponding to feature Vi, obtained by the above method of training a classifier for network data flow classification, to obtain the class Ci of the data flow.
A fifth aspect of this specification provides an apparatus for classifying a network data flow, comprising: a feature extraction unit configured to extract a flow feature Vi of the data flow, where Vi is any one of a flow payload feature V1, a flow statistical feature V2, and a flow entropy feature V3; and a classification unit configured to input the flow feature Vi into the classifier Fi corresponding to feature Vi, obtained by the above classifier training method for network data flow classification, to obtain the class Ci of the data flow.
A sixth aspect of this specification provides a computer-readable storage medium having instruction code stored thereon which, when executed in a computer, causes the computer to perform the above method of classifying a network data flow.
By combining flow statistical features, flow payload features, and flow entropy features, the embodiments of this specification mine the data characteristics and behavioral expressions of network traffic comprehensively and in depth. Using a co-training algorithm with a small number of labeled samples, unlabeled samples are reasonably introduced as labeled samples to expand the training set and enhance classifier accuracy. In addition, borrowing the idea of ensemble learning, the classification results of the individual classifiers are aggregated by majority voting, further improving classifier accuracy and recall.
Brief description of the drawings
Embodiments of this specification are described below with reference to the accompanying drawings to make them clearer:
Fig. 1 shows a general schematic diagram of the modules included in an embodiment of this specification;
Fig. 2 shows a general schematic diagram of the steps performed in the modules shown in Fig. 1;
Fig. 3 is a flowchart of a method of training a classifier for network data flow classification according to an embodiment of this specification;
Fig. 4 shows a simplified schematic of the Tri-training method for training the classifiers shown in Fig. 3;
Fig. 5 shows the iterative algorithm of the Tri-training method;
Fig. 6 shows an apparatus for training a classifier for network data flow classification according to an embodiment of this specification;
Fig. 7 shows a method of classifying a network data flow according to an embodiment of this specification; and
Fig. 8 shows an apparatus for classifying a network data flow according to an embodiment of this specification.
Detailed description of embodiments
Specific embodiments of this specification are described below with reference to the accompanying drawings.
Fig. 1 shows a general schematic diagram of the modules included in the technical solution of an embodiment of this specification. The technical solution comprises four modules: a data collection module 11, a feature extraction module 12, a model training module 13, and a classification implementation module 14.
Fig. 2 shows a general schematic diagram of the steps performed in the modules shown in Fig. 1.
As shown in Fig. 2, the data collection module 11 captures MAC packets and performs TCP flow reassembly to obtain a set of network data flows; this set is divided into a labeled set L and an unlabeled set U, and the data flows in the labeled set L are labeled. The feature extraction module 12 extracts the flow payload feature, flow statistical feature, and flow entropy feature of each data flow in the labeled set L and the unlabeled set U and vectorizes them, for input to the classifiers in the following steps. The model training module 13 performs: training of an individual classifier corresponding to one of the flow payload feature, the flow statistical feature, and the flow entropy feature; obtaining three strong classifiers by co-training among the three individual classifiers; and obtaining a strong ensemble classifier by majority voting. The classification implementation module 14 can classify network data flows using the individual classifiers or the ensemble classifier obtained in the model training module.
Flow statistical features, flow payload features, and flow entropy features all belong to network data flow features. The flow statistical feature is a behavioral measure of the data flow, the flow payload feature captures the message-content semantics of the data flow, and the flow entropy feature captures the purity of the data flow. Combining the three kinds of features allows the data characteristics and behavioral expressions of network traffic to be mined more comprehensively and deeply. Training the classifiers with a co-training algorithm reduces the amount of data that needs to be labeled, lowering labeling cost, while reasonably selecting unlabeled data samples strengthens classification accuracy. Furthermore, by means of ensemble learning, the final classification result is obtained by majority voting among multiple individual classifiers, improving classification accuracy and recall.
The embodiments of this specification are described in more detail below.
Fig. 3 is a flowchart of a method of training a classifier for network data flow classification according to an embodiment of this specification.
The classification accuracy of a network data flow classifier depends largely on the quality of the training samples, yet network data flows are numerous, complex, and enormous in quantity; labeling data samples is time-consuming and laborious, and large numbers of samples cannot be labeled. Classifiers trained on a small number of samples perform poorly. How to obtain an accurate classification model using a small number of labeled samples is therefore a technical challenge. The embodiment of this specification shown in Fig. 3 borrows the idea of co-training and achieves this purpose using the Tri-training (co-training of three classifiers) method.
As shown in Fig. 3, in step 31, a flow payload feature U1, a flow statistical feature U2, and a flow entropy feature U3 are extracted from a data flow. In step 32, the data flow is classified using a classifier Fi corresponding to feature Ui to obtain a first classification result, where Ui is any one of the flow payload feature U1, the flow statistical feature U2, and the flow entropy feature U3, with i = 1, 2, or 3. In step 33, the data flow is classified using a classifier Fj corresponding to feature Uj to obtain a second classification result, where Uj is any one of the flow payload feature U1, the flow statistical feature U2, and the flow entropy feature U3 other than Ui, with j = 1, 2, or 3. In step 34, when the first classification result is identical to the second classification result, the data flow and the first classification result are used as training data for training a classifier Fk corresponding to feature Uk, where Uk is the one of the flow payload feature U1, the flow statistical feature U2, and the flow entropy feature U3 other than Ui and Uj, with k = 1, 2, or 3.
The flow payload feature is the data payload characteristic of a network flow excluding protocol headers, and contains the rich semantic information of the communication data. After the flows are labeled, their feature word set t = {t1, t2, ... tn} is extracted, and each flow data message can be expressed as a vector over the feature words: V(d) = {(t1, w1), (t2, w2), ... (tn, wn)}, where wi is the weight coefficient of feature ti. This scheme represents the weight with the tf*idf value, where tf is the term frequency, i.e., the ratio of the number of occurrences of the feature word in a given data flow to the number of effective words in that flow, and idf is the inverse document frequency, i.e., the logarithm of the ratio of the number of flows in the training set to the number of flows containing the feature word. The tf*idf value is the product of tf and idf. Feature words whose tf is below the term-frequency threshold and words whose idf is above the inverse-document-frequency threshold are cleaned out. This embodiment constructs the flow payload feature matrix with flow data as row vectors and the tf*idf values of the feature words as column vectors. It should be understood that the computation of the flow payload feature vector in this embodiment is merely exemplary; the flow payload feature vector may also be computed with other methods well known to those skilled in the art.
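A minimal sketch of the tf*idf weight as defined above (tf: occurrences of the word over effective words in the flow; idf: log of total flows over flows containing the word). The flow contents and names are illustrative, not from the patent.

```python
import math

def tf_idf(word, flow_words, all_flows):
    """tf*idf weight of a feature word in one flow, per the definitions above."""
    tf = flow_words.count(word) / len(flow_words)        # term frequency
    containing = sum(1 for f in all_flows if word in f)  # flows containing the word
    idf = math.log(len(all_flows) / containing)          # inverse document frequency
    return tf * idf

flows = [["GET", "HTTP", "Host"],
         ["POST", "HTTP", "Host"],
         ["SSH", "banner", "key"]]
print(round(tf_idf("SSH", flows[2], flows), 4))   # 0.3662: rare word, high weight
print(round(tf_idf("HTTP", flows[0], flows), 4))  # 0.1352: appears in 2 of 3 flows
```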
Table 1 shows examples of the feature words contained in some data flows.
Table 1
In one embodiment, the feature words are stored in a feature word database for use in computing the flow payload feature.
The flow statistical feature is a set of measures calculated by counting the network behavior of a data flow. Common flow statistical features include the inter-flow time interval, packet-pair time intervals within a flow, packet size, packet count, number of TCP flags, activation state, and the like. Mathematical expressions of the above features can also be calculated; taking packet size as an example, statistical values such as the maximum, minimum, average, and variance of the packet byte counts of a flow can be computed. Meanwhile, according to the direction of the data flow, the features can be further divided into forward-flow features and backward-flow features. Common flow statistical features are shown in Table 2. This embodiment constructs the flow statistical feature matrix with flow data as row vectors and the statistical features as column vectors.
Table 2
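The per-feature statistics mentioned above (maximum, minimum, average, variance of packet sizes) can be sketched as follows; the function and field names are illustrative, not from the patent.

```python
from statistics import mean, pvariance

def packet_size_stats(packet_sizes):
    """Summary statistics of a flow's packet byte counts, as described above."""
    return {
        "max": max(packet_sizes),
        "min": min(packet_sizes),
        "mean": mean(packet_sizes),
        "variance": pvariance(packet_sizes),  # population variance over the flow
    }

stats = packet_size_stats([60, 1500, 1500, 40])
print(stats["max"], stats["min"])  # 1500 40
print(stats["mean"], stats["variance"])
```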
The entropy of a flow represents the degree of confusion of the flow data, and its standard calculation formula is known to those skilled in the art. Specifically, let F denote a data flow message, let fk denote the set of all contiguous k-character substrings of the message, and let hk denote the entropy corresponding to fk; then hk = -Σ p(x) log p(x), summed over all x in fk, where p(x) is the frequency of occurrence of substring x in fk.
According to this formula, for a flow F containing an m-byte message, its entropy feature set Hm = {h1, h2, ... hn} can be obtained. This embodiment constructs the flow entropy feature matrix with flow data as row vectors and the entropies for different k as column vectors.
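Under the Shannon-entropy reading assumed above (the patent's own formula is not reproduced in this text, so the standard form is an assumption), the entropy hk over the k-byte substrings of a payload can be sketched as:

```python
import math
from collections import Counter

def flow_entropy(payload: bytes, k: int = 1) -> float:
    """Shannon entropy h_k over the set f_k of contiguous k-byte substrings.

    Assumed standard form; substring probabilities are their observed
    frequencies within the payload.
    """
    grams = [payload[i:i + k] for i in range(len(payload) - k + 1)]
    counts = Counter(grams)
    total = len(grams)
    s = sum((c / total) * math.log2(c / total) for c in counts.values())
    return 0.0 - s

print(flow_entropy(b"aaaa"))  # 0.0 -> a "pure" flow has minimal confusion
print(flow_entropy(b"abcd"))  # 2.0 -> maximally mixed over four symbols
```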
The classifier Fi corresponding to feature Ui shown in Fig. 3 refers to an individual classifier obtained by training with one kind of feature, among the flow payload feature, the flow statistical feature, and the flow entropy feature, of the labeled sample set. In one embodiment, the collected data flows are divided into a labeled set L and an unlabeled set U, and the data flows in the labeled set L are labeled. In one embodiment, network data flows can be labeled among ten types in total: FTP, HTTP, SMTP, IMAP, SSH, POP3, BitTorrent, DNS, KuGoo, and PPLive. In one embodiment, the number of data flows in the labeled set is on the order of hundreds, while the number in the unlabeled set is on the order of hundreds of thousands; clearly, the technical solution of this embodiment greatly reduces labeling cost. For the data flows in the labeled set L, the flow payload feature is extracted and vectorized together with the labeled type to obtain a training set E1; the flow statistical feature is extracted and vectorized together with the labeled type to obtain a training set E2; and the flow entropy feature is extracted and vectorized together with the labeled type to obtain a training set E3. The training sets E1-E3 are input into Fi (i = 1..3) respectively, thereby obtaining preliminary classifiers F1-F3 corresponding to the flow payload feature, the flow statistical feature, and the flow entropy feature respectively.
In one embodiment, Fi (i = 1..3) is a classification model based on at least one of the following algorithms: decision tree, naive Bayes, support vector machine, association rule learning, neural network, and genetic algorithm. In another embodiment, F1-F3 are based on the same algorithm. In another embodiment, F1-F3 are based on different algorithms.
Fig. 4 shows a simplified schematic of the Tri-training method for training the classifiers shown in Fig. 3. F1-F3 can take turns serving as the main classifier, with the other two acting as collaborating classifiers to augment the main classifier's training set. Taking F3 as an example, the collaborating classifiers F1 and F2 each label every sample in the unlabeled set; if their labeling results are identical, the sample and its label are added to the training set E3 of F3. After the unlabeled set U has been classified with F1 and F2, a new training set E3' of F3 is obtained and is subsequently used to retrain F3. New training sets E1' and E2' for F1 and F2 are obtained in the same way. Retraining F1-F3 with the new training sets E1', E2', and E3' respectively yields the enhanced classifiers F1-F3.
In one embodiment, it is judged whether the enhanced classifiers F1-F3 have changed compared with before this round of training. For example, the enhanced classifiers Fi (i = 1..3) and Fj (j = 1..3, j ≠ i) are used again in the Tri-training algorithm to judge whether, after classifying the unlabeled set U, there still exists a sample u that can be added to the training set Ek (k = 1..3, k ≠ i, k ≠ j), or whether a new training set Ek' can be obtained. If there is no such sample u, or no new training set Ek' can be obtained, then Fk has not changed compared with before this round of training. If any of the classifiers F1-F3 has changed, the above method is repeated; the algorithm ends when F1, F2, and F3 all no longer change. Through multiple rounds of training and sample-update iterations, three strong classifiers F1-F3 are obtained.
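The round-based co-training described above can be condensed into the following skeleton. Feature extraction and model retraining are abstracted away: the three classifiers are stand-in prediction functions and the retraining step is only marked by a comment, so this is a sketch of the control flow under those assumptions, not a full implementation.

```python
def tri_train(predictors, training_sets, unlabeled):
    """Iterate until no training set grows; returns the number of rounds.

    predictors: three functions flow -> label, one per feature view U1-U3.
    training_sets: three lists of (flow, label) pairs, E1-E3.
    unlabeled: list of unlabeled flows U (consumed as samples are adopted).
    """
    changed, rounds = True, 0
    while changed:
        changed = False
        rounds += 1
        for k in range(3):
            i, j = [n for n in range(3) if n != k]
            for flow in list(unlabeled):
                if predictors[i](flow) == predictors[j](flow):  # F_i, F_j agree
                    training_sets[k].append((flow, predictors[i](flow)))
                    unlabeled.remove(flow)
                    changed = True
            # In the full method, F_k is retrained here on its grown set E_k'.
    return rounds

E = [[], [], []]
U = [5, 6]
rounds = tri_train([lambda f: f % 2] * 3, E, U)
print(rounds, E[0], U)  # 2 [(5, 1), (6, 0)] []
```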
Fig. 5 shows the iterative algorithm of the above Tri-training method. As shown in Fig. 5, in step 51, for each Fi (i = 1..3), the unlabeled set U is classified using the other two classifiers Fj (j = 1..3, j ≠ i) and Fk (k = 1..3, k ≠ i, k ≠ j).
In step 52, if the classifiers Fj (j = 1..3, j ≠ i) and Fk (k = 1..3, k ≠ i, k ≠ j) give identical labels for a sample u in the unlabeled set U, the sample u and its label are added to the training set Ei (i = 1..3) of Fi, and the sample u is removed from the unlabeled set U, thereby obtaining a new training set Ei' (i = 1..3) for each Fi (i = 1..3).
In step 53, each classifier Fi (i = 1..3) is retrained with its new training set Ei' (i = 1..3).
In step 54, it is judged whether any of F1, F2, and F3 has changed. If any of F1-F3 has changed, steps 51 to 53 are repeated until F1-F3 no longer change, so as to obtain the strong classifiers Fi (i = 1..3).
However, the classification accuracy of a single classifier may deviate considerably across different class sets, and overfitting may occur on a single class set. Ensemble learning trains different classifiers by means of sample-set sampling, feature-set selection, classification-algorithm selection, and the like, and then aggregates their results using principles such as majority voting; this not only improves classification accuracy but also effectively avoids the overfitting of an individual classifier. In this embodiment, three distinct individual classifiers are obtained from the same training set using different classification algorithms or classification features, and the final classification result of a sample is then obtained by majority voting.
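The majority-vote aggregation can be sketched in a few lines; the label strings are illustrative, not from the patent.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most individual classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Three individual classifiers vote on one flow:
print(majority_vote(["HTTP", "SSH", "HTTP"]))  # HTTP
```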
Fig. 6 shows the device 600 for being used for the grader that network data flow is classified according to the training of this specification embodiment, bag Include:Feature extraction unit 61, is configured to:Extract current load feature U1, statistical flow characteristic U2, and stream entropy respectively from data flow Value tag U3;First taxon 62, is configured to:The data flow is divided using grader Fi corresponding with feature Ui Class, obtains the first classification results, and wherein Ui is above-mentioned current load feature U1, statistical flow characteristic U2, and stream entropy feature U3's is any One, wherein i=1,2 or 3;Second taxon 63, is configured to:Using grader Fj corresponding with feature Uj to the data Stream is classified, and obtains the second classification results, and wherein Uj is above-mentioned current load feature U1, statistical flow characteristic U2, flows entropy feature It is not equal to any one of Ui, wherein j=1,2 or 3 in U3;Training data acquiring unit 64, is configured to:In the first classification results In the case of identical with the second classification results, using the data flow and first classification results as training data, for instructing Practice grader Fk corresponding with feature Uk, wherein Uk is above-mentioned current load feature U1, statistical flow characteristic U2, is flowed in entropy feature U3 Except Ui, that outside Uj, wherein k=1,2 or 3.
In one embodiment, the device 600 for training classifiers for network data flow classification according to this specification embodiment further includes an initial training unit 65 configured to: before the data flow is classified using the classifier Fi corresponding to feature Ui, respectively train F1 corresponding to the flow payload feature U1, F2 corresponding to the flow statistics feature U2, and F3 corresponding to the flow entropy feature U3 with training sets E1, E2, and E3 based on a set of calibrated data flows.
In one embodiment, taking the data flow and the first classification result as training data includes adding the data flow and the first classification result to the current training set of classifier Fk, thereby obtaining a new training set Ek' of classifier Fk; the device 600 further includes a retraining unit 66 configured to retrain classifier Fk with the new training set Ek'.
In one embodiment, the device 600 for training classifiers for network data flow classification further includes an iteration unit 67 configured to: after classifiers F1, F2, and F3 have been trained, if any of F1, F2, and F3 has changed, repeat the operations performed by the above units until F1, F2, and F3 no longer change.
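The iteration performed by unit 67 — rerun a training round until no classifier changes — might be sketched as follows; `train_round` and the round limit are hypothetical placeholders:

```python
def train_until_stable(train_round, classifiers, max_rounds=100):
    """Repeat the co-training round until F1-F3 stop changing.

    train_round: function (classifiers) -> (new_classifiers, changed),
    where `changed` reports whether any classifier was updated.
    A round limit guards against non-convergence.
    """
    for _ in range(max_rounds):
        classifiers, changed = train_round(classifiers)
        if not changed:
            break
    return classifiers

# A toy round that updates twice and then stabilizes:
calls = []
def round_fn(cs):
    calls.append(1)
    return cs, len(calls) < 3

assert train_until_stable(round_fn, {"F1": None}) == {"F1": None}
assert len(calls) == 3  # two changing rounds plus one stable round
```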
In one embodiment, the device 600 for training classifiers for network data flow classification further includes an integration unit 68 configured to derive an ensemble classifier by the majority voting principle once F1-F3 no longer change.
The dashed boxes in Fig. 6 indicate units that are optional rather than required in the embodiment. For example, the device 600 in this specification embodiment may omit the initial training unit 65; that is, instead of obtaining the initial single classifiers Fi by training on the calibration set, the initial single classifiers Fi may be obtained in other ways known to those skilled in the art. Similarly, the retraining unit 66, the iteration unit 67, and the integration unit 68 are also optional rather than required units in the embodiment. The dashed boxes in Fig. 7 and Fig. 8 below carry the same meaning.
Fig. 7 shows a method of classifying a network data flow according to an embodiment of this specification, including the following steps: step 71, extracting a flow feature Vi from a data flow, where Vi is any one of a flow payload feature V1, a flow statistics feature V2, and a flow entropy feature V3; and step 72, inputting the flow feature Vi into the classifier Fi corresponding to feature Vi obtained by the above-described method of training classifiers for network data flow classification, to obtain a type Ci of the data flow.
In one embodiment, the method of classifying a network data flow shown in Fig. 7 further includes step 73: for the obtained types C1-C3 of the data flow, deriving the final type of the data flow by the majority voting principle.
Fig. 8 shows a device 800 for classifying a network data flow according to an embodiment of this specification, including: a feature extraction unit 81 configured to extract a flow feature Vi from a data flow, where Vi is any one of a flow payload feature V1, a flow statistics feature V2, and a flow entropy feature V3; and a classification unit 82 configured to input the flow feature Vi into the classifier Fi corresponding to feature Vi obtained by the above-described method of training classifiers for network data flow classification, to obtain a type Ci of the data flow.
In one embodiment, the device 800 for classifying a network data flow shown in Fig. 8 further includes an integration unit 83 configured to, for the obtained types C1-C3 of the data flow, derive the final type of the data flow by the majority voting principle.
In another aspect, an embodiment of this specification also provides a computer-readable storage medium having instruction code stored thereon; when the instruction code is executed in a computer, it causes the computer to perform the above method of training classifiers for network data flow classification.
In yet another aspect, an embodiment of this specification also provides a computer-readable storage medium having computer instruction code stored thereon; when the instruction code is executed in a computer, it causes the computer to perform the above method of classifying a network data flow.
The above methods and devices of the embodiments of this specification can be deployed in any network environment to classify and analyze the traffic of that environment.
By combining flow statistics features, flow payload features, and flow entropy features, the embodiments of this specification comprehensively and deeply mine the data characteristics and behavior of network traffic. The embodiments also use a co-training algorithm that, starting from only a small number of calibrated samples, reasonably introduces newly labeled samples to expand the training set, improving classifier accuracy. In addition, by using an ensemble learning algorithm that aggregates the results of the single classifiers with the majority voting principle, the embodiments further improve classifier precision and recall.
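For the payload feature, the embodiments (see claim 6) define idf over flows rather than documents: idf is the logarithm of the ratio of the number of flows in the training set to the number of flows containing the feature word. A sketch of that computation, with hypothetical tokenized payloads; the function name and tokens are illustrative only:

```python
import math
from collections import Counter

def payload_tfidf(flow_tokens, training_flows):
    """tf*idf of each feature word in one flow's payload, where
    idf = log(total flows / flows containing the word), per claim 6.

    flow_tokens: feature words of one flow's payload.
    training_flows: list of token lists, one per flow in the training set.
    """
    n = len(training_flows)
    tf = Counter(flow_tokens)
    return {
        word: count * math.log(n / sum(1 for f in training_flows if word in f))
        for word, count in tf.items()
    }

# "GET" appears in 1 of 2 training flows; "host" in both (idf = log(1) = 0):
flows = [["GET", "host"], ["POST", "host"]]
scores = payload_tfidf(["GET", "host"], flows)
assert scores["host"] == 0.0 and scores["GET"] > 0
```

Words with very low tf or very high idf would then be dropped by the thresholding described in claim 7.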
Those of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate this interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the particular application and the design constraints of the technical solution. Those of ordinary skill in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above specific embodiments further describe the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing are merely specific embodiments of the present invention and are not intended to limit the protection scope of the present invention; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (26)

1. A method of training classifiers for network data flow classification, comprising the following steps:
extracting a flow payload feature U1, a flow statistics feature U2, and a flow entropy feature U3 from a data flow;
classifying the data flow using a classifier Fi corresponding to feature Ui to obtain a first classification result, wherein Ui is any one of the flow payload feature U1, the flow statistics feature U2, and the flow entropy feature U3, wherein i = 1, 2, or 3;
classifying the data flow using a classifier Fj corresponding to feature Uj to obtain a second classification result, wherein Uj is any one of the flow payload feature U1, the flow statistics feature U2, and the flow entropy feature U3 that is not equal to Ui, wherein j = 1, 2, or 3; and
in the case that the first classification result is identical to the second classification result, taking the data flow and the first classification result as training data for training a classifier Fk corresponding to feature Uk, wherein Uk is the one of the flow payload feature U1, the flow statistics feature U2, and the flow entropy feature U3 other than Ui and Uj, wherein k = 1, 2, or 3.
2. The method of training classifiers for network data flow classification according to claim 1, further comprising, before said classifying the data flow using the classifier Fi corresponding to feature Ui: respectively training, with training sets E1, E2, and E3 based on a set of calibrated data flows, F1 corresponding to the flow payload feature U1, F2 corresponding to the flow statistics feature U2, and F3 corresponding to the flow entropy feature U3.
3. The method of training classifiers for network data flow classification according to claim 1 or 2, wherein taking the data flow and the first classification result as training data comprises adding the data flow and the first classification result to the current training set of classifier Fk, thereby obtaining a new training set Ek' of classifier Fk, and the method further comprises retraining classifier Fk with the new training set Ek'.
4. The method of training classifiers for network data flow classification according to claim 3, further comprising, after classifiers F1, F2, and F3 have been retrained: if any of F1, F2, and F3 has changed, repeating the method until F1, F2, and F3 no longer change.
5. The method of training classifiers for network data flow classification according to claim 4, further comprising, when F1, F2, and F3 all no longer change: deriving an ensemble classifier by the majority voting principle.
6. The method of training classifiers for network data flow classification according to any one of claims 1-5, wherein extracting the flow payload feature comprises extracting tf*idf values of feature words contained in the semantic information of the data flow, wherein tf is the term frequency and idf is the inverse document frequency, i.e., the logarithm of the ratio of the number of flows in the training set to the number of flows containing the feature word.
7. The method of training classifiers for network data flow classification according to claim 6, wherein extracting the flow payload feature comprises removing feature words whose tf is below a term frequency threshold and feature words whose idf is above an inverse document frequency threshold.
8. The method of training classifiers for network data flow classification according to any one of claims 1-5, wherein extracting the flow statistics feature comprises extracting at least one of flow time interval, packet-pair time interval within the flow, packet size, packet count, number of TCP flags, and activation state.
9. The method of training classifiers for network data flow classification according to any one of claims 1-8, wherein each of classifiers F1, F2, and F3 is based on at least one of the following algorithms: decision tree, naive Bayes, support vector machine, association rule learning, neural network, and genetic algorithm.
10. The method of training classifiers for network data flow classification according to claim 9, wherein F1, F2, and F3 are based on the same algorithm.
11. A device for training classifiers for network data flow classification, comprising:
a feature extraction unit configured to extract a flow payload feature U1, a flow statistics feature U2, and a flow entropy feature U3 from a data flow;
a first classification unit configured to classify the data flow using a classifier Fi corresponding to feature Ui to obtain a first classification result, wherein Ui is any one of the flow payload feature U1, the flow statistics feature U2, and the flow entropy feature U3, wherein i = 1, 2, or 3;
a second classification unit configured to classify the data flow using a classifier Fj corresponding to feature Uj to obtain a second classification result, wherein Uj is any one of the flow payload feature U1, the flow statistics feature U2, and the flow entropy feature U3 that is not equal to Ui, wherein j = 1, 2, or 3; and
a training data acquisition unit configured to, in the case that the first classification result is identical to the second classification result, take the data flow and the first classification result as training data for training a classifier Fk corresponding to feature Uk, wherein Uk is the one of the flow payload feature U1, the flow statistics feature U2, and the flow entropy feature U3 other than Ui and Uj, wherein k = 1, 2, or 3.
12. The device for training classifiers for network data flow classification according to claim 11, further comprising an initial training unit configured to: before the data flow is classified using the classifier Fi corresponding to feature Ui, respectively train, with training sets E1, E2, and E3 based on a set of calibrated data flows, F1 corresponding to the flow payload feature U1, F2 corresponding to the flow statistics feature U2, and F3 corresponding to the flow entropy feature U3.
13. The device for training classifiers for network data flow classification according to claim 11 or 12, wherein taking the data flow and the first classification result as training data comprises adding the data flow and the first classification result to the current training set of classifier Fk, thereby obtaining a new training set Ek' of classifier Fk, and the device further comprises a retraining unit configured to retrain classifier Fk with the new training set Ek'.
14. The device for training classifiers for network data flow classification according to claim 13, further comprising an iteration unit configured to: after classifiers F1, F2, and F3 have all been retrained, if any of F1, F2, and F3 has changed, repeat the operations performed by the device until F1, F2, and F3 no longer change.
15. The device for training classifiers for network data flow classification according to claim 14, further comprising an integration unit configured to derive an ensemble classifier by the majority voting principle when F1, F2, and F3 no longer change.
16. The device for training classifiers for network data flow classification according to any one of claims 11-15, wherein extracting the flow payload feature comprises extracting tf*idf values of feature words contained in the semantic information of the data flow, wherein tf is the term frequency and idf is the inverse document frequency, i.e., the logarithm of the ratio of the number of flows in the training set to the number of flows containing the feature word.
17. The device for training classifiers for network data flow classification according to claim 16, wherein extracting the flow payload feature comprises removing feature words whose tf is below a term frequency threshold and feature words whose idf is above an inverse document frequency threshold.
18. The device for training classifiers for network data flow classification according to any one of claims 11-15, wherein extracting the flow statistics feature comprises extracting at least one of flow time interval, packet-pair time interval within the flow, packet size, packet count, number of TCP flags, and activation state.
19. The device for training classifiers for network data flow classification according to any one of claims 11-18, wherein each of classifiers F1, F2, and F3 is based on at least one of the following algorithms: decision tree, naive Bayes, support vector machine, association rule learning, neural network, and genetic algorithm.
20. The device for training classifiers for network data flow classification according to claim 19, wherein F1, F2, and F3 are based on the same algorithm.
21. A computer-readable storage medium having instruction code stored thereon, wherein the instruction code, when executed in a computer, causes the computer to perform the method according to any one of claims 1-10.
22. A method of classifying a network data flow, comprising:
extracting a flow feature Vi from a data flow, wherein Vi is any one of a flow payload feature V1, a flow statistics feature V2, and a flow entropy feature V3; and
inputting the flow feature Vi into the classifier Fi corresponding to feature Vi obtained by the method according to any one of claims 1-10, to obtain a type Ci of the data flow.
23. The method of classifying a network data flow according to claim 22, further comprising: for the obtained types C1-C3 of the data flow, deriving the final type of the data flow by the majority voting principle.
24. A device for classifying a network data flow, comprising:
a feature extraction unit configured to extract a flow feature Vi from a data flow, wherein Vi is any one of a flow payload feature V1, a flow statistics feature V2, and a flow entropy feature V3; and
a classification unit configured to input the flow feature Vi into the classifier Fi corresponding to feature Vi obtained by the method according to any one of claims 1-10, to obtain a type Ci of the data flow.
25. The device for classifying a network data flow according to claim 24, further comprising an integration unit configured to, for the obtained types C1-C3 of the data flow, derive the final type of the data flow by the majority voting principle.
26. A computer-readable storage medium having instruction code stored thereon, wherein the instruction code, when executed in a computer, causes the computer to perform the method according to claim 22 or 23.
CN201711158988.4A 2017-11-20 2017-11-20 Method and device for classifying network data streams Active CN107967311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711158988.4A CN107967311B (en) 2017-11-20 2017-11-20 Method and device for classifying network data streams

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711158988.4A CN107967311B (en) 2017-11-20 2017-11-20 Method and device for classifying network data streams

Publications (2)

Publication Number Publication Date
CN107967311A true CN107967311A (en) 2018-04-27
CN107967311B CN107967311B (en) 2021-06-29

Family

ID=62001312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711158988.4A Active CN107967311B (en) 2017-11-20 2017-11-20 Method and device for classifying network data streams

Country Status (1)

Country Link
CN (1) CN107967311B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109309630A (en) * 2018-09-25 2019-02-05 深圳先进技术研究院 A kind of net flow assorted method, system and electronic equipment
CN109359109A (en) * 2018-08-23 2019-02-19 阿里巴巴集团控股有限公司 A kind of data processing method and system calculated based on distributed stream
CN110059726A (en) * 2019-03-22 2019-07-26 中国科学院信息工程研究所 The threat detection method and device of industrial control system
CN110781950A (en) * 2019-10-23 2020-02-11 新华三信息安全技术有限公司 Message processing method and device
CN112380406A (en) * 2020-11-15 2021-02-19 杭州光芯科技有限公司 Real-time network traffic classification method based on crawler technology
CN112423324A (en) * 2021-01-22 2021-02-26 深圳市科思科技股份有限公司 Wireless intelligent decision communication method, device and system
WO2021047401A1 (en) * 2019-09-10 2021-03-18 华为技术有限公司 Service classification method and apparatus, and internet system
CN112836214A (en) * 2019-11-22 2021-05-25 南京聚铭网络科技有限公司 Communication protocol hidden channel detection method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100034102A1 (en) * 2008-08-05 2010-02-11 At&T Intellectual Property I, Lp Measurement-Based Validation of a Simple Model for Panoramic Profiling of Subnet-Level Network Data Traffic
CN102315974A (en) * 2011-10-17 2012-01-11 北京邮电大学 Stratification characteristic analysis-based method and apparatus thereof for on-line identification for TCP, UDP flows
US8311956B2 (en) * 2009-08-11 2012-11-13 At&T Intellectual Property I, L.P. Scalable traffic classifier and classifier training system
CN103870751A (en) * 2012-12-18 2014-06-18 中国移动通信集团山东有限公司 Method and system for intrusion detection
CN104270392A (en) * 2014-10-24 2015-01-07 中国科学院信息工程研究所 Method and system for network protocol recognition based on tri-classifier cooperative training learning
CN106559261A (en) * 2016-11-03 2017-04-05 国网江西省电力公司电力科学研究院 A kind of substation network intrusion detection of feature based fingerprint and analysis method
CN106657141A (en) * 2017-01-19 2017-05-10 西安电子科技大学 Android malware real-time detection method based on network flow analysis

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100034102A1 (en) * 2008-08-05 2010-02-11 At&T Intellectual Property I, Lp Measurement-Based Validation of a Simple Model for Panoramic Profiling of Subnet-Level Network Data Traffic
US8311956B2 (en) * 2009-08-11 2012-11-13 At&T Intellectual Property I, L.P. Scalable traffic classifier and classifier training system
CN102315974A (en) * 2011-10-17 2012-01-11 北京邮电大学 Stratification characteristic analysis-based method and apparatus thereof for on-line identification for TCP, UDP flows
CN103870751A (en) * 2012-12-18 2014-06-18 中国移动通信集团山东有限公司 Method and system for intrusion detection
CN104270392A (en) * 2014-10-24 2015-01-07 中国科学院信息工程研究所 Method and system for network protocol recognition based on tri-classifier cooperative training learning
CN106559261A (en) * 2016-11-03 2017-04-05 国网江西省电力公司电力科学研究院 A kind of substation network intrusion detection of feature based fingerprint and analysis method
CN106657141A (en) * 2017-01-19 2017-05-10 西安电子科技大学 Android malware real-time detection method based on network flow analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JEFFREY ERMAN等: "Traffic classification using clustering algorithms", 《PROCEEDINGS OF THE 2006 SIGCOMM WORKSHOP ON MINING NETWORK DATA》 *
张炜: "基于多分类器的网络流量分类研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359109A (en) * 2018-08-23 2019-02-19 阿里巴巴集团控股有限公司 A kind of data processing method and system calculated based on distributed stream
CN109359109B (en) * 2018-08-23 2022-05-27 创新先进技术有限公司 Data processing method and system based on distributed stream computing
CN109309630A (en) * 2018-09-25 2019-02-05 深圳先进技术研究院 A kind of net flow assorted method, system and electronic equipment
CN109309630B (en) * 2018-09-25 2021-09-21 深圳先进技术研究院 Network traffic classification method and system and electronic equipment
CN110059726A (en) * 2019-03-22 2019-07-26 中国科学院信息工程研究所 The threat detection method and device of industrial control system
WO2021047401A1 (en) * 2019-09-10 2021-03-18 华为技术有限公司 Service classification method and apparatus, and internet system
CN110781950A (en) * 2019-10-23 2020-02-11 新华三信息安全技术有限公司 Message processing method and device
CN110781950B (en) * 2019-10-23 2023-06-30 新华三信息安全技术有限公司 Message processing method and device
CN112836214A (en) * 2019-11-22 2021-05-25 南京聚铭网络科技有限公司 Communication protocol hidden channel detection method
CN112380406A (en) * 2020-11-15 2021-02-19 杭州光芯科技有限公司 Real-time network traffic classification method based on crawler technology
CN112380406B (en) * 2020-11-15 2022-11-18 杭州光芯科技有限公司 Real-time network traffic classification method based on crawler technology
CN112423324A (en) * 2021-01-22 2021-02-26 深圳市科思科技股份有限公司 Wireless intelligent decision communication method, device and system

Also Published As

Publication number Publication date
CN107967311B (en) 2021-06-29

Similar Documents

Publication Publication Date Title
CN107967311A (en) A kind of method and apparatus classified to network data flow
CN112738015B (en) Multi-step attack detection method based on interpretable convolutional neural network CNN and graph detection
CN111340191B (en) Bot network malicious traffic classification method and system based on ensemble learning
US9781139B2 (en) Identifying malware communications with DGA generated domains by discriminative learning
CN106657141A (en) Android malware real-time detection method based on network flow analysis
CN105871619B (en) A kind of flow load type detection method based on n-gram multiple features
CN113329023A (en) Encrypted flow malice detection model establishing and detecting method and system
Rassam et al. Artificial immune network clustering approach for anomaly intrusion detection
CN106897733A (en) Video stream characteristics selection and sorting technique based on particle swarm optimization algorithm
CN111224994A (en) Botnet detection method based on feature selection
CN112800424A (en) Botnet malicious traffic monitoring method based on random forest
Li et al. Improving attack detection performance in NIDS using GAN
Song et al. Unsupervised anomaly detection based on clustering and multiple one-class SVM
US20160127290A1 (en) Method and system for detecting spam bot and computer readable storage medium
CN112910853A (en) Encryption flow classification method based on mixed characteristics
CN114500396B (en) MFD chromatographic feature extraction method and system for distinguishing anonymous Torr application flow
CN114553722B (en) VPN and non-VPN network flow classification method based on multi-view one-dimensional convolutional neural network
CN114422211B (en) HTTP malicious traffic detection method and device based on graph attention network
CN109325814A (en) A method of for finding suspicious trade network
CN107832611B (en) Zombie program detection and classification method combining dynamic and static characteristics
CN111464510A (en) Network real-time intrusion detection method based on rapid gradient lifting tree model
CN109450876B (en) DDos identification method and system based on multi-dimensional state transition matrix characteristics
Santhosh et al. Detection Of DDOS Attack using Machine Learning Models
CN106557983B (en) Microblog junk user detection method based on fuzzy multi-class SVM
CN114124565B (en) Network intrusion detection method based on graph embedding

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1253991

Country of ref document: HK

TA01 Transfer of patent application right

Effective date of registration: 20201020

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201020

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: P.O. Box 847, Fourth Floor, Capital Building, Grand Cayman, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant