CN109151880A - Mobile application flow identification method based on multilayer classifier - Google Patents
- Publication number: CN109151880A
- Application number: CN201811326852.4A
- Authority: CN (China)
- Keywords: classifier, sample, training, flow, class
- Legal status: Granted
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/08—Testing, supervising or monitoring using real traffic
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the field of network traffic analysis. To address the problem that existing mobile application traffic identification methods cannot detect and handle background traffic, it provides a mobile application traffic identification method based on a multilayer classifier. The technical scheme is as follows: first, extract features from the traffic training set to obtain a feature representation of each traffic sample; second, train a first-layer classifier that preliminarily detects whether a sample to be detected is target traffic or background traffic; third, train a second-layer classifier that performs fine-grained identification of the target traffic; fourth, train a third-layer classifier; and fifth, use the trained multilayer classifier to identify the mobile application traffic of the sample to be detected. The invention fully considers the traffic distribution in a real network. Without a complete background traffic data set, it learns the features of the target traffic samples layer by layer, so that the classifier can exclude background traffic while identifying target traffic, which reduces the number of false positives.
Description
Technical field
The invention belongs to the field of network traffic analysis and relates to a network traffic identification method based on machine learning, in particular to a mobile application traffic identification method based on a multilayer classifier.
Background art
With the spread of mobile devices and the flourishing of mobile applications, mobile applications have become the most common way for people to access the network. By the end of the first quarter of 2018, the Google application market offered 3.8 million applications for download, with an average of 6,140 new applications added per day. By 2017, 57% of network traffic came from mobile devices. Mobile network traffic has therefore overtaken traditional workstation traffic as the main component of network traffic, and research attention has shifted from traditional workstation traffic identification to mobile network traffic identification.
The goal of mobile network traffic identification is to identify the application that generated a given piece of mobile traffic. The technology plays an important role in network management and security, market research, user analysis and other fields. For example, based on this technology, a service provider can learn the distribution of mobile application traffic in its network; a network administrator can find out which network applications are popular on a campus and optimize the allocation of network resources to improve user experience; and an advertising provider can learn when and where a given application is most popular with users, and so formulate a more reasonable ad delivery strategy.
Although mobile network traffic identification follows a process similar to traditional desktop traffic identification, the particular properties of mobile traffic pose major challenges to traditional traffic identification techniques:
1) Mobile application traffic is mostly carried over HTTP/HTTPS, so port-based traffic identification can only label it as Web traffic. Other transport ports are generally random port numbers, which makes port-based identification fail completely.
2) To protect user privacy, mobile traffic is mostly transmitted over encrypted protocols, which reduces the effectiveness of traffic identification based on deep packet inspection (DPI, Deep Packet Inspection).
3) Mobile applications make heavy use of third-party libraries, so different applications can generate similar traffic that is difficult to identify using DPI or IP addresses.
4) CDN (Content Distribution Network) technology is widely used by mobile applications. With a CDN, a single server IP address may serve several different applications at the same time, which reduces the effectiveness of traffic identification based on DNS (Domain Name System). In addition, some applications may obtain server addresses without using DNS, which further narrows the applicability of DNS-based traffic identification.
5) Mobile applications are numerous and updated quickly, and new applications keep emerging, so identification techniques must be updated constantly; DPI, for example, requires a continuously updated payload signature library.
For these reasons, traditional traffic identification methods cannot handle mobile traffic effectively. In recent years, machine-learning-based traffic identification has shown good classification performance on traditional desktop network traffic, and some work has therefore applied it to mobile application traffic identification.
Wang et al. (Wang et al., "I know what you did on your smartphone: Inferring app usage over encrypted data traffic," IEEE Conference on Communications and Network Security, 2015, 433-441) manually collected 5 minutes of traffic from each of 13 applications running on iOS and trained a random forest classifier. However, the sample size used in this work is too small, so the effectiveness of the method is difficult to assess.
AppScanner (Vincent et al., "Robust smartphone app identification via encrypted network traffic analysis," IEEE Transactions on Information Forensics & Security, 2017, 13(1): 63-78) uses the random forest algorithm to extract application fingerprints and identify traffic. Its data set consists of traffic generated by 110 applications on two different Android devices. However, the work uses the "burst" (a group of packets within a period of time whose inter-arrival times are below a given threshold) as the basic unit of identification, so the method is only suitable for traffic identification in simple networks. Wang et al. (Wang et al., "End-to-end encrypted traffic classification with one-dimensional convolution neural networks," IEEE International Conference on Intelligence and Security Informatics, 2017, 43-48) identify traffic with a one-dimensional convolutional neural network model and reach a true positive rate of up to 86.6% when classifying traffic at fine granularity. Deep Packet (Mohammad et al., "Deep packet: A novel approach for encrypted traffic classification using deep learning," arXiv, 2017) classifies mobile application traffic using a one-dimensional convolutional neural network and a stacked autoencoder. Giuseppe et al. (Giuseppe et al., "Mobile encrypted traffic classification using deep learning," 2018) compared four neural-network-based identification methods and pointed out that the classifier proposed by Wang et al. (2017, cited above) has the best mobile application traffic identification performance.
In summary, although the mobile application traffic identification methods above all report excellent results, none of them considers the effect of unknown background traffic on classifier performance. They test the classifier only in a closed environment, in which all test traffic comes from applications present in the training set. In a real network, beyond the target applications there are thousands of unknown applications and new ones keep emerging, and the background traffic generated by these non-target applications poses a great challenge to the classifier. Because their test environments ignore this problem, these methods cannot be deployed in real network environments.
Summary of the invention
To address the problem that existing machine-learning-based mobile application traffic identification methods cannot detect and handle background traffic, the present invention provides a mobile application traffic identification method based on a multilayer classifier. It learns the features of target traffic samples layer by layer, so that the classifier can exclude background traffic while identifying target traffic, which reduces the number of false positives.
The technical scheme is as follows:
Step 1: extract features from the traffic training set to obtain the feature representation of each traffic sample. Each traffic sample is called a flow.
Step 2: train the first-layer classifier, which preliminarily detects whether a sample to be detected is target traffic or background traffic. Target traffic is denoted as the Target class and background traffic as the Other class.
Step 3: extract fuzzy flows, construct the training set of the second-layer classifier, and then train the second-layer classifier, which performs fine-grained identification of target traffic. A fuzzy flow is similar traffic that is generated by several applications at once, such as third-party library or advertising traffic. The i-th target application is denoted App_i. The number of target applications is N, where N is a natural number.
Step 4: re-extract background traffic samples, construct the training set of the third-layer classifier, and then train the third-layer classifier.
Step 5: identify the mobile application traffic of the sample to be detected using the trained multilayer classifier, as follows. First, the first-layer classifier identifies the flow sample as Target or Other; samples identified as Target go on to the second-layer classifier. Then, the second-layer classifier identifies the Target sample at fine granularity as a specific target application or as a fuzzy flow. If a sample is identified as target application App_i, it goes on to the third layer, where the relevant classifiers continue the identification. If the third-layer classifiers give a consistent result, the final identification result is output; otherwise the method declines to make a decision.
As a further refinement of the technical scheme of the invention, the features of the traffic training set in Step 1 are extracted as follows. The raw traffic is first grouped by the five-tuple <source IP, destination IP, source port, destination port, protocol> to form flows. If a flow contains five or fewer packets with non-zero payload, the 29 traffic features are extracted from the whole flow. If a flow contains more than five packets with non-zero payload, the 29 traffic features are extracted from only the first five packets with non-zero payload, and all packets after the fifth packet with non-zero payload are discarded.
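The grouping and truncation rule can be illustrated with a minimal Python sketch. The packet dictionaries and the names `flow_key`, `group_into_flows` and `truncate_flow` are hypothetical and not part of the patent. The sketch also assumes that both directions of a connection belong to one flow (the later features use client-to-server and server-to-client packets), and that zero-payload packets before the fifth payload-carrying packet are kept; the text does not state either point explicitly.

```python
from collections import defaultdict

MAX_PAYLOAD_PACKETS = 5  # the text keeps at most five packets with non-zero payload


def flow_key(pkt):
    """Direction-independent five-tuple key (assumption: a flow is bidirectional)."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    return (min(a, b), max(a, b), pkt["proto"])


def group_into_flows(packets):
    """Group packets by five-tuple, preserving arrival order inside each flow."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    return dict(flows)


def truncate_flow(flow_packets, limit=MAX_PAYLOAD_PACKETS):
    """Keep packets up to and including the `limit`-th packet with non-zero payload."""
    kept, seen = [], 0
    for pkt in flow_packets:
        kept.append(pkt)
        if len(pkt["payload"]) > 0:
            seen += 1
            if seen == limit:
                break
    return kept
```

A flow with five or fewer payload-carrying packets passes through `truncate_flow` unchanged, which matches the first case described above.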
As a further refinement of the technical scheme of the invention, the 29 traffic features are the destination port, the first 16 payload bytes of the flow, and 12 statistical features. The 12 statistical features are: the maximum and minimum packet size from client to server; the maximum, minimum, mean and variance of the packet size from server to client; the payload sizes of the first 3 packets with non-zero payload from client to server; the payload sizes of the 1st and 3rd packets with non-zero payload from server to client; and the minimum packet size of the flow.
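As a rough illustration of the 29-dimensional feature vector (1 port + 16 payload bytes + 12 statistics), the following Python sketch builds it from an already truncated flow. The packet fields `dir`, `size` and `payload`, the function name `extract_features`, the zero padding of missing values, the use of population variance, and the choice to take the 16 bytes from the concatenated payload are all assumptions made for the example; the patent does not specify them.

```python
from statistics import mean, pvariance


def _nth(values, index):
    """Return values[index], or 0 when the flow has too few such packets (assumption)."""
    return values[index] if index < len(values) else 0


def extract_features(packets, dst_port):
    """Return [port] + 16 payload bytes + 12 statistics = 29 features."""
    c2s = [p["size"] for p in packets if p["dir"] == "c2s"]
    s2c = [p["size"] for p in packets if p["dir"] == "s2c"]
    c2s_pay = [len(p["payload"]) for p in packets if p["dir"] == "c2s" and p["payload"]]
    s2c_pay = [len(p["payload"]) for p in packets if p["dir"] == "s2c" and p["payload"]]

    stats = [
        max(c2s, default=0), min(c2s, default=0),          # client -> server size max/min
        max(s2c, default=0), min(s2c, default=0),          # server -> client size max/min
        mean(s2c) if s2c else 0,                           # server -> client mean
        pvariance(s2c) if s2c else 0,                      # server -> client variance
        _nth(c2s_pay, 0), _nth(c2s_pay, 1), _nth(c2s_pay, 2),  # first 3 c2s payload sizes
        _nth(s2c_pay, 0), _nth(s2c_pay, 2),                # 1st and 3rd s2c payload sizes
        min((p["size"] for p in packets), default=0),      # smallest packet in the flow
    ]
    payload = b"".join(p["payload"] for p in packets)[:16]
    first_bytes = list(payload.ljust(16, b"\x00"))         # pad short flows with zeros
    return [dst_port] + first_bytes + stats
```

In the scheme described here, the first two layers would use only the port and the 12 statistics, and the third layer only the 16 payload bytes, so a caller would slice this vector accordingly.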
As a further refinement of the technical scheme of the invention, the first-layer classifier in Step 2 is trained as follows: the labels of the training samples are divided into the Target class and the Other class, and a random forest binary classifier is trained. During training, the weight of the target traffic samples is increased so that the classifier preferentially learns the features of the target samples.
As a further refinement of the technical scheme of the invention, the features of the training samples in Step 2 are the port and the 12 statistical features.
As a further refinement of the technical scheme of the invention, the second-layer classifier in Step 3 is trained as follows:
Step 3.1: extract fuzzy flows. First, the flow samples are grouped by the two-tuple <server IP, server port>. For each group, if the flow samples in the group carry more than one application label and no single application's samples dominate, the flows in the group are fuzzy flows. In the present invention, an application is considered dominant when its samples account for 90% or more of the samples in the group.
Step 3.2: construct the training sample set. The flow samples of each target application that remain in the original training set after the fuzzy flows are removed are combined with the fuzzy flow samples extracted in Step 3.1 to form the training sample set of the second-layer classifier, which contains N+1 classes in total.
Step 3.3: train the second-layer (N+1)-class random forest classifier.
As a further refinement of the technical scheme of the invention, the features of the training samples in Step 3.3 are the port and the 12 statistical features.
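The fuzzy flow (fuzzy stream) extraction rule of Step 3.1 can be sketched in Python as follows. The sample fields `server_ip`, `server_port` and `app`, the function name `label_fuzzy_flows` and the class name `"Fuzzy"` are hypothetical. The 90% dominance threshold comes from the text; how minority samples inside a dominated group are treated is not stated, and this sketch simply leaves their labels unchanged.

```python
from collections import Counter, defaultdict

DOMINANCE_THRESHOLD = 0.9  # from the text: 90% or more of a group counts as dominant


def label_fuzzy_flows(samples, threshold=DOMINANCE_THRESHOLD):
    """Return one label per sample, replacing the app label with "Fuzzy"
    for every flow in a <server IP, server port> group that has several
    application labels and no dominant application."""
    groups = defaultdict(list)
    for idx, s in enumerate(samples):
        groups[(s["server_ip"], s["server_port"])].append(idx)

    labels = [s["app"] for s in samples]
    for indices in groups.values():
        counts = Counter(samples[i]["app"] for i in indices)
        if len(counts) > 1:
            top_share = counts.most_common(1)[0][1] / len(indices)
            if top_share < threshold:
                for i in indices:
                    labels[i] = "Fuzzy"
    return labels
```

The returned labels (N application classes plus `"Fuzzy"`) correspond to the N+1 classes used to train the second-layer classifier.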
As a further refinement of the technical scheme of the invention, the third-layer classifier in Step 4 is trained as follows:
Step 4.1: extract Other-class samples. The first-layer classifier trained in Step 2 is used to classify the original training set, and the Other-class samples that are misclassified as Target are extracted to form a new Other-class data set.
Step 4.2: construct the training sample set. The Other-class data set extracted in Step 4.1 is combined with the training set from Step 3 to form the training sample set of the third-layer classifier, which contains N+2 classes in total: the N target application classes, the fuzzy flow class and the Other class.
Step 4.3: train the third-layer classifier. Random forest and XGBoost models (Chen et al., "XGBoost: A Scalable Tree Boosting System," ACM International Conference on Knowledge Discovery and Data Mining, 2016, 785-794) are selected to train the third-layer classifier. For each model, the one-vs-one method (Zhou Zhihua, Machine Learning, 2016, 63-66) is used, that is, one binary classifier is trained for every pair of classes, giving (N+2)*(N+1)/2 binary classifiers per model. The final third-layer classifier therefore contains (N+2)*(N+1) classifiers. The features of the training set are the first 16 payload bytes.
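The classifier counts in Step 4.3 can be checked with a short Python sketch that enumerates the one-vs-one class pairs. The function names and the class names `"Fuzzy"` and `"Other"` are hypothetical; N = 7 is used only because the embodiment later selects seven target applications.

```python
from itertools import combinations


def third_layer_pairs(app_names):
    """All unordered class pairs over N apps + fuzzy class + Other class."""
    classes = list(app_names) + ["Fuzzy", "Other"]
    return list(combinations(classes, 2))


def relevant_pairs(pairs, app):
    """Pairs that involve a given target application."""
    return [pair for pair in pairs if app in pair]


apps = [f"App{i}" for i in range(1, 8)]      # N = 7
pairs = third_layer_pairs(apps)
per_model = len(pairs)                        # (N+2)*(N+1)/2 = 36
total = 2 * per_model                         # random forest + XGBoost = 72
per_app = len(relevant_pairs(pairs, "App1"))  # N+1 = 8 per model
```

With two model types, each target application is therefore covered by 2*(N+1) pairwise classifiers, which is the number used at identification time.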
As a further refinement of the technical scheme of the invention, the mobile application traffic identification in Step 5 proceeds as follows:
Step 5.1: first-layer classification. Extract the port and the 12 statistical features of the flow sample and identify it with the first-layer classifier. If it is identified as Target, go to Step 5.2. Otherwise it is judged to be background traffic and identification ends.
Step 5.2: second-layer classification. Extract the port and the 12 statistical features of the flow sample and identify it with the second-layer classifier. If it is identified as a target application App_i, go to Step 5.3; if it is identified as a fuzzy flow, identification ends and no specific application label is given.
Step 5.3: third-layer classification. Extract the first 16 payload bytes of the flow sample and identify it with 2*(N+1) classifiers of the third layer. If the results of all 2*(N+1) classifiers agree with the result of Step 5.2, the sample is judged to be App_i; otherwise identification ends and no specific application label is given.
As a further refinement of the technical scheme of the invention, the 2*(N+1) classifiers in Step 5.3 are N+1 random forest classifiers and N+1 XGBoost classifiers. The N+1 binary random forest classifiers are: the binary classifiers trained on App_i samples versus App_j samples (for each j not equal to i), the binary classifier trained on App_i samples versus Other-class samples, and the binary classifier trained on App_i samples versus fuzzy flow samples. The N+1 binary XGBoost classifiers are defined in the same way: the binary classifiers trained on App_i samples versus App_j samples (for each j not equal to i), the binary classifier trained on App_i samples versus Other-class samples, and the binary classifier trained on App_i samples versus fuzzy flow samples.
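The three-stage decision logic of Steps 5.1 to 5.3 can be sketched in Python as below. The classifiers are passed in as plain callables, so the sketch is independent of any particular library; `identify`, `layer1`, `layer2` and `layer3_for` are hypothetical names, and returning `None` for "no specific application label" is a choice made for this example.

```python
def identify(flow, layer1, layer2, layer3_for):
    """Cascade a flow through the three layers.

    layer1(flow)      -> "Target" or "Other"
    layer2(flow)      -> an application name or "Fuzzy"
    layer3_for(label) -> the pairwise classifiers relevant to that application,
                         each a callable returning one of its two class names
    Returns "Other", an application name, or None when no label is given.
    """
    if layer1(flow) != "Target":
        return "Other"                      # Step 5.1: background traffic
    label = layer2(flow)
    if label == "Fuzzy":
        return None                         # Step 5.2: fuzzy flow, no app label
    votes = [clf(flow) for clf in layer3_for(label)]
    if all(vote == label for vote in votes):
        return label                        # Step 5.3: all relevant classifiers agree
    return None                             # Step 5.3: disagreement, decline to decide
```

Requiring every relevant classifier to agree is the strictest setting; a softer variant would accept the label once a chosen number of classifiers agree.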
Compared with the prior art, the beneficial effects of the invention are as follows:
Because there are thousands of unknown applications and new applications keep emerging, a complete background traffic data set cannot be collected. A classifier therefore cannot learn every background traffic pattern and cannot effectively exclude background traffic it has not learned. Without a complete background traffic data set, the classifier designed in this invention gains the ability to exclude non-target samples by learning the target samples better layer by layer, which mitigates the impact of unknown traffic on classifier performance.
The invention fully considers the traffic distribution in real networks. The proposed multilayer classifier can effectively detect the large amount of background traffic present in a network, and offers guidance for deploying machine-learning-based mobile application traffic identification methods in practice.
Brief description of the drawings
Fig. 1 is the overall flow chart of the invention;
Fig. 2 is the identification flow chart of the classifier of the invention;
Fig. 3 shows the precision and recall of the second-layer and third-layer classifiers in the embodiment of the invention;
Fig. 4 compares classifier precision and recall with and without fuzzy flow detection in the embodiment of the invention;
Fig. 5 compares the number of classifier false positives with and without fuzzy flow detection in the embodiment of the invention;
Fig. 6 compares classifier performance under different decision thresholds in the embodiment of the invention;
Fig. 7 compares classifier precision in the embodiment of the invention;
Fig. 8 compares classifier recall in the embodiment of the invention;
Fig. 9 compares the number of classifier false positives in the embodiment of the invention.
Specific embodiment
Embodiments of the invention are further described below with reference to an example.
As shown in Fig. 1, the mobile application traffic identification method based on a multilayer classifier comprises the following steps:
Step 1: extract the features of the traffic training set, i.e. represent each sample by its features, 29 in total.
Step 2: train the first-layer classifier. The training samples are divided into the Target and Other classes, and a binary random forest classifier is trained. The result is the first-layer classifier in Fig. 2.
Step 3: train the second-layer classifier. Fuzzy flows are extracted first and the training set of the second-layer classifier, containing N+1 classes in total, is constructed; an (N+1)-class random forest classifier is then trained to identify target traffic at fine granularity. The result is the second-layer classifier in Fig. 2.
Step 4: train the third-layer classifier. Background traffic samples are re-extracted first and the training set of the third-layer classifier, containing N+2 classes in total, is constructed; the third-layer classifier is then trained, generating (N+2)*(N+1)/2 classifiers for each type of model. The result is the third-layer base classifier pool in Fig. 2.
Step 5: identify the sample to be detected. As shown in Fig. 2, the flow sample is first identified by the first-layer classifier as Target or Other. Samples identified as Target go on to the second-layer classifier; at this point these samples contain both target traffic and background traffic. The second-layer classifier identifies the Target sample at fine granularity as a specific target application or as a fuzzy flow. If a sample is identified as target application App_i, the relevant classifiers in the third layer continue the identification. If the third-layer classifiers give a consistent result, the final identification result is output; otherwise the method declines to make a decision.
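The fourth step above (re-extracting background samples before training the third layer) amounts to collecting the background samples that the first layer fails to exclude. A minimal Python sketch under stated assumptions: samples are dictionaries with a `label` field, `layer1_predict` is a hypothetical callable standing in for the trained first-layer classifier, and `layer2_training_set` is the N+1 class set built in the third step.

```python
def build_third_layer_training_set(train_samples, layer1_predict, layer2_training_set):
    """Combine the second-layer training set with the Other-class samples
    that the first-layer classifier wrongly labels as Target."""
    hard_other = [
        sample for sample in train_samples
        if sample["label"] == "Other" and layer1_predict(sample) == "Target"
    ]
    # The result covers N+2 classes: N applications, the fuzzy class and Other.
    return list(layer2_training_set) + hard_other
```

The idea is that the third layer only needs to learn the background traffic that survives the first layer, rather than all background traffic.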
The invention is tested on real network traffic to evaluate its effectiveness.
1) Data sets
Mobile application traffic generated by 12 users over nearly three months was collected locally. The mobile device brands involved include Huawei, Xiaomi and Samsung, among others; 160 mobile applications are covered; and the traffic was generated in 2G, 3G, 4G and wireless network environments. The collected traffic was divided into two data sets, as detailed in Table 1.
Table 1. Data set details
Data set 1 is used to train and test the proposed three-layer classifier. Seven applications have more than 5000 flow samples each and are chosen as target applications; the remaining 131 applications are treated as non-target applications, and their traffic is background traffic. Data set 2 is used only to test the three-layer classifier. It contains 22 applications that are not present in data set 1, accounting for 5569 flow samples in total. These 22 applications can therefore be regarded as emerging applications. The detailed composition of the two data sets is shown in Table 2.
Table 2. Composition of data set 1 and data set 2
2) Experimental setup
The classifiers are implemented with the Scikit-learn machine learning library. The three-layer classifier is compared with the currently best-performing method, the one-dimensional convolutional neural network classifier (abbreviated 1D-CNN) (Wang et al., "End-to-end encrypted traffic classification with one-dimensional convolution neural networks," IEEE International Conference on Intelligence and Security Informatics, 2017, 43-48), and with a single random forest baseline classifier. The baseline classifier is an (N+1)-class random forest with 30 trees and a maximum depth of 20. To implement 1D-CNN, the first 784 payload bytes of each flow are extracted as a one-dimensional input vector, and an (N+1)-class 1D-CNN classifier is trained with the Keras library. The parameters of the 1D-CNN model are the same as in the original work. Fuzzy flows are not used with the baseline classifier or the 1D-CNN classifier. For the three-layer classifier, the random forest models of the first two layers each contain 30 trees with a maximum depth of 20. Each random forest model in the third layer contains 20 trees with a maximum depth of 20. Each XGBoost model contains 10 trees with a maximum depth of 5.
Five evaluation metrics are used to evaluate the performance of the proposed classifier: true positives (TP), false positives (FP), false negatives (FN), precision and recall.
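For reference, precision and recall follow from the three counts by the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). A minimal Python sketch; the function names are hypothetical and the counts in the usage lines are made-up illustration values, not results from the experiments.

```python
def precision(tp, fp):
    """Share of flows labelled as an application that really belong to it."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of an application's flows that the classifier recovers."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Illustrative counts only: 90 correct, 10 background flows let through, 30 missed.
example_precision = precision(90, 10)   # 0.9
example_recall = recall(90, 30)         # 0.75
```

Background flows that are labelled as a target application increase FP and so lower precision, which is why the number of false positives is reported alongside the two ratios.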
3) Per-layer performance test
The classification performance of each layer of the proposed multilayer classifier is tested on data set 1. The data set is randomly split into a training set and a test set at a ratio of 7:3; training and testing are repeated 10 times and the average results are reported. For the first-layer classifier, precision and recall are 12.21% and 99.40% respectively. This matches expectations: the first layer excludes only a small part of the background traffic, so its precision for target application traffic is low but its recall is very high. The precision and recall of the second-layer and third-layer classifiers for each application are shown in Fig. 3. Fig. 3 shows that the classifiers have very high precision, demonstrating an excellent ability to exclude background traffic. However, the recall of the second-layer classifier on the last three applications is very low. This is because, among the test samples of the second-layer classifier, 57.29%, 47.43% and 33.97% of the samples of the last three applications respectively are judged to be fuzzy flows. Inspection of the training set of the second-layer classifier shows that the last three applications are closely related to other non-target applications in data set 1. For example, QQ is a popular instant messaging application that integrates many functions, such as news push, mail management and music. Each of these functions also has its own standalone application, namely Tencent News, QQ Mail and QQ Music. The traffic generated by QQ is therefore very likely to share similar or identical features with the traffic of these background applications. To reduce misjudgments, the classifier prefers to label such traffic as a fuzzy flow rather than give a specific class, which lowers the recall for QQ traffic. Taobao and Baidu are in a similar situation.
4) Fuzzy flow detection test
The experiment above shows that identifying fuzzy flows separately has a large effect on the recall of the second-layer classifier. This experiment therefore compares the performance of the multilayer classifier with and without fuzzy flow detection. Data set 1 is used, and the experimental procedure is the same as above. The results are shown in Fig. 4 and Fig. 5: Fig. 4 compares classifier precision and recall with and without fuzzy flow detection, and Fig. 5 compares the number of false positives with and without fuzzy flow detection.
Fig. 4 and Fig. 5 show that without fuzzy flow detection the recall of every application rises, and the recall of the last three applications improves markedly. However, the number of false positives increases accordingly and precision drops slightly. When tolerance for false positives is low, the classifier with fuzzy flow detection can be chosen; when high recall is the goal, fuzzy flow detection can be removed.
5) Third-layer classifier threshold test
The third-layer classifier uses multiple binary classifiers to learn the features of target application traffic. If the label of a flow is decided only when all relevant binary classifiers give a consistent result, precision is high and false positives are strongly suppressed, but correct decisions of the second layer are also more likely to be rejected, which lowers recall. Different decision thresholds of the third-layer classifier are therefore compared here. The experimental setup is the same as in the previous two experiments, and the comparison is shown in Fig. 6. In the figure, the value of RF is the minimum number of random forest models whose result must agree with the second-layer classifier, and the value of XG is the minimum number of XGBoost classifiers whose result must agree with the second-layer classifier, before the label of a flow can be decided.
Fig. 6 shows that as the decision condition becomes stricter, precision rises gradually from 80% to 99% and recall falls gradually from 66% to 53%. Increasing either XG or RF raises precision and lowers recall. When recall matters more, the value of XG or RF can be reduced; when precision matters more, the value of XG or RF can be increased.
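The RF and XG thresholds can be read as a simple agreement count over the relevant third-layer classifiers. The Python sketch below assumes that both thresholds must be met at the same time; the function name `accept_label` and its parameters are hypothetical, and the text does not spell out how the two thresholds are combined.

```python
def accept_label(label, rf_votes, xg_votes, rf_min, xg_min):
    """Accept the second-layer label when at least `rf_min` random forest
    votes and at least `xg_min` XGBoost votes agree with it."""
    rf_agree = sum(1 for vote in rf_votes if vote == label)
    xg_agree = sum(1 for vote in xg_votes if vote == label)
    return rf_agree >= rf_min and xg_agree >= xg_min
```

Setting `rf_min` and `xg_min` to the number of relevant classifiers of each type (N+1) reproduces the strictest rule, in which every relevant classifier must agree; lowering them trades precision for recall.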
6) classifier compares
This experiment compares the proposed three-layer classifier with a baseline classifier and a one-dimensional convolutional neural network classifier (1D-CNN). For the proposed three-layer classifier, fuzzy flow detection is used, and the verification condition of the third-layer classifier is set to the strictest case, i.e., the recognition results of all relevant classifiers must be consistent with the recognition result of the second-layer classifier.
(1) Test on data set 1
Data set 1 is randomly divided into a training set and a test set at a ratio of 7:3; training and testing are repeated 10 times, and the average of the test results is reported. The three classifiers compared are the baseline classifier consisting of a single random forest model, the one-dimensional convolutional neural network model (1D-CNN) proposed by Wang et al. (Wang et al., End-to-end encrypted traffic classification with one-dimensional convolution neural networks, IEEE International Conference on Intelligence and Security Informatics, 2017, 43-48), and the three-layer classifier of the present invention. The comparison results are shown in the first four rows of Table 3, and the precision, recall and number of false positives for each application are shown in Fig. 7, Fig. 8 and Fig. 9, respectively.
Table 3. Comparison of the classification performance of the four classifiers
The results show that the proposed classifier has the highest precision, reaching nearly 99%, and produces far fewer false positives than the other two classifiers. Compared with the baseline classifier, the proposed three-layer classifier reduces the number of false positives by 94%, which shows that the third-layer classifier has an outstanding ability to detect background traffic. However, because the recall of the proposed multilayer classifier on the last three applications is extremely low, its average recall is far below that of the baseline classifier. When the identification goal is application coverage, a low recall does not affect the result, whereas false positives must be kept as few as possible, so the method is well suited to such scenarios. When the identification goal is flow coverage, the recall of the method still needs to be improved; in scenarios that impose a requirement on precision, however, the method retains a large advantage.
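The three quantities discussed above (precision, recall and the number of false positives per application) can be computed from the test results roughly as follows. This is a minimal sketch: the function name `per_app_metrics` and the input format of (true label, predicted label) pairs are assumptions for illustration, and a flow predicted as the background class `Other` is treated as rejected.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def per_app_metrics(
    pairs: Iterable[Tuple[str, str]], background: str = "Other"
) -> Dict[str, Dict[str, float]]:
    """Precision, recall and false-positive count for each target application.

    pairs: (true_label, predicted_label) for every tested flow.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in pairs:
        if pred == truth:
            if truth != background:
                tp[truth] += 1
        else:
            if pred != background:
                fp[pred] += 1  # flow wrongly attributed to application `pred`
            if truth != background:
                fn[truth] += 1  # a flow of application `truth` was missed
    result = {}
    for app in sorted(set(tp) | set(fp) | set(fn)):
        claimed = tp[app] + fp[app]  # flows the classifier assigned to app
        actual = tp[app] + fn[app]   # flows that really belong to app
        result[app] = {
            "precision": tp[app] / claimed if claimed else 0.0,
            "recall": tp[app] / actual if actual else 0.0,
            "false_positives": fp[app],
        }
    return result


demo = [("QQ", "QQ"), ("QQ", "Other"), ("Other", "QQ"), ("WeChat", "WeChat")]
print(per_app_metrics(demo)["QQ"])
# {'precision': 0.5, 'recall': 0.5, 'false_positives': 1}
```

In this toy input, one background flow is attributed to QQ (a false positive) and one QQ flow is rejected as background (a missed flow), so both precision and recall for QQ are 0.5.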
It was further observed that the last three applications, which have low recall, have fewer training samples than the first four. The classifier was therefore retrained, using the SMOTEENN method (Batista et al., A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter, 2004, 20-29) to resample each class during training so that the numbers of training samples of all classes are balanced, in order to examine the influence of sample quantity on the classification performance of the model. The retrained classifier is named "three-layer classifier + SMOTEENN"; its recognition results are given in the last row of Table 3 and in Figs. 6-8. The results show that sample quantity is not the main factor affecting the recall of the classifier: although balancing the samples improves the recall somewhat, its negative effect on precision is larger.
(2) Test on data set 2
The classifier is trained on data set 1, and data set 2 is used as the test set to further verify the recognition performance of the classifier. The test results are shown in Table 4.
The results in Table 4 are similar to the test results on data set 1: compared with the other two classifiers, the proposed three-layer classifier has the best background traffic detection ability. On data set 2, the three-layer classifier produced 152 false positives in total, of which 45 flows came from new applications; that is, about 99.2% of the completely unknown traffic was eliminated. In contrast, the random forest produced 1478 false positives, 403 of which came from new applications, and the 1D-CNN produced 3963 false positives, 1348 of which came from new applications. The remaining 107 false positives of the three-layer classifier were then analyzed in detail. Of the 31 flows misclassified as Tencent Video, 3 came from QQ Music and 28 from Tencent News. QQ Music, Tencent News and Tencent Video are all Tencent applications and, because of their functional requirements, may well access the same resources. In addition, for QQ, 46 false positives came from Tencent Maps, Tencent Weibo and similar applications, a situation similar to that of the Tencent Video false positives. For the single false positive of Sogou Pinyin, the IP address of the flow was checked and compared with the other flows in the training set that have the same IP address; all of the related flows in the training set are labeled Sogou Pinyin. This false positive is therefore very likely a sample with an incorrect label.
Table 4. Test results on data set 2
Therefore, by learning the features of the target application traffic in a progressively refined, layer-by-layer manner, the present invention strengthens the classifier's ability to detect background traffic and enables it to detect traffic it has never learned. The experimental results show that the method achieves high precision and can effectively detect the background traffic generated by unknown and emerging applications, which gives it a considerable advantage in identification scenarios that require application coverage.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that the technical solution of the invention may be modified or equivalently replaced without departing from its spirit and scope.
Claims (9)
1. A mobile application traffic identification method based on a multilayer classifier, characterized by comprising the following steps:
The first step: extract the features of the traffic training set to obtain the feature representation of the traffic samples, each traffic sample being referred to as a flow;
The second step: train the first-layer classifier, which preliminarily detects a sample under test as target traffic or background traffic; target traffic is denoted the Target class and background traffic the Other class;
The third step: extract fuzzy flows, construct the training set of the second-layer classifier, and then train the second-layer classifier, which performs fine-grained identification of the target traffic; a fuzzy flow is similar traffic generated by multiple applications at the same time; the i-th target application is denoted Appi; the number of target applications is N, where N is a natural number;
The fourth step: extract background traffic samples again, construct the training set of the third-layer classifier, and then train the third-layer classifier;
The fifth step: identify the mobile application traffic of a sample under test using the trained multilayer classifier, as follows: first, the first-layer classifier identifies the sample under test as the Target class or the Other class, and a traffic sample identified as the Target class enters the second-layer classifier for further detection; then, the second-layer classifier identifies the Target-class sample at fine granularity as a specific target application or as a fuzzy flow; if a sample is identified as a target application, it enters the third-layer classifier, where the relevant classifiers continue the identification; when the third-layer classifiers give a consistent recognition result, the final recognition result is output; otherwise, judgment is refused.
2. The mobile application traffic identification method based on a multilayer classifier according to claim 1, characterized in that the features of the traffic training set are extracted in the first step as follows: the raw traffic is first grouped into flows according to the five-tuple <source IP, destination IP, source port, destination port, protocol>; if a flow contains five or fewer packets with a non-zero payload, the corresponding 29 traffic features are extracted from the whole flow; if a flow contains more than five packets with a non-zero payload, the corresponding 29 traffic features are extracted from only the first five packets with a non-zero payload, and all packets after the fifth packet with a non-zero payload are discarded.
3. The mobile application traffic identification method based on a multilayer classifier according to claim 1, characterized in that the first-layer classifier is trained in the second step as follows: the labels of the training set samples are divided into the Target class and the Other class, and a random forest binary classifier is trained; during training, the weight of the target traffic samples is increased so that the classifier preferentially learns the features of the target samples.
4. The mobile application traffic identification method based on a multilayer classifier according to claim 1, characterized in that the second-layer classifier is trained in the third step as follows:
Step 3.1: extract fuzzy flows, specifically: first, the traffic sample set is grouped by the two-tuple <server IP, server port>; for each group, if the traffic samples in the group carry multiple application labels and the samples of no single application dominate, the flows in the group are fuzzy flows;
Step 3.2: construct the training sample set, specifically: the traffic samples of each target application that remain in the original training set after the fuzzy flows have been extracted are combined with the fuzzy flow samples extracted in step 3.1 to form the training sample set of the second-layer classifier, which contains N+1 classes of samples in total;
Step 3.3: train the second-layer (N+1)-class random forest classifier.
5. The mobile application traffic identification method based on a multilayer classifier according to claim 1, characterized in that the third-layer classifier is trained in the fourth step as follows:
Step 4.1: extract Other-class samples, specifically: the first-layer classifier trained in the second step is used to classify the original training set, and the Other-class samples that are misclassified as the Target class are extracted to form a new Other-class data set;
Step 4.2: construct the training sample set, specifically: the Other-class data set extracted in step 4.1 is combined with the training set of the third step to form the training sample set of the third-layer classifier, which contains N+2 classes of samples in total, namely the N classes of target application samples, the fuzzy flow class and the Other class;
Step 4.3: train the third-layer classifier, specifically: random forest and XGBoost models are selected to train the third-layer classifier; for each model, (N+2)(N+1)/2 binary classifiers are trained using the one-vs-one method.
6. The mobile application traffic identification method based on a multilayer classifier according to claim 1, characterized in that the mobile application traffic identification of the fifth step is performed as follows:
Step 5.1: classification by the first-layer classifier, specifically: the port and the 12 statistical features of the traffic sample are extracted and identified by the first-layer classifier; if the sample is identified as Target, go to step 5.2; otherwise the sample is determined to be background traffic, and the procedure ends;
Step 5.2: classification by the second-layer classifier, specifically: the port and the 12 statistical features of the traffic sample are extracted and identified by the second-layer classifier; if the sample is identified as a specific target application, go to step 5.3; if it is identified as a fuzzy flow, the procedure ends;
Step 5.3: classification by the third-layer classifier, specifically: the first 16 bytes of the traffic sample are extracted and identified by 2(N+1) classifiers of the third layer; when the recognition results of the 2(N+1) classifiers are consistent with the result of step 5.2, the sample is determined to be Appi; otherwise, the procedure ends.
7. The mobile application traffic identification method based on a multilayer classifier according to claim 2, characterized in that the 29 traffic features comprise the destination port, the first 16 payload bytes of the flow, and 12 statistical features; the 12 statistical features are: the maximum and minimum packet size from client to server; the maximum, minimum, mean and variance of the packet size from server to client; the payload sizes of the first 3 packets with a non-zero payload from client to server; the payload sizes of the 1st and 3rd packets with a non-zero payload from server to client; and the minimum packet size of the flow.
8. The mobile application traffic identification method based on a multilayer classifier according to claim 3 or 4, characterized in that the features of the training set samples are the port and 12 statistical features; the 12 statistical features are: the maximum and minimum packet size from client to server; the maximum, minimum, mean and variance of the packet size from server to client; the payload sizes of the first 3 packets with a non-zero payload from client to server; the payload sizes of the 1st and 3rd packets with a non-zero payload from server to client; and the minimum packet size of the flow.
9. The mobile application traffic identification method based on a multilayer classifier according to claim 6, characterized in that the 2(N+1) classifiers in step 5.3 are N+1 random forest classifiers and N+1 XGBoost classifiers; the N+1 binary random forest classifiers comprise the binary classifiers trained on training sets consisting of Appi-class and Appj-class samples, where j is not equal to i, the binary classifier trained on a training set consisting of Appi-class and Other-class samples, and the binary classifier trained on a training set consisting of Appi-class and fuzzy-flow-class samples; the N+1 binary XGBoost classifiers comprise the binary classifiers trained on training sets consisting of Appi-class and Appj-class samples, where j is not equal to i, the binary classifier trained on a training set consisting of Appi-class and Other-class samples, and the binary classifier trained on a training set consisting of Appi-class and fuzzy-flow-class samples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811326852.4A CN109151880B (en) | 2018-11-08 | 2018-11-08 | Mobile application flow identification method based on multilayer classifier |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811326852.4A CN109151880B (en) | 2018-11-08 | 2018-11-08 | Mobile application flow identification method based on multilayer classifier |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109151880A true CN109151880A (en) | 2019-01-04 |
CN109151880B CN109151880B (en) | 2021-06-22 |
Family
ID=64808236
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811326852.4A Active CN109151880B (en) | 2018-11-08 | 2018-11-08 | Mobile application flow identification method based on multilayer classifier |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109151880B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101510841A (en) * | 2008-12-31 | 2009-08-19 | 成都市华为赛门铁克科技有限公司 | Method and system for identifying end-to-end traffic |
CN102315974A (en) * | 2011-10-17 | 2012-01-11 | 北京邮电大学 | Method and apparatus for online identification of TCP and UDP flows based on hierarchical feature analysis |
CN105141455A (en) * | 2015-08-24 | 2015-12-09 | 西南大学 | Noisy network traffic classification modeling method based on statistical characteristics |
US20180260705A1 (en) * | 2017-03-05 | 2018-09-13 | Verint Systems Ltd. | System and method for applying transfer learning to identification of user actions |
US20180278629A1 (en) * | 2017-03-27 | 2018-09-27 | Cisco Technology, Inc. | Machine learning-based traffic classification using compressed network telemetry data |
CN108632279A (en) * | 2018-05-08 | 2018-10-09 | 北京理工大学 | Multilayer anomaly detection method based on network traffic |
CN108737290A (en) * | 2018-05-11 | 2018-11-02 | 南开大学 | Unencrypted traffic identification method based on payload mapping and random forest |
Non-Patent Citations (1)
Title |
---|
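Hu Bin: "Research and application of traffic identification technology based on hybrid behavior features", China Master's Theses Full-text Database, Information Science and Technology (monthly) * |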
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109818976A (en) * | 2019-03-15 | 2019-05-28 | 杭州迪普科技股份有限公司 | Abnormal traffic detection method and device |
CN110311829B (en) * | 2019-05-24 | 2021-03-16 | 西安电子科技大学 | Network traffic classification method based on machine learning acceleration |
CN110311829A (en) * | 2019-05-24 | 2019-10-08 | 西安电子科技大学 | Network traffic classification method based on machine learning acceleration |
CN110247930A (en) * | 2019-07-01 | 2019-09-17 | 北京理工大学 | Encrypted network traffic identification method based on deep neural network |
CN110247930B (en) * | 2019-07-01 | 2020-05-12 | 北京理工大学 | Encrypted network flow identification method based on deep neural network |
CN110602041A (en) * | 2019-08-05 | 2019-12-20 | 中国人民解放军战略支援部队信息工程大学 | White list-based Internet of things equipment identification method and device and network architecture |
CN112953851A (en) * | 2019-12-10 | 2021-06-11 | 华为数字技术(苏州)有限公司 | Traffic classification method and traffic management equipment |
WO2021114844A1 (en) * | 2019-12-10 | 2021-06-17 | 华为技术有限公司 | Traffic classification method and traffic management device |
CN111382780A (en) * | 2020-02-13 | 2020-07-07 | 中国科学院信息工程研究所 | Encryption website fine-grained classification method and device based on HTTP different versions |
CN111382780B (en) * | 2020-02-13 | 2023-11-03 | 中国科学院信息工程研究所 | Encryption website fine granularity classification method and device based on HTTP (hyper text transport protocol) different versions |
CN113449747A (en) * | 2020-03-24 | 2021-09-28 | 阿里巴巴集团控股有限公司 | Data processing method, device, equipment and storage medium |
CN114362982A (en) * | 2020-10-12 | 2022-04-15 | 中兴通讯股份有限公司 | Flow subdivision identification method, system, electronic device and storage medium |
WO2022078042A1 (en) * | 2020-10-12 | 2022-04-21 | 中兴通讯股份有限公司 | Traffic segmentation recognition method and system, and electronic device and storage medium |
CN114362982B (en) * | 2020-10-12 | 2024-09-03 | 南京中兴新软件有限责任公司 | Traffic subdivision identification method, system, electronic device and storage medium |
CN113486935A (en) * | 2021-06-24 | 2021-10-08 | 南京烽火星空通信发展有限公司 | Block chain application flow identification method based on DPI and CNN |
Also Published As
Publication number | Publication date |
---|---|
CN109151880B (en) | 2021-06-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||