CN109151880A

CN109151880A - Mobile application flow identification method based on multilayer classifier

Info

Publication number: CN109151880A
Application number: CN201811326852.4A
Authority: CN
Inventors: 赵双; 陈曙晖; 孙品; 孙一品; 王飞; 苏金树
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2018-11-08
Filing date: 2018-11-08
Publication date: 2019-01-04
Anticipated expiration: 2038-11-08
Also published as: CN109151880B

Abstract

The invention belongs to the field of network traffic analysis and provides a mobile application traffic identification method based on a multilayer classifier, addressing the problem that existing mobile application traffic identification methods cannot detect and handle background traffic. The technical scheme is as follows: first, extract features from the traffic training set to obtain a feature representation of each traffic sample; second, train a first-layer classifier that preliminarily classifies a sample under test as target traffic or background traffic; third, train a second-layer classifier that performs fine-grained identification of the target traffic; fourth, train a third-layer classifier; and fifth, use the trained multilayer classifier to identify the mobile application traffic of the sample under test. The invention takes full account of the traffic distribution in real networks. Without a complete background traffic data set, it learns the features of the target traffic samples layer by layer, so that the classifier can identify target traffic while excluding background traffic, which reduces the number of false positives.

Description

Mobile application traffic identification method based on a multilayer classifier

Technical field

The invention belongs to the field of network traffic analysis and relates to a machine-learning-based network traffic identification method, in particular to a mobile application traffic identification method based on a multilayer classifier.

Background technique

With the spread of mobile devices and the flourishing of mobile applications, mobile applications have become the most common way for people to access the network. As of the first quarter of 2018, the Google application market offered 3.8 million applications for download, with an average of 6,140 new applications added per day. By 2017, 57% of network traffic came from mobile devices. Mobile traffic has therefore overtaken traditional workstation traffic as the main component of network traffic, and research attention has shifted from traditional workstation traffic identification to mobile traffic identification.

The goal of mobile traffic identification is to identify the application from which mobile traffic originates. The technique plays an important role in network management and security, market research, user analysis and other fields. For example, based on this technique, a service provider can learn how mobile application traffic is distributed in its network; a network administrator can find out which applications are popular on a campus network and adjust the allocation of network resources to improve user experience; and an advertising provider can learn when and where a given application is most popular with users and formulate a more reasonable advertisement delivery strategy.

Although mobile traffic identification follows a process similar to traditional desktop traffic identification, the particular properties of mobile traffic pose great challenges to traditional traffic identification techniques:

1) Most mobile application traffic is carried over HTTP/HTTPS, so port-based traffic identification can only label it as Web traffic. Other traffic generally uses random port numbers, which defeats port-based identification completely.

2) To protect user privacy, mobile traffic is mostly transmitted over encrypted protocols, which reduces the effectiveness of identification techniques based on deep packet inspection (DPI, Deep Packet Inspection).

3) Mobile applications make heavy use of third-party libraries, so different applications can generate similar traffic that is difficult to identify using DPI or IP addresses.

4) CDN (Content Distribution Network) technology is widely used by mobile applications. With a CDN, the IP address of one server may serve several different applications at the same time, which reduces the effectiveness of identification techniques based on DNS (Domain Name System). In addition, some applications may obtain server addresses without using DNS, which further narrows the applicability of DNS-based identification.

5) Mobile applications are numerous and updated quickly, and new applications keep emerging, so identification techniques need constant updating. DPI, for example, requires continuous updates to its payload signature library.

For these reasons, traditional traffic identification methods cannot handle mobile traffic effectively. In recent years, machine-learning-based traffic identification has shown good classification performance on traditional desktop network traffic, and some work has therefore applied it to mobile application traffic identification.

Wang et al. (Wang et al., I know what you did on your smartphone: Inferring app usage over encrypted data traffic, IEEE Conference on Communications and Network Security, 2015, 433-441) manually collected five minutes of traffic from each of 13 applications running on iOS and trained a random forest classifier. However, the number of samples used in that work is too small to assess the validity of the method.

AppScanner (Vincent et al., Robust smartphone app identification via encrypted network traffic analysis, IEEE Transactions on Information Forensics & Security, 2017, 13(1): 63-78) uses the random forest algorithm to extract application fingerprints and identify traffic. Its data set consists of traffic generated by 110 applications on two different Android devices. However, that work uses the "burst" (a group of packets within a period of time whose inter-arrival times are below a threshold) as the basic unit of identification, so the method is only suitable for traffic identification in simple networks.

Wang et al. (Wang et al., End-to-end encrypted traffic classification with one-dimensional convolution neural networks, IEEE International Conference on Intelligence and Security Informatics, 2017, 43-48) identify traffic with a one-dimensional convolutional neural network model and reach a true positive rate of up to 86.6% in fine-grained traffic classification. Deep Packet (Mohammad et al., Deep packet: A novel approach for encrypted traffic classification using deep learning, arXiv, 2017) classifies mobile application traffic with one-dimensional convolutional neural networks and stacked autoencoders. Giuseppe et al. (Giuseppe et al., Mobile encrypted traffic classification using deep learning, 2018), after evaluating four neural-network-based identification methods, pointed out that the classifier proposed by Wang et al. (Wang et al., End-to-end encrypted traffic classification with one-dimensional convolution neural networks, IEEE International Conference on Intelligence and Security Informatics, 2017, 43-48) has the best mobile application traffic identification performance.

In conclusion although mobile application method for recognizing flux set forth above all shows outstanding recognition result, These methods do not consider that Unknown Background flow is influenced to classifier performance bring, only testing classification device in a closed environment, I.e. test set flow both is from application involved in training set.And in live network, other than target application flow, unknown applications Thousands of and emerging application emerges one after another, these non-targeted application reasons for its use flows, which can bring classifier, greatly to be chosen War.And the test environment of the above method does not consider this problem, causes these methods that can not be deployed in real network environment.

Summary of the invention

Existing machine-learning-based mobile application traffic identification methods cannot detect and handle background traffic. To address this problem, the present invention provides a mobile application traffic identification method based on a multilayer classifier that learns the features of target traffic samples layer by layer, so that the classifier can exclude background traffic while identifying target traffic, which reduces the number of false positives.

The technical solution is as follows:

First step: extract features from the traffic training set to obtain a feature representation of each traffic sample. Each traffic sample is referred to as a flow.

Second step: train the first-layer classifier, which preliminarily classifies a sample under test as target traffic or background traffic. Target traffic is denoted the Target class and background traffic the Other class.

Third step: extract fuzzy flows, build the training set of the second-layer classifier, and then train the second-layer classifier, which performs fine-grained identification of the target traffic. Fuzzy flows are similar flows that several applications can all generate, such as third-party library traffic or advertisement traffic. The i-th target application is denoted Appi. The number of target applications is N, where N is a natural number.

Fourth step: re-extract background traffic samples, build the training set of the third-layer classifier, and then train the third-layer classifier.

Fifth step: use the trained multilayer classifier to identify the mobile application traffic of a sample under test. The method is as follows. First, the first-layer classifier identifies the flow sample under test as Target class or Other class, and samples identified as Target class are passed to the second-layer classifier. Next, the second-layer classifier identifies the Target class sample at fine granularity as a particular target application or as a fuzzy flow. If a sample is identified as target application Appi, it is passed to the third layer, where the relevant classifiers continue the identification. When the third-layer classifiers give consistent results, the final identification result is output; otherwise the classifier declines to give a judgment.

As a further improvement of the technical solution of the present invention, the first step extracts the features of the traffic training set as follows. The raw traffic is first grouped by five-tuple <source IP, destination IP, source port, destination port, protocol> to form flows. If a flow contains five or fewer packets with non-zero payload, the 29 traffic features are extracted from the whole flow. If a flow contains more than five packets with non-zero payload, the 29 traffic features are extracted using only the first five packets with non-zero payload, and all packets after the fifth packet with non-zero payload are discarded.
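The grouping and truncation rule can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the packet dictionary keys, the direction-independent flow key, and the choice to keep zero-payload packets that arrive before the fifth non-zero-payload packet are all assumptions made for the sketch.

```python
from collections import defaultdict

MAX_PAYLOAD_PACKETS = 5  # threshold stated in the text


def flow_key(pkt):
    """Direction-independent five-tuple key (an assumption of this sketch)."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    first, second = sorted([a, b])
    return (first, second, pkt["proto"])


def group_into_flows(packets):
    """Group packets, given in capture order, into flows by five-tuple."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    return dict(flows)


def truncate_flow(flow, limit=MAX_PAYLOAD_PACKETS):
    """Return the packets used for feature extraction.

    Flows with at most `limit` non-zero-payload packets are kept whole.
    Longer flows are cut right after the `limit`-th non-zero-payload packet.
    """
    nonzero = sum(1 for pkt in flow if pkt["payload_len"] > 0)
    if nonzero <= limit:
        return list(flow)
    kept, seen = [], 0
    for pkt in flow:
        kept.append(pkt)
        if pkt["payload_len"] > 0:
            seen += 1
            if seen == limit:
                break
    return kept
```

Truncating long flows means a flow can be classified after only a few data packets have been observed, rather than after the connection closes.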

As a further improvement of the technical solution of the present invention, the 29 traffic features comprise the destination port, the first 16 payload bytes of the flow, and 12 statistical features. The 12 statistical features are: the maximum and minimum packet size from client to server; the maximum, minimum, mean and variance of the packet size from server to client; the payload sizes of the first three packets with non-zero payload from client to server; the payload sizes of the first and third packets with non-zero payload from server to client; and the minimum packet size of the flow.
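A possible layout of the 29-value feature vector is sketched below. The packet dictionary keys, the zero padding for short flows, the use of population variance, and the ordering of the features inside the vector are assumptions of this sketch; the text only lists which features are used.

```python
from statistics import mean, pvariance

N_PAYLOAD_BYTES = 16  # number of leading payload bytes named in the text


def extract_features(flow, server_port):
    """Build [port] + 16 payload bytes + 12 statistical features.

    `flow` is a non-empty list of dicts with hypothetical keys:
    "dir" ("c2s" or "s2c"), "size" (packet size) and "payload" (bytes).
    """
    c2s = [p for p in flow if p["dir"] == "c2s"]
    s2c = [p for p in flow if p["dir"] == "s2c"]
    c2s_sizes = [p["size"] for p in c2s] or [0]  # 0 if a direction is empty
    s2c_sizes = [p["size"] for p in s2c] or [0]

    def payload_len(pkts, index):
        # Payload size of the index-th (0-based) non-zero-payload packet,
        # or 0 when the flow has no such packet (an assumption).
        loads = [len(p["payload"]) for p in pkts if p["payload"]]
        return loads[index] if index < len(loads) else 0

    stats = [
        max(c2s_sizes), min(c2s_sizes),
        max(s2c_sizes), min(s2c_sizes), mean(s2c_sizes), pvariance(s2c_sizes),
        payload_len(c2s, 0), payload_len(c2s, 1), payload_len(c2s, 2),
        payload_len(s2c, 0), payload_len(s2c, 2),
        min(p["size"] for p in flow),
    ]
    # First 16 payload bytes of the flow, zero-padded when fewer exist.
    stream = b"".join(p["payload"] for p in flow)[:N_PAYLOAD_BYTES]
    payload_bytes = list(stream.ljust(N_PAYLOAD_BYTES, b"\x00"))
    return [server_port] + payload_bytes + stats
```

With this layout, index 0 holds the port, indices 1 to 16 hold the payload bytes, and indices 17 to 28 hold the statistical features.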

As a further improvement of the technical solution of the present invention, the first-layer classifier in the second step is trained as follows: the labels of the training set samples are divided into the Target class and the Other class, and a binary random forest classifier is trained. During training, the weight of the target traffic samples is increased so that the classifier preferentially learns the features of the target samples.

As a further improvement of the technical solution of the present invention, the features of the training set samples in the second step are the port and the 12 statistical features.
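Preparing the first-layer training data might look like the sketch below. The weight value is hypothetical because the text does not state it, and the feature selection assumes a 29-value vector laid out as port, 16 payload bytes, then 12 statistical features. The resulting labels and weights could be passed to a random forest implementation that accepts per-sample weights, such as the `sample_weight` argument of scikit-learn's `RandomForestClassifier.fit`.

```python
TARGET, OTHER = "Target", "Other"
TARGET_WEIGHT = 5.0  # hypothetical value; the text does not give the weight


def first_layer_labels(app_labels, target_apps):
    """Collapse per-application labels into the Target / Other classes."""
    return [TARGET if app in target_apps else OTHER for app in app_labels]


def first_layer_weights(binary_labels, target_weight=TARGET_WEIGHT):
    """Give target samples a larger weight than background samples."""
    return [target_weight if y == TARGET else 1.0 for y in binary_labels]


def first_layer_features(vector29):
    """Keep the port and the 12 statistical features, drop the payload bytes.

    Assumes the layout [port, 16 payload bytes, 12 statistical features].
    """
    return [vector29[0]] + list(vector29[17:])
```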

As a further improvement of the technical solution of the present invention, the second-layer classifier in the third step is trained as follows:

Step 3.1: extract fuzzy flows. The flow sample set is first grouped by the two-tuple <server IP, server port>. For each group, if the flow samples in the group carry several application labels and no single application's samples are dominant, the flows in the group are fuzzy flows. In the present invention, an application is considered dominant when its samples account for 90% or more of the samples in the group.

Step 3.2: build the training sample set. The flow samples of each target application that remain in the original training set after the fuzzy flows are removed are combined with the fuzzy flow samples extracted in step 3.1 to form the training sample set of the second-layer classifier, which contains N+1 classes of samples in total.

Step 3.3: train the second-layer (N+1)-class random forest classifier.

As a further improvement of the technical solution of the present invention, the features of the training set samples in step 3.3 are the port and the 12 statistical features.
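Steps 3.1 and 3.2 can be illustrated with a short Python sketch. The sample dictionary keys and the `Fuzzy` label name are assumptions of the sketch; the 90% dominance threshold is the one given in the text.

```python
from collections import Counter, defaultdict

DOMINANCE = 0.9   # share at which one application counts as dominant
FUZZY = "Fuzzy"   # hypothetical label name for the fuzzy flow class


def split_fuzzy_flows(samples, dominance=DOMINANCE):
    """Step 3.1: split samples into (clear, fuzzy) by server endpoint.

    Each sample is a dict with hypothetical keys "server_ip",
    "server_port" and "app" (the application label).
    """
    groups = defaultdict(list)
    for s in samples:
        groups[(s["server_ip"], s["server_port"])].append(s)
    clear, fuzzy = [], []
    for members in groups.values():
        counts = Counter(s["app"] for s in members)
        top_share = counts.most_common(1)[0][1] / len(members)
        if len(counts) > 1 and top_share < dominance:
            fuzzy.extend(members)   # several apps, none dominant
        else:
            clear.extend(members)
    return clear, fuzzy


def second_layer_training_set(samples, target_apps):
    """Step 3.2: N target classes plus one fuzzy class."""
    clear, fuzzy = split_fuzzy_flows(samples)
    labelled = [(s, s["app"]) for s in clear if s["app"] in target_apps]
    labelled += [(s, FUZZY) for s in fuzzy]
    return labelled
```

Grouping by server endpoint reflects the idea that flows to the same server and port from many different applications are likely shared services, such as advertisement or third-party library endpoints.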

As a further improvement of the technical solution of the present invention, the third-layer classifier in the fourth step is trained as follows:

Step 4.1: extract Other class samples. The first-layer classifier trained in the second step is used to classify the original training set, and the Other class samples that are misclassified as Target class are extracted to form a new Other class data set.

Step 4.2: build the training sample set. The Other data set extracted in step 4.1 is combined with the training set of the third step to form the training sample set of the third-layer classifier, which contains N+2 classes of samples in total: the N classes of target application samples, the fuzzy flow class and the Other class.

Step 4.3: train the third-layer classifier. Random forest and XGBoost models (Chen et al., XGBoost: A Scalable Tree Boosting System, ACM International Conference on Knowledge Discovery and Data Mining, 2016, 785-794) are selected to train the third-layer classifier. For each model, the one-versus-one method (Zhou Zhihua, Machine Learning, 2016, 63-66) is used, that is, one binary classifier is trained between every two classes, giving (N+2)*(N+1)/2 binary classifiers per model. The final third-layer classifier therefore contains (N+2)*(N+1) classifiers. The features of the training set are the first 16 payload bytes.
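The sketch below illustrates step 4.1 and the one-versus-one bookkeeping of step 4.3. The first-layer classifier is represented by a plain callable, and the class and model names are placeholders chosen for the sketch.

```python
from itertools import combinations

OTHER, FUZZY = "Other", "Fuzzy"           # placeholder class names
MODELS = ("RandomForest", "XGBoost")      # the two model families in the text


def misclassified_other(samples, labels, first_layer_predict):
    """Step 4.1: Other samples that the first layer wrongly calls Target."""
    return [
        s for s, y in zip(samples, labels)
        if y == OTHER and first_layer_predict(s) == "Target"
    ]


def one_vs_one_pairs(target_apps):
    """All unordered class pairs over the N targets, fuzzy and Other."""
    classes = list(target_apps) + [FUZZY, OTHER]
    return list(combinations(classes, 2))


def third_layer_tasks(target_apps, models=MODELS):
    """One binary training task per (model, class pair)."""
    return [(m, a, b) for m in models for a, b in one_vs_one_pairs(target_apps)]
```

For N = 7 target applications this gives 9 * 8 / 2 = 36 pairs per model and 72 binary classifiers in total, matching (N+2)*(N+1).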

As a further improvement of the technical solution of the present invention, the mobile application traffic identification in the fifth step is performed as follows:

Step 5.1: classification by the first-layer classifier. The port and the 12 statistical features of the flow sample are extracted and the sample is identified with the first-layer classifier. If it is identified as Target, go to step 5.2. Otherwise it is judged to be background traffic and the procedure ends.

Step 5.2: classification by the second-layer classifier. The port and the 12 statistical features of the flow sample are extracted and the sample is identified with the second-layer classifier. If it is identified as a target application Appi, go to step 5.3. If it is identified as a fuzzy flow, the procedure ends without giving a specific application label.

Step 5.3: classification by the third-layer classifier. The first 16 payload bytes of the flow sample are extracted and the sample is identified with 2*(N+1) classifiers of the third layer. When the results of the 2*(N+1) classifiers agree with step 5.2, the sample is judged to be Appi. Otherwise the procedure ends without giving a specific application label.

As a further improvement of the technical solution of the present invention, the 2*(N+1) classifiers in step 5.3 are N+1 random forest classifiers and N+1 XGBoost classifiers. The N+1 binary random forest classifiers comprise the binary classifiers trained on training sets made of Appi class samples and Appj class samples (j not equal to i), the binary classifier trained on the training set made of Appi class samples and Other class samples, and the binary classifier trained on the training set made of Appi class samples and fuzzy flow class samples. The N+1 binary XGBoost classifiers comprise the corresponding classifiers: those trained on Appi class samples and Appj class samples (j not equal to i), the one trained on Appi class samples and Other class samples, and the one trained on Appi class samples and fuzzy flow class samples.
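The three-stage decision of steps 5.1 to 5.3 can be sketched as below. Each classifier is modelled as a plain callable, and the lookup table keyed by model name and class pair is a hypothetical structure chosen for the sketch, not something the text prescribes.

```python
MODELS = ("RandomForest", "XGBoost")  # placeholder model names


def relevant_pairs(app, classes):
    """The N+1 class pairs of the one-versus-one scheme that involve `app`."""
    return [(app, other) for other in classes if other != app]


def identify(sample, layer1, layer2, layer3, classes):
    """Return an application name, "Other", or None when no label is given.

    layer1(sample) -> "Target" or "Other"
    layer2(sample) -> an application name or "Fuzzy"
    layer3[(model, frozenset({a, b}))](sample) -> a or b
    """
    if layer1(sample) != "Target":
        return "Other"                      # step 5.1: background traffic
    app = layer2(sample)
    if app == "Fuzzy":
        return None                         # step 5.2: fuzzy flow, no label
    for model in MODELS:                    # step 5.3: 2*(N+1) classifiers
        for pair in relevant_pairs(app, classes):
            vote = layer3[(model, frozenset(pair))](sample)
            if vote != app:
                return None                 # any disagreement rejects
    return app
```

Requiring every relevant third-layer classifier to agree favours precision over recall: a single dissenting vote is enough to withhold the application label.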

Compared with the prior art, the beneficial effects of the present invention are:

There are thousands of unknown applications and new ones keep emerging, so a complete background traffic data set cannot be collected. A classifier therefore cannot learn all background traffic patterns and cannot effectively exclude background traffic it has not learned. Without a complete background traffic data set, the classifier designed in the present invention gains the ability to exclude non-target samples by learning the target samples better layer by layer, which mitigates the impact of unknown traffic on classification performance;

The present invention takes full account of the traffic distribution in real networks. The proposed multilayer classifier can effectively detect the large amount of background traffic present in a network, and offers guidance for deploying machine-learning-based mobile application traffic identification methods in practice.

Brief description of the drawings

Fig. 1 is the overall flow chart of the present invention;

Fig. 2 is the classifier identification flow chart of the present invention;

Fig. 3 shows the precision and recall of the second-layer and third-layer classifiers in the embodiment of the present invention;

Fig. 4 compares the precision and recall of the classifier with and without fuzzy flow detection in the embodiment of the present invention;

Fig. 5 compares the number of false positives of the classifier with and without fuzzy flow detection in the embodiment of the present invention;

Fig. 6 compares classifier performance under different decision thresholds in the embodiment of the present invention;

Fig. 7 compares classifier precision in the embodiment of the present invention;

Fig. 8 compares classifier recall in the embodiment of the present invention;

Fig. 9 compares the number of classifier false positives in the embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in further detail below with reference to an example.

As shown in Fig. 1, the mobile application traffic identification method based on a multilayer classifier of the present invention comprises the following steps:

First step: extract the features of the traffic training set, that is, represent each sample by its features. There are 29 features in total.

Second step: train the first-layer classifier. The training data set samples are divided into the Target class and the Other class, and a binary random forest classifier is trained. The training result is the first-layer classifier in Fig. 2.

Third step: train the second-layer classifier. Fuzzy flows are extracted first, and the training set of the second-layer classifier is built, containing N+1 classes of samples in total. An (N+1)-class random forest classifier is then trained to identify the target traffic at fine granularity. The training result is the second-layer classifier in Fig. 2.

Fourth step: train the third-layer classifier. Background traffic samples are re-extracted first, and the training set of the third-layer classifier is built, containing N+2 classes of samples in total. The third-layer classifier is then trained, generating (N+2)*(N+1)/2 classifiers for each kind of model. The training result is the third-layer classifier pool in Fig. 2.

Fifth step: identify the sample under test. As shown in Fig. 2, the flow sample under test is first identified by the first-layer classifier as Target class or Other class. The flow samples identified as Target class are passed to the second-layer classifier. These samples contain both target traffic and background traffic. The second-layer classifier identifies the Target class at fine granularity as a particular target application or as a fuzzy flow. If a sample is identified as target application Appi, the relevant classifiers in the third layer continue the identification. When the third-layer classifiers give consistent results, the final identification result is output; otherwise the classifier declines to give a judgment.

The present invention was tested on real network traffic to assess its effectiveness.

1) Data sets

Mobile application traffic generated by 12 users over nearly three months was collected locally. The mobile device brands involved include Huawei, Xiaomi and Samsung, among others. A total of 160 mobile applications are covered, and the network environments in which the traffic was generated include 2G, 3G, 4G and wireless networks. The collected traffic was divided into two data sets, whose details are given in Table 1.

Table 1. Data set details

Data set 1 is used to train and test the proposed three-layer classifier. Seven of its applications have more than 5000 flow samples each and were selected as target applications. The remaining 131 applications serve as non-target applications, and their traffic is background traffic. Data set 2 is used only to test the three-layer classifier. It contains 22 applications that do not appear in data set 1, accounting for 5569 flow samples in total. These 22 applications can therefore be regarded as newly emerging applications. The detailed composition of the two data sets is shown in Table 2.

Table 2. Composition of data set 1 and data set 2

2) Experimental setup

The classifiers are implemented with the Scikit-learn machine learning library. The three-layer classifier is compared with the best-performing current method, a one-dimensional convolutional neural network classifier (abbreviated 1D-CNN) (Wang et al., End-to-end encrypted traffic classification with one-dimensional convolution neural networks, IEEE International Conference on Intelligence and Security Informatics, 2017, 43-48), and with a single random forest baseline classifier. The baseline classifier is an (N+1)-class random forest classifier with 30 trees and a maximum depth of 20. To implement 1D-CNN, the first 784 payload bytes of each flow are extracted to form a one-dimensional vector used as the model input, and an (N+1)-class 1D-CNN classifier is trained with the Keras library. The parameters of the one-dimensional convolutional neural network model are the same as in the original work. Fuzzy flow detection is not applied to the baseline classifier or the 1D-CNN classifier. For the three-layer classifier, the random forest models of the first two layers each contain 30 trees with a maximum tree depth of 20. Each random forest model of the third layer contains 20 trees with a maximum depth of 20. Each XGBoost model contains 10 trees with a maximum tree depth of 5.

Five evaluation metrics are used to evaluate the performance of the proposed classifier: the number of true positives TP (True Positive), the number of false positives FP (False Positive), the number of false negatives FN (False Negative), precision (Precision) and recall (Recall).
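For reference, these five metrics can be computed per application as in the sketch below, using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). Treating a withheld label (`None`) as a miss for the true application is an assumption of the sketch.

```python
def per_app_metrics(true_labels, predicted_labels, app):
    """Return TP, FP, FN, precision and recall for one application."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if p == app and t == app)
    fp = sum(1 for t, p in pairs if p == app and t != app)
    fn = sum(1 for t, p in pairs if p != app and t == app)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "precision": precision, "recall": recall}
```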

3) Performance test of each layer

The classification performance of each layer of the proposed multilayer classifier was tested on data set 1. Data set 1 was first randomly divided into a training set and a test set at a ratio of 7:3. Training and testing were then repeated 10 times, and the average results are reported. For the first-layer classifier, the precision and recall are 12.21% and 99.40% respectively. This matches expectations: the first layer can only exclude a small amount of background traffic, so its precision on target application traffic is low, but its recall is very high. The precision and recall of the second-layer and third-layer classifiers for each application are shown in Fig. 3. Fig. 3 shows that the classifiers have very high precision, which indicates an excellent ability to exclude background traffic. However, the recall of the second-layer classifier for the last three applications is very low. The reason is that, among the test samples of the second-layer classifier, 57.29%, 47.43% and 33.97% of the samples of the last three applications respectively are judged to be fuzzy flows. A close inspection of the training data set of the second-layer classifier shows that the last three applications are closely related to other non-target applications in data set 1. For example, QQ is a popular instant messaging application that integrates many functions, such as news push, mail management and music. However, each of these functions also has a corresponding standalone application, namely Tencent News, QQ Mail and QQ Music. As a result, the traffic generated by QQ is very likely to have features similar or identical to those of other background application traffic. To reduce misjudgments, the classifier preferentially identifies such traffic as fuzzy flows without giving a specific class, which lowers the recall for QQ traffic. Taobao and Baidu are in a similar situation.

4) Fuzzy flow detection test

The experiment above shows that identifying fuzzy flows separately has a large impact on the recall of the second-layer classifier. This experiment therefore compares the performance of the multilayer classifier with and without fuzzy flow detection. The data set used is data set 1, and the experimental method is the same as in the experiment above. The resulting multilayer classifier performance is compared in Fig. 4 and Fig. 5: Fig. 4 compares the precision and recall of the classifier with and without fuzzy flow detection, and Fig. 5 compares the number of false positives with and without fuzzy flow detection.

Fig. 4 and Fig. 5 show that without fuzzy flow detection the recall of every application rises, and the recall of the last three applications improves markedly. However, the number of false positives increases accordingly and the precision drops slightly. When the tolerance for false positives is low, the classifier with fuzzy flow detection can be chosen. When high recall is the priority, fuzzy flow detection can be removed.

5) Third-layer classifier threshold test

The third-layer classifier uses several binary classifiers to learn the features of target application traffic. If the label of a flow is decided only when all relevant binary classifiers give a consistent judgment, the classifier achieves high precision and a strong ability to eliminate false positives, but it is also likely to exclude true positives passed on by the second layer, which lowers the recall. Different decision thresholds of the third-layer classifier are therefore tested and compared here. The experimental setup is the same as in the previous two experiments, and the comparison is shown in Fig. 6. In the figure, the value of RF is the minimum number of random forest models that must give a classification result consistent with the second-layer classifier, and the value of XG is the minimum number of XGBoost classifiers that must give a classification result consistent with the second-layer classifier, before the label of a flow can be decided.

Fig. 6 shows that as the decision condition becomes stricter, the precision of the classifier gradually rises from 80% to 99% and the recall gradually drops from 66% to 53%. An increase in either the value of XG or the value of RF raises the precision and lowers the recall. When higher recall is required, the value of XG or RF can be reduced appropriately. When higher precision is required, the value of XG or RF can be increased.
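The threshold rule can be sketched as a vote count. The function and parameter names are hypothetical, and reading the RF and XG thresholds as conditions that must both hold is an interpretation made for the sketch.

```python
def accept(app, rf_votes, xgb_votes, rf_min, xg_min):
    """Decide whether the third layer confirms the second-layer label.

    `rf_votes` and `xgb_votes` are the labels returned by the relevant
    random forest and XGBoost binary classifiers for one flow.
    """
    rf_agree = sum(1 for v in rf_votes if v == app)
    xg_agree = sum(1 for v in xgb_votes if v == app)
    return rf_agree >= rf_min and xg_agree >= xg_min


# Strictest setting: every relevant classifier must agree.
rf = ["QQ", "QQ", "Other"]
xg = ["QQ", "Fuzzy", "QQ"]
strict = accept("QQ", rf, xg, rf_min=len(rf), xg_min=len(xg))
relaxed = accept("QQ", rf, xg, rf_min=2, xg_min=2)
```

Here `strict` is false and `relaxed` is true, which mirrors the trade-off in the text: lower thresholds accept more flows and raise recall at the cost of precision.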

6) classifier compares

Three-layer classification device and benchmark classifier and one-dimensional convolutional neural networks classifier (1D-CNN) of this experiment to proposition It is compared.It for the three-layer classification device of proposition, is detected using fuzzy stream, and the verification condition of third layer classifier is set as Most stringent of situation, i.e., the recognition result of all relevant classifiers all must keep one with the recognition result of second layer classifier It causes.

(1) data set 1 is tested

Data set 1 is randomly divided into training set and test set in the ratio of 7:3, trains and provide after testing 10 times test knot The average value of fruit.The three kinds of classifiers compared include the base classifier that single Random Forest model is constituted, what Wang et al. was proposed One-dimensional convolutional neural networks model (1D-CNN) (Wang etc., End-to-end encrypted traffic Classification with one-dimensional convolution neural networks (is based on one-dimensional convolution The End to End Encryption method for recognizing flux of neural network), IEEE International Conference on Intelligence and Security Informatics (IEEE information and security information meeting), 2017,43-48), with And three-layer classification device of the invention.Comparison result is as shown in four rows before table 3, to the accuracy of identification of each application, recall rate and puppet Positive number is respectively such as Fig. 7, Fig. 8, Fig. 9.

Table 3. Comparison of the classification performance of the four classifiers

The results show that the proposed classifier has the highest precision, achieving a recognition precision of nearly 99%, and produces far fewer false positives than the other two classifiers. Compared with the base classifier, the three-layer classifier proposed by the present invention reduces the number of false positives by 94%, which shows that the third-layer classifier has an outstanding ability to detect background traffic. However, because the recall of the proposed multilayer classifier on the last three applications is extremely low, its average recall is far below that of the base classifier. When the recognition goal is application coverage, a low recall does not affect the recognition result, but false positives must be as few as possible, so the method is well suited to such scenarios. When the recognition goal is stream coverage, the recall of the method still needs to be improved; in scenarios with definite requirements on recognition precision, however, the method retains a large advantage in precision.
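The three metrics discussed above can be computed per application as in the sketch below. It uses the standard definitions of precision and recall; the label "Rejected" for a refused judgment and the toy data are assumptions, not values from the experiments.

```python
def per_class_metrics(y_true, y_pred, app):
    """Return (precision, recall, false_positives) for one application."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == app and p == app)
    fp = sum(1 for t, p in pairs if t != app and p == app)
    fn = sum(1 for t, p in pairs if t == app and p != app)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fp


# Toy data: two of four "A" streams are rejected, none is falsely accepted.
y_true = ["A", "A", "A", "A", "Other", "Other"]
y_pred = ["A", "Rejected", "Rejected", "A", "Other", "Other"]
precision, recall, false_positives = per_class_metrics(y_true, y_pred, "A")
```

In this toy case precision is 1.0 while recall is 0.5, the same pattern of high precision and reduced recall that a strict verification condition produces.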

It is further noted that the last three applications, which have low recall, have fewer training samples than the first four. The classifier is therefore retrained, using the SMOTEENN method (Batista et al., A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter, 2004, 20-29) to resample each class during training so that the numbers of training samples in all classes are balanced, in order to examine the influence of sample quantity on the classification performance of the model. The retrained classifier is named "three-layer classifier + SMOTEENN", and its recognition results are given in the last row of Table 3 and in Fig. 6-8. It can be seen that sample quantity is not the main factor affecting the recall of the classifier. Although balancing the samples improves the recall to some extent, it has a larger negative effect on recognition precision.

(2) Test on data set 2

The classifier is trained on data set 1 and tested with data set 2 as the test set, to further verify the recognition performance of the classifier. The test results are shown in Table 4.

The results shown in Table 4 are similar to the test results on data set 1: compared with the other two classifiers, the three-layer classifier proposed by the present invention has the best ability to detect background traffic. On data set 2, the three-layer classifier produced 152 false positives in total, of which 45 streams came from new applications; that is, about 99.2% of the completely unknown traffic was excluded. In contrast, the random forest produced 1478 false positives, of which 403 streams came from new applications; 1D-CNN produced 3963 false positives, of which 1348 streams came from new applications. The remaining 107 false positives were then analyzed in detail. Of the 31 streams misclassified as Tencent Video, 3 came from QQ Music and 28 came from Tencent News. QQ Music, Tencent News, and Tencent Video are all Tencent applications and, because of their functional requirements, may readily access the same resources. In addition, for QQ, 46 false positives came from Tencent Maps, Tencent Weibo, and others, a situation similar to the false positives for Tencent Video. For the 1 false positive for Sogou Pinyin, the IP address of the stream was checked and compared with the other streams in the training set that have the same IP address, and it was found that all of the related streams in the training set are labeled Sogou Pinyin. This false-positive sample is therefore very likely a sample with an incorrect label.

Table 4. Test results on data set 2

Therefore, by learning the features of the target application traffic in a layer-by-layer refined manner, the present invention enhances the classifier's ability to detect background traffic, enabling it to detect traffic that it has never learned. The experimental results show that the method has high recognition precision and can effectively detect background traffic produced by unknown and emerging applications, which gives it a great advantage in recognition scenarios that require application coverage.

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently replaced without departing from the spirit and scope of the technical solution of the present invention.

Claims

1. A mobile application traffic identification method based on a multilayer classifier, characterized by comprising the following steps:

First step: extract the features of the traffic training set to obtain the feature representation of the traffic samples; each traffic sample is referred to as a stream;

Second step: train the first-layer classifier, which preliminarily detects a sample to be detected as target traffic or background traffic; the target traffic is denoted as the Target class and the background traffic as the Other class;

Third step: extract fuzzy streams, construct the training set of the second-layer classifier, and then train the second-layer classifier to perform fine-grained identification of the target traffic; a fuzzy stream refers to similar traffic that is generated by multiple applications; the i-th target application is denoted as Appi; the number of target applications is N, where N is a natural number;

Fourth step: extract background traffic samples again, construct the training set of the third-layer classifier, and then train the third-layer classifier;

Fifth step: use the trained multilayer classifier to perform mobile application traffic identification on the sample to be detected, as follows: first, the first-layer classifier identifies the traffic sample to be detected as the Target class or the Other class, and traffic samples identified as the Target class enter the second-layer classifier for further detection; then, the second-layer classifier identifies the Target-class sample at a fine granularity as a particular target application or as a fuzzy stream; if a sample is identified as a target application, it enters the third-layer classifier, where the relevant classifiers continue the identification; when the third-layer classifiers give consistent recognition results, the final recognition result is given; otherwise, judgment is refused.
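The identification cascade of the fifth step can be sketched as follows. This is a minimal illustration rather than the patented implementation: the three layers are passed in as plain callables, and the names `identify` and `verifiers_for`, the "Fuzzy" label, and the use of `None` for a refused judgment are assumptions made for the example.

```python
def identify(sample, layer1, layer2, verifiers_for):
    """Run one sample through the three layers and return its final label."""
    # Layer 1: coarse Target / Other decision.
    if layer1(sample) != "Target":
        return "Other"
    # Layer 2: fine-grained decision (a target application or a fuzzy stream).
    app = layer2(sample)
    if app == "Fuzzy":
        return "Fuzzy"
    # Layer 3: every relevant verifier must confirm the layer-2 result.
    votes = [verifier(sample) for verifier in verifiers_for(app)]
    return app if all(vote == app for vote in votes) else None


# Toy layers that read their answers from the sample itself.
layer1 = lambda s: "Target" if s["port"] == 443 else "Other"
layer2 = lambda s: s["guess"]
verifiers_for = lambda app: [lambda s: s["v1"], lambda s: s["v2"]]

confirmed = identify({"port": 443, "guess": "QQ", "v1": "QQ", "v2": "QQ"},
                     layer1, layer2, verifiers_for)
refused = identify({"port": 443, "guess": "QQ", "v1": "QQ", "v2": "Other"},
                   layer1, layer2, verifiers_for)
background = identify({"port": 80}, layer1, layer2, verifiers_for)
```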

2. The mobile application traffic identification method based on a multilayer classifier according to claim 1, characterized in that the features of the traffic training set are extracted in the first step as follows: first, the original traffic is grouped by the five-tuple <source IP, destination IP, source port, destination port, protocol> to form streams; if the number of packets with non-zero payload in a stream is less than or equal to five, the corresponding 29 traffic features are extracted from the whole stream; if the number of packets with non-zero payload in a stream is greater than five, the corresponding 29 traffic features are extracted from only the first five packets with non-zero payload, and the packets after the fifth packet with non-zero payload are discarded.
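The grouping and truncation rule of claim 2 can be sketched as below. The packet representation (a dict with the keys shown) is hypothetical. The sketch groups by the five-tuple exactly as written, without merging the two directions of a connection, and assumes that zero-payload packets seen before the fifth non-zero-payload packet are retained; the claim does not settle either point.

```python
from collections import defaultdict


def build_streams(packets):
    """Group packets into streams keyed by the five-tuple."""
    streams = defaultdict(list)
    for pkt in packets:
        key = (pkt["src_ip"], pkt["dst_ip"],
               pkt["src_port"], pkt["dst_port"], pkt["proto"])
        streams[key].append(pkt)
    return streams


def truncate_stream(stream, limit=5):
    """Keep packets up to and including the `limit`-th non-zero-payload one."""
    kept, nonzero_seen = [], 0
    for pkt in stream:
        kept.append(pkt)
        if pkt["payload_len"] > 0:
            nonzero_seen += 1
            if nonzero_seen == limit:
                break
    return kept


def make_packet(payload_len):
    # Helper for the toy example: all packets share one five-tuple.
    return {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 50000,
            "dst_port": 443, "proto": "TCP", "payload_len": payload_len}


packets = [make_packet(n) for n in (0, 10, 20, 0, 30, 40, 50, 60, 0)]
streams = build_streams(packets)
truncated = truncate_stream(next(iter(streams.values())))
```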

3. The mobile application traffic identification method based on a multilayer classifier according to claim 1, characterized in that the first-layer classifier in the second step is trained as follows: the labels of the training set samples are divided into the Target class and the Other class, and a random forest binary classifier is trained; during training, the weight of the target traffic samples is increased so that the classifier preferentially learns the features of the target samples.
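One way to express the weighting in claim 3 is a per-sample weight list, as sketched below. The weight value 2.0 and the function name are hypothetical, since the source does not state a value. Such a list could, for example, be supplied as the `sample_weight` argument when fitting a scikit-learn `RandomForestClassifier`, although the source does not name a specific library.

```python
def target_sample_weights(labels, target_weight=2.0):
    """Give Target samples a larger weight than Other samples.

    target_weight is an illustrative value, not one taken from the source.
    """
    return [target_weight if label == "Target" else 1.0 for label in labels]


weights = target_sample_weights(["Target", "Other", "Target"])
```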

4. The mobile application traffic identification method based on a multilayer classifier according to claim 1, characterized in that the second-layer classifier in the third step is trained as follows:

Step 3.1: extract fuzzy streams, specifically: first, the traffic sample set is grouped by the two-tuple <server IP, server port>; for each group, if the traffic samples in the group carry multiple application labels and the samples of no single application are dominant, the traffic in the group is regarded as fuzzy streams;

Step 3.2: construct the training sample set, specifically: the traffic samples of each target application that remain in the original training set after the fuzzy streams are extracted are combined with the fuzzy stream samples extracted in step 3.1 to form the training sample set of the second-layer classifier, which contains N+1 classes of samples in total;

Step 3.3: train the second-layer (N+1)-class random forest classifier.
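Steps 3.1 and 3.2 can be sketched as a relabeling pass over the training samples. The dominance threshold of 0.9 is an assumption, since the claim only requires that no single application be dominant; the field names, the "Fuzzy" label, and the function name are likewise hypothetical.

```python
from collections import Counter, defaultdict


def relabel_fuzzy(samples, dominance=0.9):
    """Relabel streams in mixed <server IP, server port> groups as 'Fuzzy'."""
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["server_ip"], sample["server_port"])].append(sample)

    relabeled = []
    for members in groups.values():
        counts = Counter(m["label"] for m in members)
        top_share = counts.most_common(1)[0][1] / len(members)
        # Fuzzy: several application labels and no dominant application.
        is_fuzzy = len(counts) > 1 and top_share < dominance
        for m in members:
            relabeled.append({**m, "label": "Fuzzy" if is_fuzzy else m["label"]})
    return relabeled


samples = [
    {"server_ip": "1.1.1.1", "server_port": 443, "label": "AppA"},
    {"server_ip": "1.1.1.1", "server_port": 443, "label": "AppB"},
    {"server_ip": "1.1.1.1", "server_port": 443, "label": "AppA"},
    {"server_ip": "2.2.2.2", "server_port": 80, "label": "AppA"},
    {"server_ip": "2.2.2.2", "server_port": 80, "label": "AppA"},
]
labels = [s["label"] for s in relabel_fuzzy(samples)]
```

The relabeled set then contains the N application classes plus the fuzzy class, which matches the N+1 classes used to train the second-layer classifier.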

5. The mobile application traffic identification method based on a multilayer classifier according to claim 1, characterized in that the third-layer classifier in the fourth step is trained as follows:

Step 4.1: extract Other-class samples, specifically: the first-layer classifier trained in the second step is used to classify the original training set, and the Other-class samples that are misclassified as the Target class are extracted to form a new Other-class data set;

Step 4.2: construct the training sample set, specifically: the Other data set extracted in step 4.1 is combined with the training set of the third step to form the training sample set of the third-layer classifier, which contains N+2 classes of samples in total, namely the N classes of target application samples, the fuzzy stream class, and the Other class;

Step 4.3: train the third-layer classifier, specifically: the random forest and XGBoost models are selected to train the third-layer classifier; for each model, (N+2)*(N+1)/2 binary classifiers are trained using the one-vs-one method.
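Steps 4.1 and 4.3 can be sketched as follows: collecting the Other-class samples that the first layer mistakes for Target, and enumerating the class pairs for one-vs-one training. The function names, the sample layout, and the "Fuzzy" label are hypothetical; the sketch only enumerates the pairs and does not train any model.

```python
from itertools import combinations


def hard_other_samples(samples, layer1):
    """Other-class samples that the first layer misclassifies as Target."""
    return [s for s in samples
            if s["label"] == "Other" and layer1(s) == "Target"]


def one_vs_one_pairs(apps):
    """All class pairs over the N applications plus Fuzzy and Other."""
    classes = list(apps) + ["Fuzzy", "Other"]
    return list(combinations(classes, 2))


# With N = 3 target applications there are (N+2)*(N+1)/2 = 10 pairs per model.
pairs = one_vs_one_pairs(["App1", "App2", "App3"])

toy_layer1 = lambda s: "Target" if s["port"] == 443 else "Other"
toy_samples = [{"label": "Other", "port": 443},
               {"label": "Other", "port": 80},
               {"label": "App1", "port": 443}]
hard = hard_other_samples(toy_samples, toy_layer1)
```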

6. The mobile application traffic identification method based on a multilayer classifier according to claim 1, characterized in that the mobile application traffic identification in the fifth step is performed as follows:

Step 5.1: classification by the first-layer classifier, specifically: the port and the 12 statistical features of the traffic sample are extracted and identified with the first-layer classifier; if the sample is identified as Target, go to step 5.2; otherwise, it is judged to be background traffic and the process ends;

Step 5.2: classification by the second-layer classifier, specifically: the port and the 12 statistical features of the traffic sample are extracted and identified with the second-layer classifier; if the sample is identified as a particular target application, go to step 5.3; if it is identified as a fuzzy stream, the process ends;

Step 5.3: classification by the third-layer classifier, specifically: the first 16 bytes of the traffic sample are extracted and identified with the 2*(N+1) classifiers of the third layer; when the recognition results of the 2*(N+1) classifiers are consistent with that of step 5.2, the sample is judged to be Appi; otherwise, the process ends.

7. The mobile application traffic identification method based on a multilayer classifier according to claim 2, characterized in that the 29 traffic features comprise the destination port, the first 16 payload bytes, and 12 statistical features of the stream; the 12 statistical features are: the maximum and minimum packet size from the client to the server; the maximum, minimum, mean, and variance of the packet size from the server to the client; the payload sizes of the first 3 packets with non-zero payload from the client to the server; the payload sizes of the 1st and 3rd packets with non-zero payload from the server to the client; and the minimum packet size of the stream.
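The 12 statistical features listed above can be computed as in the following sketch. The packet representation (direction, packet size, payload length), the use of 0 for missing values, and the use of the population variance are assumptions, as the claim does not specify them.

```python
from statistics import mean, pvariance


def statistical_features(packets):
    """packets: list of (direction, size, payload_len), direction 'c2s' or 's2c'."""
    c2s = [size for d, size, _ in packets if d == "c2s"]
    s2c = [size for d, size, _ in packets if d == "s2c"]
    c2s_payload = [p for d, _, p in packets if d == "c2s" and p > 0]
    s2c_payload = [p for d, _, p in packets if d == "s2c" and p > 0]

    def nth(values, n):
        # Missing entries are filled with 0 (an assumption).
        return values[n] if n < len(values) else 0

    return [
        max(c2s, default=0), min(c2s, default=0),      # client to server sizes
        max(s2c, default=0), min(s2c, default=0),      # server to client sizes
        mean(s2c) if s2c else 0,
        pvariance(s2c) if s2c else 0,
        nth(c2s_payload, 0), nth(c2s_payload, 1), nth(c2s_payload, 2),
        nth(s2c_payload, 0), nth(s2c_payload, 2),      # 1st and 3rd payloads
        min(c2s + s2c, default=0),                     # minimum packet size
    ]


features = statistical_features([
    ("c2s", 60, 0), ("c2s", 200, 140), ("s2c", 100, 40), ("s2c", 300, 240),
])
```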

8. The mobile application traffic identification method based on a multilayer classifier according to claim 3 or 4, characterized in that the features of the training set samples are the port and 12 statistical features; the 12 statistical features are: the maximum and minimum packet size from the client to the server; the maximum, minimum, mean, and variance of the packet size from the server to the client; the payload sizes of the first 3 packets with non-zero payload from the client to the server; the payload sizes of the 1st and 3rd packets with non-zero payload from the server to the client; and the minimum packet size of the stream.

9. The mobile application traffic identification method based on a multilayer classifier according to claim 6, characterized in that the 2*(N+1) classifiers in step 5.3 are N+1 random forest classifiers and N+1 XGBoost classifiers; the N+1 binary random forest classifiers comprise the binary classifiers trained on training sets consisting of Appi-class samples and Appj-class samples, where j is not equal to i, the binary classifier trained on the training set consisting of Appi-class samples and Other-class samples, and the binary classifier trained on the training set consisting of Appi-class samples and fuzzy-stream-class samples; the N+1 binary XGBoost classifiers comprise the binary classifiers trained on training sets consisting of Appi-class samples and Appj-class samples, where j is not equal to i, the binary classifier trained on the training set consisting of Appi-class samples and Other-class samples, and the binary classifier trained on the training set consisting of Appi-class samples and fuzzy-stream-class samples.
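The selection of the N+1 relevant pairwise classifiers for a given Appi can be sketched as a filter over the class pairs. The function name and the "Fuzzy" and "Other" labels are hypothetical; applying the same selection to both the random forest and the XGBoost models gives the 2*(N+1) verifiers of step 5.3.

```python
from itertools import combinations


def relevant_pairs(app, apps):
    """Class pairs whose binary classifier involves `app`."""
    classes = list(apps) + ["Fuzzy", "Other"]
    return [pair for pair in combinations(classes, 2) if app in pair]


apps = ["App1", "App2", "App3"]            # N = 3 target applications
selected = relevant_pairs("App2", apps)    # N + 1 = 4 pairs per model
total_verifiers = 2 * len(selected)        # random forest + XGBoost
```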