CN110460502B

CN110460502B - Application program flow identification method under VPN based on distributed feature random forest

Info

Publication number: CN110460502B
Application number: CN201910852780.5A
Authority: CN
Inventors: 杨超; 王岁兴; 苏锐丹; 郑昱; 任秋凝; 马建峰; 郭刚; 刘丙楠
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2019-09-10
Filing date: 2019-09-10
Publication date: 2022-03-04
Anticipated expiration: 2039-09-10
Also published as: CN110460502A

Abstract

The invention provides a method for identifying application program flow under a VPN based on distribution features and a random forest, which solves the technical problem of low accuracy in the prior art when identifying application program flow under the shadow shuttle VPN. The method comprises the following steps: preprocessing the flow data of a smart phone; acquiring a time feature vector set, a statistical feature vector set and distribution feature vector sets of the stream data set; combining these sets into a feature matrix; acquiring a training set, a shadow shuttle VPN training set and a test set; acquiring a two-classification model and a multi-classification model based on the random forest algorithm; and finally acquiring the identification result of the application program flow. By extracting the distribution features of the flow data to form two training sets and inputting them into the random forest algorithm to obtain two classification models, the method noticeably improves the accuracy of flow identification, and the use of two smaller models also improves the efficiency of identification.

Description

Application program flow identification method under VPN based on distributed feature random forest

Technical Field

The invention belongs to the technical field of networks, and relates to a method for identifying application program flow under a shadow shuttle VPN (virtual private network), in particular to a method for identifying application program flow under the shadow shuttle VPN based on distribution characteristics and a random forest algorithm.

Background

A Virtual Private Network (VPN) is a technology for establishing a private network over a public network to implement encrypted communication, and a common VPN is implemented based on the IPSec protocol. An IPSec-based VPN is divided into a client and a server. Key agreement is first carried out between the client and the server; the client then encrypts the network layer payload of each data packet, encapsulates the encrypted payload and adds a new network layer header, after which the client and the server communicate through the encrypted data packets. Illegal network actors use this technology to penetrate firewalls and evade their traffic identification and filtering. However, an IPSec-based VPN requires the communicating parties to first establish a key agreement session, a process with obvious characteristics, so traffic identification for IPSec-based VPNs is relatively mature. At present, illegal network actors often use another VPN technology, the shadow shuttle VPN, to penetrate firewalls.

The shadow shuttle VPN is a VPN based on an encrypted Socks5 protocol that can encrypt user communication and hide the identity of the user. It is divided into a client, a local server, a remote server and a target server, and communication proceeds as follows: the local server receives a data packet from the client, encrypts it and disguises it as an ordinary HTTPS data packet, and transmits it to the remote server over the external network; the remote server decrypts the data packet and forwards it to the real target server, thereby realizing communication between the client and the target server. Because the encryption uses a static key, there is no key negotiation between the local server and the remote server, and a transmitted data packet looks almost identical to an ordinary HTTPS data packet. In addition, an obfuscation algorithm strips the data packets of their characteristics so that all data packets have almost equal lengths, that is, the flow is highly homogenized. As a result, the flow of an application program using the shadow shuttle VPN is difficult for a firewall deployed on the external network to identify.

Using the shadow shuttle VPN to penetrate firewalls and bypass supervision has become an important means for illegal network actors to engage in illegal network activities such as data theft, pornography, network attacks and dark web transactions. Therefore, a method that identifies, from network flow, the application program a user accesses through the shadow shuttle VPN can be applied in a firewall identification system to automatically monitor, in real time, users who use the shadow shuttle VPN for illegal network activities, so that network security managers can respond to such activities quickly.

Current methods for identifying application program traffic under a VPN are usually based on traditional machine learning or deep learning. A method based on traditional machine learning usually extracts features from the original traffic data and assembles them into a feature matrix, namely a data set. The data set is randomly divided into a training set and a test set; the training set is input into a supervised or unsupervised learning algorithm to train a classification model; and the test set is then input into the classification model to obtain a prediction result for each flow, that is, to identify the application program from which the flow comes. A method based on deep learning differs in that features are not extracted manually: the algorithm automatically finds the features of the original data, trains a classification model and identifies the application program. Deep learning is not suitable for all situations, however, and traffic identification based on traditional machine learning performs better in certain specific situations.

For example, in 2017, the paper "End-to-End encrypted traffic classification with one-dimensional convolutional neural network", published by Wei Wang et al. of the University of Science and Technology of China at the IEEE International Conference on Intelligence and Security Informatics, discloses a method for identifying application program traffic under a VPN based on a one-dimensional convolutional neural network (1D-CNN). The method divides traffic data into streams, normalizes the original data and converts it into gray-scale images using a data visualization technique, uses the gray-scale images to find differences between traffic, extracts the payload data of the first several data packets of each stream, and builds a 1D-CNN model to identify application program traffic that uses a VPN, making a notable contribution to this field. However, when the method is applied to application program flow under the shadow shuttle VPN, the features extracted from the highly homogeneous flow do not clearly distinguish the application programs, so the identification accuracy is low. In addition, because a complex deep learning algorithm is adopted and only one multi-classification model containing all classes is built, the method has high algorithm complexity and long training time, and is difficult to use for real-time flow identification in a network detection system on a high-traffic system.

Disclosure of Invention

The invention aims to overcome the defects in the prior art, provides a method for identifying the application program flow under the shadow shuttle VPN based on the distribution characteristics and the random forest algorithm, and is used for solving the technical problem of low accuracy in identifying the application program flow under the shadow shuttle VPN in the prior art.

In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:

(1) preprocessing the flow data of the smart phone:

data packets other than lost and retransmitted data packets in the smart phone flow form an original data set; data packets in the original data set that have the same source IP address, destination IP address, source port number, destination port number and transport layer protocol are combined into stream data; and all the stream data form a stream data set $\mathrm{DataSet} = \{\mathrm{flow}_1, \mathrm{flow}_2, \ldots, \mathrm{flow}_k, \ldots, \mathrm{flow}_x\}$, where $\mathrm{flow}_k$ is the $k$-th stream data, $x$ is the total number of stream data, and $x \ge 10000$;

(2) acquiring a time characteristic vector set CT of a flow data set DataSet:

(2a) all the data packets in the stream data $\mathrm{flow}_k$ of the stream data set DataSet are arranged in ascending order of timestamp; the difference between the timestamp of the last data packet and the timestamp of the first data packet, namely the duration of $\mathrm{flow}_k$, is calculated; and the statistics of the time intervals between adjacent data packets are calculated: maximum, minimum, mean and standard deviation;

(2b) taking the data packet with the same sending direction as the first data packet in all the sequenced data packets as an input packet, taking the other data packets as output packets, and calculating the statistic value of the time interval between the adjacent data packets in the input packet and the statistic value of the time interval between the adjacent data packets in the output packet;

(2c) the duration of $\mathrm{flow}_k$, the statistics of the time intervals between adjacent data packets in $\mathrm{flow}_k$, the statistics of the time intervals between adjacent data packets among the input packets, and the statistics of the time intervals between adjacent data packets among the output packets constitute the time feature vector $\mathrm{CT}_k$ of $\mathrm{flow}_k$, and the time feature vectors corresponding to all stream data in the stream data set DataSet form the time feature vector set CT;

(3) acquiring a statistical characteristic vector set CS of a stream data set:

(3a) counting the number of all data packets in the stream data $\mathrm{flow}_k$ of the stream data set DataSet, and calculating the statistics of the lengths of all the data packets;

(3b) the number of all data packets in $\mathrm{flow}_k$ and the statistics of all data packet lengths in $\mathrm{flow}_k$ constitute the statistical feature vector $\mathrm{CS}_k$ of $\mathrm{flow}_k$, and the statistical feature vectors corresponding to all stream data in the stream data set DataSet form the statistical feature vector set CS;

(4) acquiring a uniformly distributed feature vector set CD1 of the stream data set DataSet:

(4a) setting a uniform distribution interval set USEC comprising m sub-intervals:

USEC＝{USEC₁,USEC₂,...,USEC_i,...,USEC_m}

wherein $\mathrm{USEC}_i$ denotes the $i$-th subinterval, $\mathrm{USEC}_i = ((i-1)d,\ id]$, $i \in \{1, 2, \ldots, m\}$, $20 \le m \le 200$; $d$ represents the step length, $d = \lceil L/m \rceil$; $L$ represents the maximum of all packet lengths in the original data set; and $\lceil \cdot \rceil$ represents rounding up;

(4b) acquiring the length of each data packet in the stream data $\mathrm{flow}_k$ of the stream data set DataSet, and arranging the numbers of data packets $\mathrm{Count}_i(\mathrm{USEC}_i)$ whose length satisfies $\mathrm{length} \in \mathrm{USEC}_i$ in ascending order of $i$ to form the uniform distribution feature vector $\mathrm{CD1}_k$; the uniform distribution feature vectors corresponding to all stream data in the stream data set form the uniform distribution feature vector set CD1;

(5) acquiring a log distribution characteristic vector set CD2 of the stream data set DataSet:

(5a) setting a logarithm distribution interval set LSEC comprising n subintervals:

LSEC＝{LSEC₁,LSEC₂,...,LSEC_j,...,LSEC_n}

wherein $\mathrm{LSEC}_j$ denotes the $j$-th subinterval, $\mathrm{LSEC}_j = (b^{j-1},\ b^{j}]$, $j \in \{1, 2, \ldots, n\}$, $\log_b L < n \le \log_b L + 1$ and $n$ is an integer; $b$ represents the base, and $1 < b \le 3$;

(5b) acquiring the length of each data packet in the stream data $\mathrm{flow}_k$ of the stream data set DataSet, and arranging the numbers of data packets $\mathrm{Count}_j(\mathrm{LSEC}_j)$ whose length satisfies $\mathrm{length} \in \mathrm{LSEC}_j$ in ascending order of $j$ to form the logarithmic distribution feature vector $\mathrm{CD2}_k$; the logarithmic distribution feature vectors corresponding to all stream data in the stream data set DataSet form the logarithmic distribution feature vector set CD2;

(6) acquiring a feature matrix C of the stream data set:

the feature vector $\mathrm{CT}_k$ in the time feature vector set CT, the feature vector $\mathrm{CS}_k$ in the statistical feature vector set CS, the feature vector $\mathrm{CD1}_k$ in the uniform distribution feature vector set CD1 and the feature vector $\mathrm{CD2}_k$ in the logarithmic distribution feature vector set CD2 are spliced horizontally to obtain the feature vector $C_k$ of the stream data $\mathrm{flow}_k$, and all feature vectors $C_1, C_2, \ldots, C_k, \ldots, C_x$ are spliced vertically to obtain the feature matrix C of the stream data set;

(7) acquiring a training set TR, a shadow shuttle VPN training set TRSS and a test set TE:

(7a) more than half of the feature vectors in the feature matrix C of the stream data set are randomly selected to form a feature vector set $\mathrm{TRS} = \{C_1, C_2, \ldots, C_s, \ldots, C_z\}$, and the remaining feature vectors form the test set TE, where $C_s$ represents the $s$-th feature vector in TRS and $z$ represents the total number of feature vectors in TRS;

(7b) adding a label L1 to each feature vector $C_s$, $L1 \in \{\mathrm{label}_0, \mathrm{label}_1\}$, where $\mathrm{label}_1$ indicates shadow shuttle VPN flow data and $\mathrm{label}_0$ indicates normal flow data that does not pass through the shadow shuttle VPN, and forming the training set TR from the z feature vectors and the label L1 of each feature vector;

(7c) copying all feature vectors in the training set TR whose label value is $\mathrm{label}_1$, changing the label L1 of each copied feature vector into a label L2 that denotes the application program from which the data packets in the corresponding stream data come, and then combining these feature vectors and the label L2 of each feature vector to form the shadow shuttle VPN training set TRSS;

(8) acquiring a two-classification model M1 and a multi-classification model M2 based on a random forest algorithm:

the training set TR and the shadow shuttle VPN training set TRSS are respectively taken as the input of the random forest algorithm to construct the two-classification model M1 and the multi-classification model M2;

(9) acquiring an identification result of the application program flow:

(9a) taking the test set TE as the input of the two-classification model M1 for prediction to obtain a prediction result set R consisting of the labels of the feature vectors, and forming the shadow shuttle VPN test set TESS from the feature vectors whose label value in R is $\mathrm{label}_1$;

(9b) taking the shadow shuttle VPN test set TESS as the input of the multi-classification model M2 for prediction to obtain the application program label corresponding to each feature vector in TESS.

Compared with the prior art, the invention has the following advantages:

1. The invention extracts several kinds of features from the stream data, including uniform distribution features and logarithmic distribution features, to construct the training sets, and inputs them into the random forest algorithm to train two classification models that identify application program flow under the shadow shuttle VPN. The extracted logarithmic distribution features reveal the marked difference between shadow shuttle VPN flow, which is highly homogeneous, has data packets of almost equal length and is therefore concentrated in the distribution, and ordinary flow, whose data packets have unequal lengths and are therefore more dispersed. The extracted uniform distribution features reveal the marked differences between the flows of different application programs under the shadow shuttle VPN through the different intervals in which their flows are concentrated. Classification models built by the random forest algorithm on these two distribution features can therefore identify the application program flow proxied by the shadow shuttle VPN more accurately, which solves the problem of low accuracy in the prior art when identifying such flow.

2. The invention establishes two smaller classification models that first separate shadow shuttle VPN flow from non-shadow-shuttle VPN flow and then identify the application program flow under the shadow shuttle VPN. This reduces the class number N in the random forest algorithm, whose algorithm complexity is $O(N^2)$, so the time required to establish the classification models falls quadratically. The scheme thereby solves the problems of high algorithm complexity and long training time of the prior-art approach of directly establishing one large multi-classification model containing all classes, improves the efficiency of establishing the classification models, and improves the efficiency of identifying the application program flow proxied by the shadow shuttle VPN.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention.

Detailed Description

The invention is described in further detail below with reference to the following figures and specific examples:

referring to fig. 1, the present invention includes the steps of:

step 1), acquiring flow data of the smart phone and preprocessing the flow data of the smart phone:

step 1a) a hotspot is started on a computer with a wireless network card, and an experimental smart phone is connected to the hotspot; Telegram, Facebook, YouTube, Whatsapp and Twitter are each run with the shadow shuttle VPN turned on, and other application programs are then run with the shadow shuttle VPN turned off; the smart phone flow passing through the wireless network card is captured on the computer with the Wireshark packet capture program and stored in separate pcap files by application program; each file is marked with the name of its application program so that labels can conveniently be added to the training set data later;

step 1b) data packets other than lost and retransmitted data packets in the smart phone flow form an original data set, and data packets in the original data set that have the same source IP address, destination IP address, source port number, destination port number and transport layer protocol are combined into stream data; all the stream data form a stream data set $\mathrm{DataSet} = \{\mathrm{flow}_1, \mathrm{flow}_2, \ldots, \mathrm{flow}_k, \ldots, \mathrm{flow}_x\}$, where $\mathrm{flow}_k$ is the $k$-th stream data and $x$ is the total number of stream data; in this embodiment $x = 20000$. The divided stream data serve as the data units for subsequent flow identification; obtaining stream data in this way makes the granularity of the data units as small as possible while preserving the flow characteristics of different application programs, so the identification sensitivity of the method is high;
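The grouping in step 1b) can be sketched as follows. This is a minimal illustration and not the patented implementation: the `Packet` record and its field names are hypothetical, removal of lost and retransmitted packets is assumed to have happened already, and the key is made direction-insensitive on the assumption that both directions of a connection belong to one stream (step 2b) later separates packets by sending direction within a stream).

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical parsed packet record; field names are illustrative."""
    ts: float       # capture timestamp in seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str      # transport layer protocol, e.g. "TCP"
    length: int     # packet length in bytes


def flow_key(p):
    # Assumption: order the two endpoints so both directions share one key.
    a, b = (p.src_ip, p.src_port), (p.dst_ip, p.dst_port)
    return (min(a, b), max(a, b), p.proto)


def group_flows(packets):
    """Group packets into streams keyed by the five-tuple, sorted by time."""
    flows = defaultdict(list)
    for p in packets:
        flows[flow_key(p)].append(p)
    for pkts in flows.values():
        pkts.sort(key=lambda p: p.ts)
    return dict(flows)


packets = [
    Packet(0.00, "10.0.0.2", "1.2.3.4", 50000, 443, "TCP", 120),
    Packet(0.05, "1.2.3.4", "10.0.0.2", 443, 50000, "TCP", 1400),
    Packet(0.10, "10.0.0.2", "5.6.7.8", 50001, 443, "TCP", 90),
]
flows = group_flows(packets)
```

The first two packets travel in opposite directions of the same connection and land in one stream; the third opens a second stream.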

step 2) acquiring a time characteristic vector set CT of the stream data set DataSet:

step 2a) all the data packets in the stream data $\mathrm{flow}_k$ of the stream data set DataSet are arranged in ascending order of timestamp; the difference between the timestamp of the last data packet and the timestamp of the first data packet, namely the duration of $\mathrm{flow}_k$, is calculated; and the statistics of the time intervals between adjacent data packets are calculated: maximum, minimum, mean and standard deviation;

step 2b) taking the data packet with the same sending direction as the first data packet in all the sequenced data packets as an input packet, taking the rest data packets as output packets, and calculating the statistic value of the time interval between the adjacent data packets in the input packet and the statistic value of the time interval between the adjacent data packets in the output packet;

step 2c) the duration of $\mathrm{flow}_k$, the statistics of the time intervals between adjacent data packets in $\mathrm{flow}_k$, the statistics of the time intervals between adjacent data packets among the input packets, and the statistics of the time intervals between adjacent data packets among the output packets constitute the time feature vector $\mathrm{CT}_k$ of $\mathrm{flow}_k$, and the time feature vectors corresponding to all stream data in the stream data set DataSet form the time feature vector set CT; the time features of the stream data help to distinguish application programs that send data frequently from application programs that send data intermittently;
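Steps 2a) to 2c) can be illustrated with the short sketch below. It assumes each packet is given as a hypothetical `(timestamp, direction)` pair, uses the population standard deviation, and returns zeros when a group has fewer than two packets; the patent does not specify these details.

```python
from statistics import mean, pstdev


def gap_stats(timestamps):
    """Max, min, mean and standard deviation of gaps between adjacent packets."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not gaps:
        # Assumption: a group with fewer than two packets contributes zeros.
        return [0.0, 0.0, 0.0, 0.0]
    return [max(gaps), min(gaps), mean(gaps), pstdev(gaps)]


def time_features(packets):
    """Build CT_k from a list of (timestamp, direction) pairs for one stream."""
    ordered = sorted(packets, key=lambda p: p[0])
    all_ts = [t for t, _ in ordered]
    first_dir = ordered[0][1]
    # Packets sent in the same direction as the first packet are "input" packets.
    in_ts = [t for t, d in ordered if d == first_dir]
    out_ts = [t for t, d in ordered if d != first_dir]
    duration = all_ts[-1] - all_ts[0]
    return [duration] + gap_stats(all_ts) + gap_stats(in_ts) + gap_stats(out_ts)


ct_k = time_features([(0.0, "up"), (1.0, "down"), (3.0, "up")])
```

The resulting vector has 13 entries: the duration followed by four statistics for all packets, for input packets and for output packets.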

step 3), acquiring a statistical characteristic vector set CS of the stream data set DataSet:

step 3a) counting the number of all data packets in the stream data $\mathrm{flow}_k$ of the stream data set DataSet, and calculating the statistics of all data packet lengths: average value, minimum value, maximum value, absolute difference, absolute median difference, standard deviation, variance, skewness, kurtosis, and the 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% percentiles. The per% percentile is calculated as follows: first, all $\mathrm{count}(\mathrm{flow}_k)$ data packet lengths are arranged in ascending order, and then the length of the $\lambda$-th data packet is taken, where $\lambda = \lceil \mathrm{per}\% \times \mathrm{count}(\mathrm{flow}_k) \rceil$ and $\lceil \cdot \rceil$ represents rounding up;

step 3b) the number of all data packets in $\mathrm{flow}_k$ and the statistics of all data packet lengths in $\mathrm{flow}_k$ constitute the statistical feature vector $\mathrm{CS}_k$ of $\mathrm{flow}_k$, and the statistical feature vectors corresponding to all stream data in the stream data set DataSet form the statistical feature vector set CS; the length statistics of the data packets reflect the distinguishing characteristics of different application program flows from angles such as the average flow and the variation in data packet length, so that the classification models later built by the random forest algorithm can identify application program flow more accurately;
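The percentile rule in step 3a) can be sketched as a nearest-rank lookup. The sketch assumes the rank is $\lambda = \lceil \mathrm{per}\% \times \mathrm{count}(\mathrm{flow}_k) \rceil$, which is how the rounding-up note is read here, and it computes only a subset of the listed statistics; the function names are illustrative.

```python
from statistics import mean, pstdev


def percentile(lengths, per):
    """Nearest-rank percentile: the ceil(per% * count)-th smallest length."""
    ordered = sorted(lengths)
    # Integer ceiling avoids floating-point surprises such as 0.7 * 10.
    rank = (per * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def length_features(lengths):
    """A subset of CS_k: count, mean, min, max, std, then the 10%..90% percentiles."""
    basic = [len(lengths), mean(lengths), min(lengths), max(lengths), pstdev(lengths)]
    return basic + [percentile(lengths, per) for per in range(10, 100, 10)]


lengths = [100, 20, 60, 40, 80, 30, 10, 90, 50, 70]
cs_k = length_features(lengths)
```

With ten packets, the 70% percentile is the 7th smallest length; the remaining statistics named in the text (variance, skewness, kurtosis and the absolute-difference measures) would be appended in the same way.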

step 4), acquiring a uniform distribution characteristic vector set CD1 of the stream data set DataSet:

step 4a) setting a uniformly distributed interval set USEC comprising m sub-intervals:

USEC＝{USEC₁,USEC₂,...,USEC_i,...,USEC_m}

wherein $\mathrm{USEC}_i$ denotes the $i$-th subinterval, $\mathrm{USEC}_i = ((i-1)d,\ id]$, $i \in \{1, 2, \ldots, m\}$, $20 \le m \le 200$; $d$ represents the step length, $d = \lceil L/m \rceil$; $L$ represents the maximum of all data packet lengths in the original data set; and $\lceil \cdot \rceil$ represents rounding up. The smaller m is, the fewer subintervals there are and the harder it is to reflect the distinguishing characteristics of different application program flows; the larger m is, the more subintervals there are and the more features the resulting training set has, so the model built with the random forest algorithm overfits more easily and its generalization ability decreases. In this embodiment m = 89, the optimal value obtained through repeated tuning;

step 4b) acquiring the length of each data packet in the stream data $\mathrm{flow}_k$ of the stream data set DataSet, and arranging the numbers of data packets $\mathrm{Count}_i(\mathrm{USEC}_i)$ whose length satisfies $\mathrm{length} \in \mathrm{USEC}_i$ in ascending order of $i$ to form the uniform distribution feature vector $\mathrm{CD1}_k$; the uniform distribution feature vectors corresponding to all stream data in the stream data set form the uniform distribution feature vector set CD1, which gives the distribution of the data packets of each stream over uniformly divided intervals. The lengths of the data packets generated by different application programs differ on the whole, and the uniform distribution better reflects the position of the intervals in which the data packet lengths are concentrated, which improves the accuracy with which the multi-classification model built by the random forest algorithm identifies the flows of different application programs under the shadow shuttle VPN. The uniform distribution features reflect the characteristics of flows with many short data packets and are better suited to identifying application programs such as chat and social media;
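A minimal sketch of the uniform distribution feature in step 4), assuming the step length is $d = \lceil L/m \rceil$ and using a hypothetical maximum packet length of 1500 bytes for the example (the patent takes L from the captured data):

```python
def uniform_histogram(lengths, max_len, m=89):
    """Count packets per interval ((i-1)d, id], i = 1..m, with d = ceil(L/m)."""
    d = -(-max_len // m)              # integer ceiling of max_len / m
    counts = [0] * m
    for length in lengths:
        i = -(-length // d)           # ceil(length / d) is the interval index
        if 1 <= i <= m:               # lengths of 0 fall outside every interval
            counts[i - 1] += 1
    return counts


# With L = 1500 and m = 89 the step length is d = 17.
cd1_k = uniform_histogram([17, 18, 1500], max_len=1500, m=89)
```

A length of 17 falls in the first interval (0, 17], 18 in the second, and 1500 in the last, so the vector always has m entries regardless of how many packets the stream contains.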

step 5) obtaining a logarithm distribution characteristic vector set CD2 of the stream data set DataSet:

step 5a) a log distribution interval set LSEC comprising n sub-intervals is set:

LSEC＝{LSEC₁,LSEC₂,...,LSEC_j,...,LSEC_n}

wherein $\mathrm{LSEC}_j$ denotes the $j$-th subinterval, $\mathrm{LSEC}_j = (b^{j-1},\ b^{j}]$, $j \in \{1, 2, \ldots, n\}$, $\log_b L < n \le \log_b L + 1$ and $n$ is an integer; $b$ represents the base, $1 < b \le 3$; in this embodiment $b = 2$, which yields a suitable number of subintervals;

step 5b) acquiring the length of each data packet in the stream data $\mathrm{flow}_k$ of the stream data set DataSet, and arranging the numbers of data packets $\mathrm{Count}_j(\mathrm{LSEC}_j)$ whose length satisfies $\mathrm{length} \in \mathrm{LSEC}_j$ in ascending order of $j$ to form the logarithmic distribution feature vector $\mathrm{CD2}_k$; the logarithmic distribution feature vectors corresponding to all stream data in the stream data set DataSet form the logarithmic distribution feature vector set CD2, which gives the distribution of the data packets of each stream over logarithmically divided intervals. Shadow shuttle VPN flow is highly homogeneous and its data packets have almost equal lengths, so it is concentrated in the logarithmic distribution, whereas the data packets of ordinary flow have unequal lengths, so it is dispersed in the logarithmic distribution. The logarithmic distribution therefore better reflects the distinguishing characteristics of shadow shuttle VPN flow and ordinary flow, which improves the accuracy with which the two-classification model built by the random forest algorithm separates the two. The logarithmic distribution features also reflect the characteristics of flows with many long data packets and are better suited to identifying application programs such as video and download;
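The logarithmic distribution feature in step 5) can be sketched in the same style. The example again assumes a hypothetical maximum packet length of 1500 bytes and takes n as the integer satisfying $\log_b L < n \le \log_b L + 1$, computed here as $\lfloor \log_b L \rfloor + 1$.

```python
import math


def log_histogram(lengths, max_len, b=2):
    """Count packets per interval (b**(j-1), b**j], j = 1..n."""
    n = math.floor(math.log(max_len, b)) + 1
    counts = [0] * n
    for length in lengths:
        for j in range(1, n + 1):
            if b ** (j - 1) < length <= b ** j:
                counts[j - 1] += 1
                break
    return counts


# With L = 1500 and b = 2, n = 11 and the last interval is (1024, 2048].
cd2_k = log_histogram([2, 3, 1400, 1500], max_len=1500, b=2)
```

Packets of nearly equal length fall into one or two intervals, while mixed lengths spread over many, which is the contrast the text relies on to separate shadow shuttle VPN flow from ordinary flow.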

step 6), acquiring a characteristic matrix C of the stream data set DataSet:

the feature vector $\mathrm{CT}_k$ in the time feature vector set CT, the feature vector $\mathrm{CS}_k$ in the statistical feature vector set CS, the feature vector $\mathrm{CD1}_k$ in the uniform distribution feature vector set CD1 and the feature vector $\mathrm{CD2}_k$ in the logarithmic distribution feature vector set CD2 are spliced horizontally to obtain the feature vector $C_k$ of the stream data $\mathrm{flow}_k$, and all feature vectors $C_1, C_2, \ldots, C_k, \ldots, C_x$ are spliced vertically to obtain the feature matrix C of the stream data set; the feature matrix C is the processed data set, in which each row represents the feature vector corresponding to one stream data and each column represents one feature;
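The splicing in step 6) amounts to concatenating the four per-stream vectors into one row and stacking the rows. A minimal sketch with plain lists follows; the toy values are made up purely to show the shapes.

```python
def build_feature_matrix(ct, cs, cd1, cd2):
    """Row k is CT_k + CS_k + CD1_k + CD2_k; the rows together form matrix C."""
    return [
        ct_k + cs_k + cd1_k + cd2_k
        for ct_k, cs_k, cd1_k, cd2_k in zip(ct, cs, cd1, cd2)
    ]


# Toy input for two streams (illustrative values only).
ct = [[3.0, 2.0], [1.0, 0.5]]
cs = [[3, 60.0], [2, 40.0]]
cd1 = [[1, 2, 0], [0, 2, 0]]
cd2 = [[3, 0], [1, 1]]
C = build_feature_matrix(ct, cs, cd1, cd2)
```

Each row of `C` corresponds to one stream and each column to one feature, matching the layout described in the text.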

step 7), acquiring a training set TR, a shadow shuttle VPN training set TRSS and a test set TE:

step 7a) 80% of the feature vectors in the feature matrix C of the stream data set are randomly selected to form a feature vector set $\mathrm{TRS} = \{C_1, C_2, \ldots, C_s, \ldots, C_z\}$, and the remaining 20% of the feature vectors form the test set TE, where $C_s$ represents the $s$-th feature vector in TRS and $z$ represents the total number of feature vectors in TRS;

step 7b) adding a label L1 to each feature vector $C_s$, $L1 \in \{\mathrm{label}_0, \mathrm{label}_1\}$, where $\mathrm{label}_1$ indicates shadow shuttle VPN flow data and $\mathrm{label}_0$ indicates normal flow data that does not pass through the shadow shuttle VPN, and forming the training set TR from the z feature vectors and the label L1 of each feature vector;

step 7c) copying all feature vectors in the training set TR whose label value is $\mathrm{label}_1$, changing the label L1 of each copied feature vector into a label L2 that denotes the application program from which the data packets in the corresponding stream data come, and then combining these feature vectors and the label L2 of each feature vector to form the shadow shuttle VPN training set TRSS;
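Step 7) can be sketched as follows, assuming each row carries a hypothetical `(features, is_vpn, app_name)` triple derived from the marks added to the pcap files in step 1a); the function name, the seed and the use of 1 and 0 for $\mathrm{label}_1$ and $\mathrm{label}_0$ are illustrative choices.

```python
import random


def split_and_label(rows, train_fraction=0.8, seed=0):
    """Return (TR, TRSS, TE) from rows of (features, is_vpn, app_name)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    z = int(len(shuffled) * train_fraction)
    train, test = shuffled[:z], shuffled[z:]
    # TR: binary label L1 (1 = shadow shuttle VPN flow, 0 = normal flow).
    tr = [(features, 1 if is_vpn else 0) for features, is_vpn, _ in train]
    # TRSS: only the VPN rows, relabelled with the application name (L2).
    trss = [(features, app) for features, is_vpn, app in train if is_vpn]
    # TE: the held-out feature vectors.
    te = [features for features, _, _ in test]
    return tr, trss, te


rows = [([i, i + 1], i % 2 == 0, "app%d" % (i % 3)) for i in range(10)]
TR, TRSS, TE = split_and_label(rows)
```

TRSS is a relabelled copy of the VPN-labelled part of TR, so the second model only ever sees shadow shuttle VPN streams.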

step 8) obtaining a two-classification model M1 and a multi-classification model M2 based on a random forest algorithm:

step 8a) obtaining a sub-training set TSET from the training set TR:

Q rounds of sampling with replacement are performed on the training set TR, with p feature vectors drawn in each round, to form Q sub-training sets; the Q sub-training sets form the sub-training set collection $\mathrm{TSET} = \{T_1, T_2, \ldots, T_r, \ldots, T_Q\}$, where $T_r$ is the $r$-th sub-training set and $T_r = \{C_{r1}, C_{r2}, \ldots, C_{rt}, \ldots, C_{rp}\}$, $C_{rt}$ is the $t$-th feature vector and $C_{rt} = (\mathrm{fea}_1, \mathrm{fea}_2, \ldots, \mathrm{fea}_u, \ldots, \mathrm{fea}_w)$, $\mathrm{fea}_u$ is the $u$-th feature and $w$ is the total number of features; in this embodiment $Q = 100$;

step 8b) generating a binary model M1 by using the sub training set TSET:

$o_r$ feature components are randomly selected from the feature vectors $C_{rt}$ in $T_r$ to form partial feature vectors $Co_r$; the partial feature vectors corresponding to the sub-training set $T_r$ constitute the partial-feature sub-training set $To_r$; $To_r$ is taken as the input of the decision tree algorithm to construct the decision tree $\mathrm{Tree}_r$; and the Q mutually independent decision trees corresponding to the sub-training set collection TSET form the model M1;

step 8c) steps 8a) to 8b) are realized by calling the random forest algorithm function RandomForestClassifier in the sklearn library of Python; the shadow shuttle VPN training set TRSS is then taken as the input of the random forest algorithm, and the multi-classification model M2 is constructed in the same way as steps 8a) to 8b). By establishing two smaller classification models that first separate shadow shuttle VPN flow from non-shadow-shuttle VPN flow and then identify the application program flow under the shadow shuttle VPN, the scheme reduces the class number N in the random forest algorithm, whose algorithm complexity is $O(N^2)$, so the time for establishing the classification models falls quadratically. This solves the problems of high algorithm complexity and long training time of the prior-art method of directly establishing one large multi-classification model containing all classes, improves the efficiency of establishing the classification models, and thereby improves the efficiency of identifying the application program flow proxied by the shadow shuttle VPN;
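The sampling described in steps 8a) and 8b) can be illustrated without any machine learning library. The sketch below only prepares the Q bootstrap samples and the per-tree feature subsets; fitting the decision trees is left out. The defaults (p equal to the training set size, about the square root of w features per tree) are assumptions, and library implementations such as scikit-learn's RandomForestClassifier choose candidate features at each split rather than once per tree.

```python
import random


def bootstrap_subsets(train, q=100, p=None, features_per_tree=None, seed=0):
    """Return q (sub_training_set, feature_indices) pairs drawn from train.

    train is a list of (feature_vector, label) pairs.
    """
    rng = random.Random(seed)
    p = p or len(train)                        # assumption: p = |TR|
    w = len(train[0][0])                       # total number of features
    o = features_per_tree or max(1, int(w ** 0.5))
    subsets = []
    for _ in range(q):
        # Sampling with replacement: the same row may be drawn more than once.
        sample = [rng.choice(train) for _ in range(p)]
        cols = sorted(rng.sample(range(w), o))
        reduced = [([vec[c] for c in cols], label) for vec, label in sample]
        subsets.append((reduced, cols))
    return subsets


train = [([k, k * 2, k * 3, k * 4], k % 2) for k in range(5)]
tset = bootstrap_subsets(train, q=3)
```

Each entry of `tset` would be handed to a decision tree learner to produce one tree, and the Q trees would then vote on the final label.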

step 9) obtaining the identification result of the application program flow:

step 9a) taking the test set TE as the input of the two-classification model M1 for prediction to obtain a prediction result set R consisting of the labels of the feature vectors, and forming the shadow shuttle VPN test set TESS from the feature vectors whose label value in R is $\mathrm{label}_1$;

step 9b) taking the shadow shuttle VPN test set TESS as the input of the multi-classification model M2 for prediction to obtain, for each feature vector in TESS, the label of the corresponding application program among Telegram, Facebook, YouTube, Whatsapp and Twitter.
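The two-stage prediction in step 9) can be sketched as a small cascade. The two model classes below are hypothetical stand-ins that merely expose a `predict` method, in the manner of scikit-learn estimators; in the embodiment M1 and M2 would be the trained random forest models.

```python
class ThresholdModel:
    """Stand-in for M1 (hypothetical): label 1 if the first feature is positive."""

    def predict(self, rows):
        return [1 if row[0] > 0 else 0 for row in rows]


class LookupModel:
    """Stand-in for M2 (hypothetical): the second feature indexes an app name."""

    def __init__(self, names):
        self.names = names

    def predict(self, rows):
        return [self.names[row[1]] for row in rows]


def cascade_predict(m1, m2, te, vpn_label=1):
    """Run M1 on TE, keep rows predicted as VPN (TESS), then run M2 on them."""
    r = list(m1.predict(te))
    tess_idx = [i for i, label in enumerate(r) if label == vpn_label]
    if not tess_idx:
        return {}
    apps = m2.predict([te[i] for i in tess_idx])
    return dict(zip(tess_idx, apps))


te = [[1, 0], [0, 1], [1, 1]]
result = cascade_predict(ThresholdModel(), LookupModel(["Telegram", "YouTube"]), te)
```

Only rows that M1 labels as shadow shuttle VPN flow reach M2, so `result` maps the index of each such row in TE to an application name and leaves the other rows out.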

The foregoing description is only an example of the present invention and should not be construed as limiting the invention in any way, and it will be apparent to those skilled in the art that various changes and modifications in form and detail may be made therein without departing from the principles and arrangements of the invention, but such changes and modifications are within the scope of the invention as defined by the appended claims.

Claims

1. A method for identifying application program flow under a shadow shuttle VPN based on distribution characteristics and a random forest algorithm is characterized by comprising the following steps:

(1) preprocessing the flow data of the smart phone:

(2) acquiring a time characteristic vector set CT of a flow data set DataSet:

(3) acquiring a statistical characteristic vector set CS of a stream data set:

USEC＝{USEC₁,USEC₂,...,USEC_i,...,USEC_m}

wherein $\mathrm{USEC}_i$ denotes the $i$-th subinterval, $\mathrm{USEC}_i = ((i-1)d,\ id]$, $i \in \{1, 2, \ldots, m\}$, $20 \le m \le 200$; $d$ represents the step length, $d = \lceil L/m \rceil$; $L$ represents the maximum of all packet lengths in the original data set; and $\lceil \cdot \rceil$ represents rounding up;

LSEC＝{LSEC₁,LSEC₂,...,LSEC_j,...,LSEC_n}

(6) acquiring a feature matrix C of the stream data set:

the feature vector $\mathrm{CT}_k$ in the time feature vector set CT, the feature vector $\mathrm{CS}_k$ in the statistical feature vector set CS, the feature vector $\mathrm{CD1}_k$ in the uniform distribution feature vector set CD1 and the feature vector $\mathrm{CD2}_k$ in the logarithmic distribution feature vector set CD2 are spliced horizontally to obtain the feature vector $C_k$ of the stream data $\mathrm{flow}_k$, and all feature vectors $C_1, C_2, \ldots, C_k, \ldots, C_x$ are spliced vertically to obtain the feature matrix C of the stream data set;

respectively taking a training set TR and a shadow shuttle VPN training set TRSS as the input of a random forest algorithm, and constructing a two-classification model M1 and a multi-classification model M2, wherein the implementation steps are as follows:

(8a) acquiring a sub-training set TSET from the training set TR:

by calling the random forest algorithm function RandomForestClassifier in the sklearn library of Python, the training set TR is taken as the input of the random forest algorithm; Q rounds of sampling with replacement are performed on the training set TR, with p feature vectors drawn in each round, to form Q sub-training sets; the Q sub-training sets form the sub-training set collection $\mathrm{TSET} = \{T_1, T_2, \ldots, T_r, \ldots, T_Q\}$, where $T_r$ is the $r$-th sub-training set and $T_r = \{C_{r1}, C_{r2}, \ldots, C_{rt}, \ldots, C_{rp}\}$, $C_{rt}$ is the $t$-th feature vector and $C_{rt} = (\mathrm{fea}_1, \mathrm{fea}_2, \ldots, \mathrm{fea}_u, \ldots, \mathrm{fea}_w)$, $\mathrm{fea}_u$ is the $u$-th feature, and $w$ is the total number of features;

(8b) generating a binary model M1 by using the sub training set TSET:

(8c) by calling the random forest algorithm function RandomForestClassifier in the sklearn library of Python, the shadow shuttle VPN training set TRSS is taken as the input of the random forest algorithm, and the multi-classification model M2 is constructed in the same manner as steps (8a) to (8b);

(9) acquiring an identification result of the application program flow:

2. The method for identifying the flow of the application program under the shadow shuttle VPN based on the distribution characteristics and the random forest algorithm as claimed in claim 1, wherein the statistical values of the packet lengths in the step (3a) comprise an average value, a minimum value, a maximum value, an absolute difference, an absolute median difference, a standard deviation, a variance, a skew, a kurtosis, a 10% percentile, a 20% percentile, a 30% percentile, a 40% percentile, a 50% percentile, a 60% percentile, a 70% percentile, an 80% percentile and a 90% percentile of the packet lengths.