CN110460502B - Application program flow identification method under VPN based on distributed feature random forest - Google Patents


Publication number
CN110460502B
Authority
CN
China
Prior art keywords
flow
data
vpn
feature
shuttle
Prior art date
Legal status
Active
Application number
CN201910852780.5A
Other languages
Chinese (zh)
Other versions
CN110460502A (en)
Inventor
杨超
王岁兴
苏锐丹
郑昱
任秋凝
马建峰
郭刚
刘丙楠
Current Assignee
Xidian University
Original Assignee
Xidian University
Application filed by Xidian University
Priority to CN201910852780.5A
Publication of CN110460502A
Application granted
Publication of CN110460502B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/02 Capturing of monitoring data
    • H04L43/026 Capturing of monitoring data using flow identification
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/06 Generation of reports
    • H04L43/062 Generation of reports related to network traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/06 Generation of reports
    • H04L43/067 Generation of reports using time frame reporting
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/08 Testing, supervising or monitoring using real traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a method for identifying application traffic under a VPN based on distribution features and random forest, which addresses the low accuracy of the prior art when identifying application traffic under the shadow shuttle VPN. The method comprises the following steps: preprocessing smartphone traffic data; acquiring a time feature vector set, a statistical feature vector set and distribution feature vector sets of the flow data set; combining these sets into a feature matrix; acquiring a training set, a shadow shuttle VPN training set and a test set; obtaining a binary classification model and a multi-class classification model based on the random forest algorithm; and finally obtaining the identification result of the application traffic. By extracting distribution features of the traffic data to form two training sets and inputting them to the random forest algorithm to obtain two classification models, the method markedly improves the accuracy of traffic identification under the shadow shuttle VPN, and the use of two smaller models also improves the efficiency of identification.

Description

Application program flow identification method under VPN based on distributed feature random forest
Technical Field
The invention belongs to the field of network technology and relates to a method for identifying application traffic under the shadow shuttle VPN (virtual private network), in particular to a method for identifying application traffic under the shadow shuttle VPN based on distribution features and the random forest algorithm.
Background
A Virtual Private Network (VPN) is a technology that establishes a private network over a public network to provide encrypted communication; a common VPN is implemented on the IPSec protocol. An IPSec-based VPN consists of a client and a server. The two first perform key agreement; the client then encrypts the network-layer payload of each data packet, encapsulates the encrypted payload and adds a new network-layer header, after which client and server communicate through the encrypted packets. Illegal network actors use this technology to penetrate firewalls and evade their traffic identification and filtering. However, an IPSec-based VPN requires the communicating parties to first establish a key-agreement session, and this process has distinctive characteristics, so traffic identification for IPSec-based VPNs is relatively mature. At present, illegal network actors often use another VPN technology, the shadow shuttle VPN, to penetrate firewalls.
The shadow shuttle VPN is a VPN based on an encrypted Socks5 protocol that encrypts user communication and hides the user's identity. It consists of a client, a local server, a remote server and a target server. Communication proceeds as follows: the local server receives a data packet from the client, encrypts it and disguises it as an ordinary HTTPS packet, and transmits it over the external network to the remote server; the remote server decrypts the packet and forwards it to the real target server, thereby realizing communication between the client and the target server. Because encryption uses a static key, there is no key negotiation between the local server and the remote server, and the transmitted packets look almost identical to ordinary HTTPS packets. In addition, an obfuscation algorithm strips the packets of distinguishing characteristics so that all packet lengths are almost equal, i.e. the traffic is highly homogenized. As a result, the traffic of applications using the shadow shuttle VPN is difficult for a firewall on the external network to identify.
Using the shadow shuttle VPN to penetrate firewalls and bypass supervision and inspection has become an important means for illegal network actors to engage in illegal network activities such as data theft, pornography, network attacks and dark-web transactions. Therefore, a method that uses network traffic to identify which application a user is accessing through the shadow shuttle VPN could be applied in a firewall identification system to automatically monitor, in real time, users who conduct illegal network activities through the shadow shuttle VPN, so that network security managers can respond to such activities quickly.
Current methods for identifying application traffic under a VPN are usually based on traditional machine learning or deep learning. A method based on traditional machine learning usually extracts features from the raw traffic data; the features form a feature matrix, i.e. a data set, which is randomly divided into a training set and a test set. The training set is input to a supervised or unsupervised learning algorithm to train a classification model, and the test set is then input to the model to obtain a prediction for each flow, i.e. to identify the application from which the traffic comes. A method based on deep learning differs in that features are not extracted manually; instead, the algorithm discovers features of the raw data automatically, trains a classification model and identifies the application. Deep learning is not suitable for every situation, however, and identification based on traditional machine learning performs better in certain cases.
For example, in 2017 Wei Wang et al. of the University of Science and Technology of China published the paper "End-to-End encrypted traffic classification with one-dimensional convolutional neural network" at the IEEE International Conference on Intelligence and Security Informatics, disclosing a method for identifying application traffic under a VPN based on a one-dimensional convolutional neural network (1D-CNN). The method divides traffic data into flows, normalizes the raw data and converts it into grayscale images using a data visualization technique, uses the grayscale images to find differences between traffic, extracts the payload data of the first several packets of each flow, builds a 1D-CNN model and identifies the traffic of applications using a VPN, making a notable contribution to this field. However, when this method is applied to application traffic under the shadow shuttle VPN, the features it extracts from the highly homogenized traffic cannot clearly distinguish applications, so its identification accuracy is low. In addition, because it uses a complex deep learning algorithm and builds only one multi-class model containing all classes, it has high algorithmic complexity and a long training time, and is difficult to use for real-time traffic identification in a network detection system on a high-traffic system.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, provides a method for identifying the application program flow under the shadow shuttle VPN based on the distribution characteristics and the random forest algorithm, and is used for solving the technical problem of low accuracy in identifying the application program flow under the shadow shuttle VPN in the prior art.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) preprocessing the smartphone traffic data:
forming an original data set from the data packets in the smartphone traffic other than lost and retransmitted packets, combining the packets in the original data set that have the same source IP address, destination IP address, source port number, destination port number and transport-layer protocol into flow data, and forming all the flow data into a flow data set $DataSet = \{flow_1, flow_2, \ldots, flow_k, \ldots, flow_x\}$, where $flow_k$ is the k-th flow data, $x$ is the total number of flow data, and $x \ge 10000$;
(2) acquiring a time feature vector set CT of the flow data set DataSet:
(2a) arranging all data packets of flow data $flow_k$ in the flow data set DataSet in ascending order of timestamp, calculating the difference between the timestamp of the last packet and that of the first packet, i.e. the duration of $flow_k$, and calculating the statistics of the time intervals between adjacent packets: maximum, minimum, mean and standard deviation;
(2b) taking the packets among the sorted packets that have the same sending direction as the first packet as input packets and the remaining packets as output packets, and calculating the statistics of the time intervals between adjacent input packets and the statistics of the time intervals between adjacent output packets;
(2c) forming the time feature vector $CT_k$ of $flow_k$ from the duration of $flow_k$, the statistics of the time intervals between adjacent packets in $flow_k$, the statistics of the time intervals between adjacent input packets and the statistics of the time intervals between adjacent output packets, the time feature vectors of all flow data in DataSet forming the time feature vector set CT;
(3) acquiring a statistical feature vector set CS of the flow data set DataSet:
(3a) counting the number of data packets in flow data $flow_k$ of the flow data set DataSet and calculating the statistics of the lengths of all its packets;
(3b) forming the statistical feature vector $CS_k$ of $flow_k$ from the number of packets in $flow_k$ and the statistics of the packet lengths in $flow_k$, the statistical feature vectors of all flow data in DataSet forming the statistical feature vector set CS;
(4) acquiring a uniform distribution feature vector set CD1 of the flow data set DataSet:
(4a) setting a uniform distribution interval set USEC comprising m sub-intervals:
$USEC = \{USEC_1, USEC_2, \ldots, USEC_i, \ldots, USEC_m\}$
where $USEC_i$ denotes the i-th sub-interval, $USEC_i = ((i-1)d,\ id]$, $i \in \{1, 2, \ldots, m\}$, $20 \le m \le 200$, $d$ denotes the step length, $d = \lceil L/m \rceil$, $L$ denotes the maximum of all packet lengths in the original data set, and $\lceil \cdot \rceil$ denotes rounding up;
(4b) acquiring the length of each data packet in flow data $flow_k$ of the flow data set DataSet, counting the number of packets $Count_i(USEC_i)$ whose length satisfies $length \in USEC_i$, and arranging the counts in ascending order of i to form the uniform distribution feature vector $CD1_k$, the uniform distribution feature vectors of all flow data in DataSet forming the uniform distribution feature vector set CD1;
(5) acquiring a logarithmic distribution feature vector set CD2 of the flow data set DataSet:
(5a) setting a logarithmic distribution interval set LSEC comprising n sub-intervals:
$LSEC = \{LSEC_1, LSEC_2, \ldots, LSEC_j, \ldots, LSEC_n\}$
where $LSEC_j$ denotes the j-th sub-interval, $LSEC_j = (b^{j-1},\ b^j]$, $j \in \{1, 2, \ldots, n\}$, $\log_b L < n \le \log_b L + 1$ with $n$ an integer, $b$ denotes the base, and $1 < b \le 3$;
(5b) acquiring the length of each data packet in flow data $flow_k$ of the flow data set DataSet, counting the number of packets $Count_j(LSEC_j)$ whose length satisfies $length \in LSEC_j$, and arranging the counts in ascending order of j to form the logarithmic distribution feature vector $CD2_k$, the logarithmic distribution feature vectors of all flow data in DataSet forming the logarithmic distribution feature vector set CD2;
(6) acquiring a feature matrix C of the flow data set DataSet:
horizontally concatenating the feature vector $CT_k$ from the time feature vector set CT, the feature vector $CS_k$ from the statistical feature vector set CS, the feature vector $CD1_k$ from the uniform distribution feature vector set CD1 and the feature vector $CD2_k$ from the logarithmic distribution feature vector set CD2 to obtain the feature vector $C_k$ of flow data $flow_k$, and vertically concatenating all feature vectors $C_1, C_2, \ldots, C_k, \ldots, C_x$ to obtain the feature matrix C of the flow data set;
(7) acquiring a training set TR, a shadow shuttle VPN training set TRSS and a test set TE:
(7a) forming a feature vector set $TRS = \{C_1, C_2, \ldots, C_s, \ldots, C_z\}$ from more than half of the feature vectors, randomly selected from the feature matrix C of the flow data set, and forming the test set TE from the remaining feature vectors, where $C_s$ denotes the s-th feature vector in TRS and $z$ denotes the total number of feature vectors in TRS;
(7b) adding a label L1 to each feature vector $C_s$, $L1 \in \{label_0, label_1\}$, where $label_1$ denotes shadow shuttle VPN traffic data and $label_0$ denotes normal, non-shadow-shuttle-VPN traffic data, and forming the training set TR from the z feature vectors and the label L1 of each feature vector;
(7c) copying all feature vectors in the training set TR whose label is $label_1$, changing the label L1 of each copied feature vector to the application L2 from which the packets of the corresponding flow data come, and forming the shadow shuttle VPN training set TRSS from these feature vectors and the label L2 of each;
(8) obtaining a binary classification model M1 and a multi-class classification model M2 based on the random forest algorithm:
using the training set TR and the shadow shuttle VPN training set TRSS respectively as the input of the random forest algorithm to construct the binary classification model M1 and the multi-class classification model M2;
(9) obtaining the identification result of the application traffic:
(9a) using the test set TE as the input of the binary classification model M1 to obtain a prediction result set R consisting of the labels of the feature vectors, and forming the shadow shuttle VPN test set TESS from the feature vectors whose label in R is $label_1$;
(9b) using the shadow shuttle VPN test set TESS as the input of the multi-class classification model M2 to obtain the application label of each feature vector in TESS.
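The two-stage prediction of step (9) can be sketched as follows. This is only an illustrative sketch: `cascade_predict`, `toy_m1` and `toy_m2` are hypothetical names, and the two toy functions are stand-ins for the trained models M1 and M2, not the models described in the text.

```python
from typing import Callable, Dict, List, Sequence

LABEL_0 = 0  # normal traffic (not shadow shuttle VPN)
LABEL_1 = 1  # shadow shuttle VPN traffic

Row = Sequence[float]


def cascade_predict(
    te: List[Row],
    m1_predict: Callable[[List[Row]], List[int]],
    m2_predict: Callable[[List[Row]], List[str]],
) -> Dict[int, str]:
    """Map the index of each TE row flagged by M1 to the application predicted by M2."""
    r = m1_predict(te)  # prediction result set R
    tess_idx = [i for i, lab in enumerate(r) if lab == LABEL_1]
    tess = [te[i] for i in tess_idx]  # shadow shuttle VPN test set TESS
    apps = m2_predict(tess) if tess else []
    return dict(zip(tess_idx, apps))


# Hypothetical stand-ins for trained models, only to make the sketch runnable.
def toy_m1(rows):
    return [LABEL_1 if row[0] > 0.5 else LABEL_0 for row in rows]


def toy_m2(rows):
    return ["Telegram" if row[1] < 1.0 else "YouTube" for row in rows]


result = cascade_predict([[0.9, 0.2], [0.1, 5.0], [0.8, 3.0]], toy_m1, toy_m2)
```

Only the rows that the first stage labels as $label_1$ reach the second stage, so the multi-class model never has to represent the "normal traffic" class.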
Compared with the prior art, the invention has the following advantages:
1. The invention extracts multiple features of the flow data, including uniform distribution features and logarithmic distribution features, to construct the training sets, and inputs them to the random forest algorithm to train two classification models that identify application traffic under the shadow shuttle VPN. The logarithmic distribution features expose the clear difference between shadow shuttle VPN traffic, which is highly homogenized with almost equal packet lengths and is therefore concentrated in the distribution, and ordinary traffic, whose packet lengths are unequal and which is therefore more dispersed. The uniform distribution features expose the differences between the traffic of different applications under the shadow shuttle VPN through the different intervals in which their packet lengths concentrate. A classification model built by the random forest algorithm on these two kinds of distribution features therefore identifies the traffic of applications proxied by the shadow shuttle VPN more accurately, which solves the low accuracy of the prior art on such traffic.
2. The invention builds two smaller classification models that first separate shadow shuttle VPN traffic from non-shadow-shuttle-VPN traffic and then identify the application traffic under the shadow shuttle VPN. This reduces the number of classes N handled by each random forest, whose algorithmic complexity is $O(N^2)$, so the time required to build the classification models falls quadratically. It thereby avoids the high algorithmic complexity and long training time of the prior-art approach of directly building one large multi-class model containing all classes, improves the efficiency of building the classification models, and improves the efficiency of identifying the traffic of applications proxied by the shadow shuttle VPN.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawing and a specific embodiment.
Referring to FIG. 1, the invention includes the following steps:
Step 1) acquiring the smartphone traffic data and preprocessing it:
Step 1a) enabling a hotspot on a computer equipped with a wireless network card and connecting the experimental smartphone to the hotspot; running Telegram, Facebook, YouTube, WhatsApp and Twitter in turn with the shadow shuttle VPN enabled, then running other applications with the shadow shuttle VPN disabled; capturing on the computer, with the Wireshark packet capture program, the smartphone traffic passing through the wireless network card and saving it in separate pcap files per application; each file is tagged with the name of its application so that labels can be added to the training set data later;
Step 1b) forming an original data set from the data packets in the smartphone traffic other than lost and retransmitted packets, and combining the packets in the original data set that have the same source IP address, destination IP address, source port number, destination port number and transport-layer protocol into flow data, all the flow data forming a flow data set $DataSet = \{flow_1, flow_2, \ldots, flow_k, \ldots, flow_x\}$, where $flow_k$ is the k-th flow data and $x$ is the total number of flow data; in this embodiment $x = 20000$. The divided flow data serve as the data units for subsequent traffic identification; obtaining flow data in this way makes the granularity of the data unit as small as possible while retaining the traffic characteristics of different applications, so the identification sensitivity of the method is high;
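The grouping in step 1b) can be sketched as below. The `Packet` record and the function names are hypothetical, and lost or retransmitted packets are assumed to have been filtered out already. Because step 2b) later distinguishes two sending directions inside one flow, the sketch assumes that both directions of a connection share a flow key; the text does not state this explicitly.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    ts: float       # capture timestamp in seconds
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str      # transport-layer protocol, e.g. "TCP"
    length: int     # packet length in bytes


def flow_key(p: Packet) -> Tuple:
    # Sort the two endpoints so both directions map to the same key (assumption).
    a, b = sorted([(p.src_ip, p.src_port), (p.dst_ip, p.dst_port)])
    return (p.proto, a, b)


def group_into_flows(packets: List[Packet]) -> Dict[Tuple, List[Packet]]:
    """Group packets that share addresses, ports and protocol into flow data."""
    flows = defaultdict(list)
    for p in packets:
        flows[flow_key(p)].append(p)
    return dict(flows)


packets = [
    Packet(0.00, "10.0.0.2", 40000, "203.0.113.5", 443, "TCP", 120),
    Packet(0.05, "203.0.113.5", 443, "10.0.0.2", 40000, "TCP", 1400),
    Packet(0.10, "10.0.0.2", 40001, "203.0.113.5", 443, "TCP", 90),
]
dataset = group_into_flows(packets)
```

Here the first two packets travel in opposite directions of the same connection and land in one flow, while the third uses a different source port and starts a second flow.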
Step 2) acquiring the time feature vector set CT of the flow data set DataSet:
Step 2a) arranging all data packets of flow data $flow_k$ in the flow data set DataSet in ascending order of timestamp, calculating the difference between the timestamp of the last packet and that of the first packet, i.e. the duration of $flow_k$, and calculating the statistics of the time intervals between adjacent packets: maximum, minimum, mean and standard deviation;
Step 2b) taking the packets among the sorted packets that have the same sending direction as the first packet as input packets and the remaining packets as output packets, and calculating the statistics of the time intervals between adjacent input packets and the statistics of the time intervals between adjacent output packets;
Step 2c) forming the time feature vector $CT_k$ of $flow_k$ from the duration of $flow_k$, the statistics of the time intervals between adjacent packets in $flow_k$, the statistics of the time intervals between adjacent input packets and the statistics of the time intervals between adjacent output packets, the time feature vectors of all flow data in DataSet forming the time feature vector set CT; the time features obtained in this way help distinguish applications that send data frequently from applications that send data intermittently;
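A minimal sketch of the time features of step 2) follows. The function names are hypothetical, each packet is reduced to a `(timestamp, direction)` pair, and two details are assumptions that the text leaves open: the population standard deviation is used, and the statistics are set to zero when a group has fewer than two packets.

```python
import statistics
from typing import Hashable, List, Tuple


def interval_stats(timestamps: List[float]) -> List[float]:
    """Maximum, minimum, mean and standard deviation of gaps between adjacent timestamps."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not gaps:  # fewer than two packets: zero-filled (assumption)
        return [0.0, 0.0, 0.0, 0.0]
    return [max(gaps), min(gaps), statistics.fmean(gaps), statistics.pstdev(gaps)]


def time_features(packets: List[Tuple[float, Hashable]]) -> List[float]:
    """Build CT_k for one non-empty flow given (timestamp, direction) pairs."""
    pkts = sorted(packets, key=lambda p: p[0])
    ts = [t for t, _ in pkts]
    first_dir = pkts[0][1]
    inp = [t for t, d in pkts if d == first_dir]   # same direction as first packet
    out = [t for t, d in pkts if d != first_dir]   # remaining packets
    duration = ts[-1] - ts[0]
    return [duration] + interval_stats(ts) + interval_stats(inp) + interval_stats(out)


ct_k = time_features([(0.0, "up"), (1.0, "down"), (3.0, "up"), (6.0, "down")])
```

The resulting vector has 13 components: the duration followed by four statistics each for all packets, input packets and output packets.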
Step 3) acquiring the statistical feature vector set CS of the flow data set DataSet:
Step 3a) counting the number of data packets in flow data $flow_k$ of the flow data set DataSet and calculating the statistics of the lengths of all its packets: mean, minimum, maximum, absolute deviation, median absolute deviation, standard deviation, variance, skewness, kurtosis, and the 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% percentiles; the per% percentile is calculated as follows: first arrange the lengths of all $count(flow_k)$ packets in ascending order, then take the length of the $\lambda$-th packet, where $\lambda = \lceil per\% \times count(flow_k) \rceil$ and $\lceil \cdot \rceil$ denotes rounding up;
Step 3b) forming the statistical feature vector $CS_k$ of $flow_k$ from the number of packets in $flow_k$ and the statistics of the packet lengths in $flow_k$, the statistical feature vectors of all flow data in DataSet forming the statistical feature vector set CS; the length statistics obtained in this way reflect the distinguishing characteristics of the traffic of different applications in terms of average traffic volume, variation in packet length and so on, so that the classification model later built by the random forest algorithm can identify application traffic more accurately;
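The percentile rule of step 3a) can be sketched as follows, assuming that the rank is $\lambda = \lceil per\% \times count(flow_k) \rceil$, which matches the rounding-up note in the text. The function names are hypothetical, and only a subset of the listed statistics is computed (the absolute deviations, skewness and kurtosis are omitted to keep the sketch short).

```python
import statistics
from typing import List


def percentile(lengths: List[int], per: int) -> int:
    """Length of the lambda-th packet in ascending order, lambda = ceil(per% * count)."""
    ordered = sorted(lengths)
    lam = (per * len(ordered) + 99) // 100   # integer form of ceil(per / 100 * count)
    return ordered[max(lam, 1) - 1]          # ranks are 1-based


def length_features(lengths: List[int]) -> list:
    """Partial CS_k: count, mean, min, max, std, variance, then the 10%..90% percentiles."""
    basic = [
        len(lengths),
        statistics.fmean(lengths),
        min(lengths),
        max(lengths),
        statistics.pstdev(lengths),      # population form (assumption)
        statistics.pvariance(lengths),
    ]
    return basic + [percentile(lengths, per) for per in range(10, 100, 10)]


cs_k = length_features(list(range(100, 1100, 100)))  # lengths 100, 200, ..., 1000
```

Integer arithmetic is used for the rank so that the result does not depend on floating-point rounding.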
Step 4) acquiring the uniform distribution feature vector set CD1 of the flow data set DataSet:
Step 4a) setting a uniform distribution interval set USEC comprising m sub-intervals:
$USEC = \{USEC_1, USEC_2, \ldots, USEC_i, \ldots, USEC_m\}$
where $USEC_i$ denotes the i-th sub-interval, $USEC_i = ((i-1)d,\ id]$, $i \in \{1, 2, \ldots, m\}$, $20 \le m \le 200$, $d$ denotes the step length, $d = \lceil L/m \rceil$, and $L$ denotes the maximum of all packet lengths in the original data set. The smaller m is, the fewer sub-intervals there are and the harder it is to reflect the distinguishing characteristics of the traffic of different applications; the larger m is, the more sub-intervals there are and the more features the later training set has, so the model built with the random forest algorithm overfits more easily and generalizes worse. Therefore m = 89 is taken, the optimal value obtained through repeated tuning;
Step 4b) acquiring the length of each data packet in flow data $flow_k$ of the flow data set DataSet, counting the number of packets $Count_i(USEC_i)$ whose length satisfies $length \in USEC_i$, and arranging the counts in ascending order of i to form the uniform distribution feature vector $CD1_k$, the uniform distribution feature vectors of all flow data in DataSet forming the uniform distribution feature vector set CD1. This gives the distribution of the packets of each flow over uniformly divided intervals. The packet lengths produced by different applications differ overall, and the uniform distribution reflects well the positions of the intervals in which packet lengths concentrate, which improves the accuracy with which the multi-class classification model built by the random forest algorithm identifies the traffic of different applications under the shadow shuttle VPN. The uniform distribution features also capture traffic with a large number of short packets, which makes them better suited to identifying applications such as chat and social media;
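The histogram of step 4) can be sketched as below. The function name is hypothetical, packet lengths are taken to be positive integers, and the step length is assumed to be $d = \lceil L/m \rceil$, which is consistent with the rounding-up note in the text.

```python
from typing import List


def uniform_histogram(lengths: List[int], max_len: int, m: int = 89) -> List[int]:
    """Build CD1_k: packet counts over the m sub-intervals ((i-1)d, id]."""
    d = -(-max_len // m)              # step length d = ceil(L / m), integer arithmetic
    counts = [0] * m
    for length in lengths:
        if 0 < length <= m * d:
            i = -(-length // d)       # 1-based index i with length in ((i-1)d, id]
            counts[i - 1] += 1
    return counts


# Small example: L = 100 and m = 20 give d = 5.
cd1_k = uniform_histogram([1, 5, 6, 100], max_len=100, m=20)
```

With the embodiment's m = 89 and, for illustration, a maximum length of 1500 bytes, the step length would be 17 and a 1500-byte packet would fall into the last sub-interval.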
Step 5) acquiring the logarithmic distribution feature vector set CD2 of the flow data set DataSet:
Step 5a) setting a logarithmic distribution interval set LSEC comprising n sub-intervals:
$LSEC = \{LSEC_1, LSEC_2, \ldots, LSEC_j, \ldots, LSEC_n\}$
where $LSEC_j$ denotes the j-th sub-interval, $LSEC_j = (b^{j-1},\ b^j]$, $j \in \{1, 2, \ldots, n\}$, $\log_b L < n \le \log_b L + 1$ with $n$ an integer, $b$ denotes the base, and $1 < b \le 3$; here $b = 2$ is taken, which yields a suitable number of sub-intervals;
Step 5b) acquiring the length of each data packet in flow data $flow_k$ of the flow data set DataSet, counting the number of packets $Count_j(LSEC_j)$ whose length satisfies $length \in LSEC_j$, and arranging the counts in ascending order of j to form the logarithmic distribution feature vector $CD2_k$, the logarithmic distribution feature vectors of all flow data in DataSet forming the logarithmic distribution feature vector set CD2. This gives the distribution of the packets of each flow over logarithmically divided intervals. Shadow shuttle VPN traffic is highly homogenized, with almost equal packet lengths, so it is concentrated in the logarithmic distribution, whereas the packet lengths of ordinary traffic are unequal, so it is dispersed. The logarithmic distribution therefore reflects well the difference between shadow shuttle VPN traffic and ordinary traffic, which improves the accuracy with which the binary classification model built by the random forest algorithm separates the two. The logarithmic distribution features also capture traffic with a large number of long packets, which makes them better suited to identifying applications such as video and download;
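The logarithmic histogram of step 5) can be sketched in the same style. The function name is hypothetical and an integer base is assumed so that the boundaries $b^j$ are exact; n is chosen as the smallest integer with $b^n > L$, which satisfies $\log_b L < n \le \log_b L + 1$.

```python
from typing import List


def log_histogram(lengths: List[int], max_len: int, b: int = 2) -> List[int]:
    """Build CD2_k: packet counts over the n sub-intervals (b^(j-1), b^j]."""
    n = 1
    while b ** n <= max_len:          # smallest n with b^n > L
        n += 1
    counts = [0] * n
    for length in lengths:
        # Lengths of 1 or less lie outside (b^0, b^1] and are not counted.
        if length <= 1 or length > b ** n:
            continue
        j = 1
        while b ** j < length:        # find j with b^(j-1) < length <= b^j
            j += 1
        counts[j - 1] += 1
    return counts


cd2_k = log_histogram([2, 3, 4, 1024, 1025, 1500], max_len=1500)
```

For a maximum length of 1500 and b = 2 this gives n = 11 sub-intervals, far fewer than the uniform histogram, so near-equal packet lengths collapse into one or two bins.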
Step 6) acquiring the feature matrix C of the flow data set DataSet:
horizontally concatenating the feature vector $CT_k$ from the time feature vector set CT, the feature vector $CS_k$ from the statistical feature vector set CS, the feature vector $CD1_k$ from the uniform distribution feature vector set CD1 and the feature vector $CD2_k$ from the logarithmic distribution feature vector set CD2 to obtain the feature vector $C_k$ of flow data $flow_k$, and vertically concatenating all feature vectors $C_1, C_2, \ldots, C_k, \ldots, C_x$ to obtain the feature matrix C of the flow data set, where the feature matrix C is the processed data set, each row representing the feature vector of one flow data and each column representing one feature;
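The concatenation of step 6) amounts to joining the four per-flow vectors into one row and stacking the rows. A minimal sketch with a hypothetical function name and made-up toy values:

```python
from typing import List, Sequence


def build_feature_matrix(
    ct: Sequence[Sequence[float]],
    cs: Sequence[Sequence[float]],
    cd1: Sequence[Sequence[float]],
    cd2: Sequence[Sequence[float]],
) -> List[list]:
    """Row k is CT_k + CS_k + CD1_k + CD2_k; rows are stacked in flow order."""
    return [list(a) + list(b) + list(c) + list(d)
            for a, b, c, d in zip(ct, cs, cd1, cd2)]


# Toy values for two flows, for illustration only.
C = build_feature_matrix(
    ct=[[1.0, 2.0], [3.0, 4.0]],
    cs=[[5], [6]],
    cd1=[[0, 1], [1, 0]],
    cd2=[[2], [3]],
)
```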
step 7), acquiring a training set TR, a shadow shuttle VPN training set TRSS and a test set TE:
step 7a) randomly selecting 80% of the feature vectors in the feature matrix C of the stream data set to form the feature vector set TRS = {C_1, C_2, ..., C_s, ..., C_z}, and forming the test set TE from the remaining 20% of the feature vectors, where C_s represents the s-th feature vector in TRS and z represents the total number of feature vectors in TRS;
step 7b) adding a label L1 to each feature vector C_s, L1 ∈ {label_0, label_1}, where label_1 indicates shadow shuttle virtual private network (VPN) flow data and label_0 indicates ordinary, non-shadow-shuttle VPN flow data, and forming the training set TR from the z feature vectors and the label L1 of each feature vector;
step 7c) copying all feature vectors in the training set TR whose label value is label_1, replacing the label L1 of each copied feature vector with a label L2 naming the application program from which the data packets of the corresponding stream data come, and forming the shadow shuttle VPN training set TRSS from these feature vectors and the label L2 of each feature vector;
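Steps 7a) to 7c) may be sketched as follows. This is only an illustration: the function name build_training_sets, the arguments is_ss and apps, the fixed seed, and the encoding of label_1 as 1 and label_0 as 0 are all assumptions that do not come from the original text.

```python
import random


def build_training_sets(rows, is_ss, apps, train_fraction=0.8, seed=0):
    """Split the feature matrix and attach the two kinds of labels.

    rows[k]  : feature vector C_k
    is_ss[k] : True if flow k is shadow shuttle VPN flow (assumed ground truth)
    apps[k]  : application name of flow k (only used when is_ss[k] is True)
    """
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)          # random selection of step 7a)
    cut = int(len(idx) * train_fraction)
    train_idx, test_idx = idx[:cut], idx[cut:]

    # Step 7b): label L1, here 1 for label_1 and 0 for label_0.
    TR = [(rows[k], 1 if is_ss[k] else 0) for k in train_idx]
    # Step 7c): copy the label_1 vectors and relabel them with L2.
    TRSS = [(rows[k], apps[k]) for k in train_idx if is_ss[k]]
    TE = [rows[k] for k in test_idx]
    return TR, TRSS, TE
```

TR then feeds the binary model and TRSS the multi-classification model, so both models are trained from the same 80% of the flows.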
step 8) obtaining a two-classification model M1 and a multi-classification model M2 based on a random forest algorithm:
step 8a) obtaining a sub-training set TSET from the training set TR:
drawing p feature vectors from the training set TR by sampling with replacement, and repeating the draw Q times to form Q sub-training sets, which together make up the sub-training set collection TSET = {T_1, T_2, ..., T_r, ..., T_Q}, where T_r is the r-th sub-training set and T_r = {C_r1, C_r2, ..., C_rt, ..., C_rp}, C_rt is the t-th feature vector and C_rt = (fea_1, fea_2, ..., fea_u, ..., fea_w), fea_u is the u-th feature, w is the total number of features, and Q = 100;
step 8b) generating a binary model M1 by using the sub training set TSET:
from each feature vector C_rt in T_r, o_r features are randomly selected to form a partial feature vector Co_r; the partial feature vectors corresponding to the sub-training set T_r constitute the partial feature sub-training set To_r; To_r is used as the input of the decision tree algorithm to construct the decision tree Tree_r; and the Q mutually independent decision trees corresponding to the sub-training set collection TSET form the model M1;
step 8c) steps 8a) to 8b) are realized by calling the random forest function RandomForestClassifier in the sklearn library of Python; the shadow shuttle VPN training set TRSS is then used as the input of the random forest algorithm, and the multi-classification model M2 is constructed in the same way as steps 8a) to 8b). The scheme thus builds two smaller classification models, first separating shadow shuttle VPN flow from non-shadow-shuttle VPN flow and then identifying the application program flow under the shadow shuttle VPN. Because the complexity of the random forest algorithm is O(N^2), where N is the number of classes, splitting the task in this way can reduce the time needed to build the classification models quadratically. This addresses the high algorithm complexity and long training time of the prior-art approach of directly building one large multi-classification model containing all classes, improves the efficiency of building the classification models, and thereby improves the efficiency of identifying the flow of application programs proxied by the shadow shuttle VPN;
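For orientation only, the resampling described in steps 8a) and 8b) can be written out in plain Python as below. The helper names bootstrap_subsets and project_features are hypothetical. In scikit-learn, RandomForestClassifier performs the resampling internally, with its n_estimators parameter playing the role of Q; it also draws a fresh feature subset at every split (max_features) rather than once per tree, so this sketch follows the wording of the text and not the library internals.

```python
import random


def bootstrap_subsets(train_rows, Q=100, p=None, seed=0):
    """Step 8a): draw Q sub-training sets of p vectors, with replacement."""
    rng = random.Random(seed)
    p = len(train_rows) if p is None else p
    return [[rng.choice(train_rows) for _ in range(p)] for _ in range(Q)]


def project_features(subset, o, rng):
    """Step 8b): keep o randomly chosen features of every vector in T_r.

    Returns the chosen feature indices and the partial-feature set To_r.
    """
    w = len(subset[0])
    chosen = sorted(rng.sample(range(w), o))
    return chosen, [[row[u] for u in chosen] for row in subset]
```

Each partial-feature set To_r would then be handed to a decision tree learner, and the Q resulting trees vote on the final label.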
step 9) obtaining the identification result of the application program flow:
step 9a) using the test set TE as the input of the binary classification model M1 for prediction to obtain a prediction result set R consisting of the labels of the feature vectors, and forming the shadow shuttle VPN test set TESS from the feature vectors whose label value in R is label_1;
step 9b) using the shadow shuttle VPN test set TESS as the input of the multi-classification model M2 for prediction to obtain, for each feature vector in TESS, the label of the corresponding application program: Telegram, Facebook, YouTube, WhatsApp or Twitter.
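The two-stage prediction of step 9) might look like the following sketch. The function cascade_predict and the two callables m1_predict and m2_predict are stand-ins introduced here for illustration; in practice they would be the predict methods of the trained models M1 and M2, and label_1 is assumed to be encoded as 1.

```python
def cascade_predict(test_rows, m1_predict, m2_predict):
    """Run M1 on every test vector, then M2 only on predicted label_1 rows.

    Returns one entry per test vector: an application label for rows that
    M1 marks as shadow shuttle VPN flow, otherwise None.
    """
    r = m1_predict(test_rows)                      # prediction result set R
    tess_idx = [k for k, label in enumerate(r) if label == 1]
    tess = [test_rows[k] for k in tess_idx]        # test set TESS
    apps = m2_predict(tess) if tess else []
    result = [None] * len(test_rows)
    for k, app in zip(tess_idx, apps):
        result[k] = app
    return result
```

Only the vectors that pass the first stage reach M2, which is why M2 never needs a class for ordinary flow.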
The foregoing description is only an example of the present invention and should not be construed as limiting the invention in any way, and it will be apparent to those skilled in the art that various changes and modifications in form and detail may be made therein without departing from the principles and arrangements of the invention, but such changes and modifications are within the scope of the invention as defined by the appended claims.

Claims (2)

1. A method for identifying application program flow under a shadow shuttle VPN based on distribution characteristics and a random forest algorithm is characterized by comprising the following steps:
(1) preprocessing the flow data of the smart phone:
forming an original data set from the data packets in the smart phone flow other than lost and retransmitted data packets, grouping the data packets in the original data set that have the same source IP address, destination IP address, source port number, destination port number and transport layer protocol into one stream data, and forming the stream data set DataSet = {flow_1, flow_2, ..., flow_k, ..., flow_x} from all the stream data, where flow_k is the k-th stream data, x is the total number of stream data, and x ≥ 10000;
(2) acquiring a time characteristic vector set CT of a flow data set DataSet:
(2a) arranging all data packets of the flow data flow_k in the stream data set DataSet in ascending order of timestamp, calculating the difference between the timestamp of the last data packet and the timestamp of the first data packet, namely the duration of flow_k, and calculating the statistics of the time intervals between adjacent data packets: maximum, minimum, mean and standard deviation;
(2b) taking the data packet with the same sending direction as the first data packet in all the sequenced data packets as an input packet, taking the other data packets as output packets, and calculating the statistic value of the time interval between the adjacent data packets in the input packet and the statistic value of the time interval between the adjacent data packets in the output packet;
(2c) forming the time feature vector CT_k of flow_k from the duration of flow_k, the statistics of the time intervals between adjacent data packets in flow_k, the statistics of the time intervals between adjacent data packets in the input packets and the statistics of the time intervals between adjacent data packets in the output packets, the time feature vectors corresponding to the stream data set DataSet forming the time feature vector set CT;
(3) acquiring a statistical characteristic vector set CS of a stream data set:
(3a) counting the number of all data packets in the flow data flow_k in the stream data set DataSet, and calculating the statistics of the lengths of all data packets;
(3b) forming the statistical feature vector CS_k of flow_k from the number of all data packets in flow_k and the statistics of the lengths of all data packets in flow_k, the statistical feature vectors corresponding to the stream data set DataSet forming the statistical feature vector set CS;
(4) acquiring a uniformly distributed feature vector set CD1 of the stream data set DataSet:
(4a) setting a uniform distribution interval set USEC comprising m sub-intervals:
USEC = {USEC_1, USEC_2, ..., USEC_i, ..., USEC_m}
wherein USEC_i denotes the i-th subinterval, USEC_i = ((i-1)d, id], i ∈ {1, 2, ..., m}, 20 ≤ m ≤ 200, d denotes the step length, d = ⌈L/m⌉, L denotes the maximum of all data packet lengths in the original data set, and ⌈·⌉ denotes rounding up;
(4b) acquiring the length of each data packet in the flow data flow_k in the stream data set DataSet, and arranging the numbers Count_i(USEC_i) of data packets whose length ∈ USEC_i in ascending order of i to form the uniformly distributed feature vector CD1_k, the uniformly distributed feature vectors corresponding to the stream data set forming the uniformly distributed feature vector set CD1;
(5) acquiring a log distribution characteristic vector set CD2 of the stream data set DataSet:
(5a) setting a logarithm distribution interval set LSEC comprising n subintervals:
LSEC = {LSEC_1, LSEC_2, ..., LSEC_j, ..., LSEC_n}
wherein LSEC_j denotes the j-th subinterval, LSEC_j = (b^(j-1), b^j], j ∈ {1, 2, ..., n}, log_b L < n ≤ log_b L + 1 with n an integer, b denotes the base, and 1 < b ≤ 3;
(5b) acquiring the length of each data packet in the flow data flow_k in the stream data set DataSet, and arranging the numbers Count_j(LSEC_j) of data packets whose length ∈ LSEC_j in ascending order of j to form the log distribution feature vector CD2_k, the log distribution feature vectors corresponding to the stream data set DataSet forming the log distribution feature vector set CD2;
(6) acquiring a feature matrix C of the stream data set:
splicing horizontally the feature vector CT_k in the time feature vector set CT, the feature vector CS_k in the statistical feature vector set CS, the feature vector CD1_k in the uniformly distributed feature vector set CD1 and the feature vector CD2_k in the log distribution feature vector set CD2 to obtain the feature vector C_k of the flow data flow_k, and splicing vertically all feature vectors C_1, C_2, ..., C_k, ..., C_x to obtain the feature matrix C of the stream data set;
(7) acquiring a training set TR, a shadow shuttle VPN training set TRSS and a test set TE:
(7a) randomly selecting more than half of the feature vectors in the feature matrix C of the stream data set to form the feature vector set TRS = {C_1, C_2, ..., C_s, ..., C_z}, and forming the test set TE from the remaining feature vectors, where C_s represents the s-th feature vector in TRS and z represents the total number of feature vectors in TRS;
(7b) adding a label L1 to each feature vector C_s, L1 ∈ {label_0, label_1}, where label_1 indicates shadow shuttle virtual private network (VPN) flow data and label_0 indicates ordinary, non-shadow-shuttle VPN flow data, and forming the training set TR from the z feature vectors and the label L1 of each feature vector;
(7c) copying all feature vectors in the training set TR whose label value is label_1, replacing the label L1 of each copied feature vector with a label L2 naming the application program from which the data packets of the corresponding stream data come, and forming the shadow shuttle VPN training set TRSS from these feature vectors and the label L2 of each feature vector;
(8) acquiring a two-classification model M1 and a multi-classification model M2 based on a random forest algorithm:
respectively taking a training set TR and a shadow shuttle VPN training set TRSS as the input of a random forest algorithm, and constructing a two-classification model M1 and a multi-classification model M2, wherein the implementation steps are as follows:
(8a) acquiring a sub-training set TSET from the training set TR:
by calling the random forest function RandomForestClassifier in the sklearn library of Python with the training set TR as the input of the random forest algorithm, drawing p feature vectors from the training set TR by sampling with replacement, and repeating the draw Q times to form Q sub-training sets, which together make up the sub-training set collection TSET = {T_1, T_2, ..., T_r, ..., T_Q}, where T_r is the r-th sub-training set and T_r = {C_r1, C_r2, ..., C_rt, ..., C_rp}, C_rt is the t-th feature vector and C_rt = (fea_1, fea_2, ..., fea_u, ..., fea_w), fea_u is the u-th feature, and w is the total number of features;
(8b) generating a binary model M1 by using the sub training set TSET:
randomly selecting o_r features from each feature vector C_rt in T_r to form a partial feature vector Co_r, the partial feature vectors corresponding to the sub-training set T_r constituting the partial feature sub-training set To_r, using To_r as the input of the decision tree algorithm to construct the decision tree Tree_r, and forming the model M1 from the Q mutually independent decision trees corresponding to the sub-training set collection TSET;
(8c) by calling the random forest function RandomForestClassifier in the sklearn library of Python with the shadow shuttle VPN training set TRSS as the input of the random forest algorithm, constructing the multi-classification model M2 in the same manner as steps (8a) to (8b);
(9) acquiring an identification result of the application program flow:
(9a) using the test set TE as the input of the binary classification model M1 for prediction to obtain a prediction result set R consisting of the labels of the feature vectors, and forming the shadow shuttle VPN test set TESS from the feature vectors whose label value in R is label_1;
(9b) using the shadow shuttle VPN test set TESS as the input of the multi-classification model M2 for prediction to obtain the application program label corresponding to each feature vector in TESS.
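By way of illustration only, steps (1) and (2) of claim 1 could be sketched in Python as follows. The packet dictionaries with keys src, sport, dst, dport, proto and ts, and the function names group_flows, interval_stats and time_features, are assumptions of this sketch. It also assumes that the two directions of a connection are merged into one flow by sorting the endpoint pair, which the direction split in step (2b) appears to require, that the standard deviation is the population form, and that flows with fewer than two packets in a direction get zero-valued interval statistics.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_flows(packets):
    """Group packets by endpoints and protocol (both directions together)."""
    flows = defaultdict(list)
    for pkt in packets:
        ends = tuple(sorted([(pkt["src"], pkt["sport"]),
                             (pkt["dst"], pkt["dport"])]))
        flows[(ends, pkt["proto"])].append(pkt)
    return flows


def interval_stats(timestamps):
    """Maximum, minimum, mean and standard deviation of adjacent gaps."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not gaps:
        return [0.0, 0.0, 0.0, 0.0]   # assumed fill value
    return [max(gaps), min(gaps), mean(gaps), pstdev(gaps)]


def time_features(flow):
    """Build CT_k: duration plus interval statistics (all, input, output)."""
    pkts = sorted(flow, key=lambda p: p["ts"])
    first_dir = (pkts[0]["src"], pkts[0]["sport"])
    ts_all = [p["ts"] for p in pkts]
    ts_in = [p["ts"] for p in pkts if (p["src"], p["sport"]) == first_dir]
    ts_out = [p["ts"] for p in pkts if (p["src"], p["sport"]) != first_dir]
    duration = ts_all[-1] - ts_all[0]
    return ([duration] + interval_stats(ts_all)
            + interval_stats(ts_in) + interval_stats(ts_out))
```

Under these assumptions each flow yields a 13-element time feature vector: one duration and three groups of four interval statistics.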
2. The method for identifying the flow of the application program under the shadow shuttle VPN based on the distribution characteristics and the random forest algorithm as claimed in claim 1, wherein the statistics of the data packet lengths in step (3a) comprise the average value, minimum value, maximum value, absolute difference, absolute median difference, standard deviation, variance, skewness, kurtosis, and the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th and 90th percentiles of the data packet lengths.
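The eighteen statistics listed in claim 2 might be computed along the lines of the sketch below, which uses only the Python standard library. Several points are assumptions of this sketch and not statements of the claim: the function name length_statistics, the reading of "absolute difference" as mean absolute deviation and of "absolute median difference" as median absolute deviation, the use of population forms for standard deviation and variance, non-excess kurtosis, and the inclusive percentile method.

```python
from statistics import mean, median, pstdev, pvariance, quantiles


def length_statistics(lengths):
    """Return 18 statistics of the packet lengths of one flow."""
    mu = mean(lengths)
    med = median(lengths)
    sd = pstdev(lengths)
    mean_abs_dev = mean(abs(x - mu) for x in lengths)      # assumed meaning
    median_abs_dev = median(abs(x - med) for x in lengths)  # assumed meaning
    if sd > 0:
        skew = mean(((x - mu) / sd) ** 3 for x in lengths)
        kurt = mean(((x - mu) / sd) ** 4 for x in lengths)
    else:
        skew = kurt = 0.0          # constant lengths: moments undefined
    if len(lengths) >= 2:
        # Nine cut points: 10th, 20th, ..., 90th percentile.
        deciles = quantiles(lengths, n=10, method="inclusive")
    else:
        deciles = [lengths[0]] * 9
    return [mu, min(lengths), max(lengths), mean_abs_dev, median_abs_dev,
            sd, pvariance(lengths), skew, kurt] + deciles
```

Prepending the packet count of step (3a) to this list would give a 19-element statistical feature vector CS_k under these assumptions.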
CN201910852780.5A 2019-09-10 2019-09-10 Application program flow identification method under VPN based on distributed feature random forest Active CN110460502B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910852780.5A CN110460502B (en) 2019-09-10 2019-09-10 Application program flow identification method under VPN based on distributed feature random forest

Publications (2)

Publication Number Publication Date
CN110460502A CN110460502A (en) 2019-11-15
CN110460502B true CN110460502B (en) 2022-03-04

Family

ID=68491533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910852780.5A Active CN110460502B (en) 2019-09-10 2019-09-10 Application program flow identification method under VPN based on distributed feature random forest

Country Status (1)

Country Link
CN (1) CN110460502B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111541621B (en) * 2019-12-25 2021-09-07 西安交通大学 VPN flow classification method based on turn packet interval probability distribution
CN112235254B (en) * 2020-09-22 2023-03-24 东南大学 Rapid identification method for Tor network bridge in high-speed backbone network
CN112217834B (en) * 2020-10-21 2021-06-18 北京理工大学 Internet encryption flow interactive feature extraction method based on graph structure
CN113067724B (en) * 2021-03-11 2022-04-19 西安电子科技大学 Periodic flow prediction method based on random forest
CN113283498A (en) * 2021-05-21 2021-08-20 东南大学 VPN flow rapid identification method facing high-speed network
CN114338400B (en) * 2021-12-31 2024-05-14 中国电信股份有限公司 SDN network dynamic control method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104504365A (en) * 2014-11-24 2015-04-08 闻泰通讯股份有限公司 System and method for smiling face recognition in video sequence
CN109144837A (en) * 2018-09-04 2019-01-04 南京大学 A kind of user behavior pattern recognition methods for supporting precisely to service push
CN109726735A (en) * 2018-11-27 2019-05-07 南京邮电大学 A kind of mobile applications recognition methods based on K-means cluster and random forests algorithm
CN109729099A (en) * 2019-03-08 2019-05-07 西安电子科技大学 A kind of Internet of Things traffic flow analysis method based on Android VPNService
CN109951444A (en) * 2019-01-29 2019-06-28 中国科学院信息工程研究所 A kind of encryption Anonymizing networks method for recognizing flux

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10356117B2 (en) * 2017-07-13 2019-07-16 Cisco Technology, Inc. Bayesian tree aggregation in decision forests to increase detection of rare malware

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Koki Kodama等.A New Method to Evaluate Flow Classified One-to-all Reliability.《The 18th Asia-Pacific Network Operations and Management Symposium (APNOMS) 2016》.2016, *
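李瑞杏. Research on shadowsocks anonymous traffic identification technology based on website fingerprinting. Wanfang Dissertations (《万方学位论文》). 2018, *
王琳. Research on SSL VPN encrypted traffic identification based on a hybrid method. Computer Applications and Software (《计算机应用与软件》). 2019, *
赵双 et al. Survey and prospects of machine-learning-based traffic identification techniques. Computer Engineering & Science (《计算机工程与科学》). 2018, *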

Also Published As

Publication number Publication date
CN110460502A (en) 2019-11-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant