CN110417729B - Service and application classification method and system for encrypted traffic - Google Patents

Service and application classification method and system for encrypted traffic

Info

Publication number
CN110417729B
CN110417729B CN201910504060.XA CN201910504060A CN110417729B CN 110417729 B CN110417729 B CN 110417729B CN 201910504060 A CN201910504060 A CN 201910504060A CN 110417729 B CN110417729 B CN 110417729B
Authority
CN
China
Prior art keywords
flow
traffic
encrypted
session
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910504060.XA
Other languages
Chinese (zh)
Other versions
CN110417729A (en
Inventor
崔苏苏
卢志刚
姜波
徐健锋
刘松
崔泽林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201910504060.XA
Publication of CN110417729A
Application granted
Publication of CN110417729B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441 Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload

Abstract

The invention discloses a method and a system for classifying the services and applications of encrypted traffic. The method comprises the following steps: 1) segmenting the continuous traffic to be processed into a plurality of session flows according to session granularity; 2) segmenting each session flow according to packet granularity into a plurality of traffic groups, wherein the number of packets in each traffic group does not exceed a set maximum value; 3) unifying the sizes of the traffic groups, converting each traffic group into a traffic matrix, and packaging the traffic matrices and their labels into an IDX traffic file; 4) training a CapsNet model with the IDX traffic file to obtain an identification model with automatic feature selection capability; 5) for encrypted traffic to be identified, segmenting it, converting it into traffic matrices, and inputting the traffic matrices into the identification model to obtain the service type and application category of the traffic. The invention can effectively classify encrypted traffic.

Description

Service and application classification method and system for encrypted traffic
Technical Field
The invention provides a service and application classification method for encrypted traffic. It proposes a novel two-pass traffic segmentation mechanism and combines it with a capsule neural network (CapsNet) to classify encrypted traffic effectively.
Background
In recent years, with the continuous development of Internet and information technology, network traffic has grown explosively. According to the Visual Networking Index forecast published by Cisco, IP traffic transmitted over public and private networks (including managed IP traffic, consumer-generated mobile data traffic, and Internet traffic) averaged 122 EB per month worldwide in 2017 (1 EB = 2^20 TB), and global IP traffic is forecast to reach 396 EB per month by 2022, a roughly threefold increase. Meanwhile, as users' demands on the network keep changing, new services emerge endlessly. These services bring convenience to users but also increase the heterogeneity and complexity of the network, which poses unprecedented challenges to network security.
Network security has become one of the core problems facing the Internet in recent years. Malicious network behaviors such as information leakage, illegal intrusion, and DDoS attacks increasingly affect the way users use the Internet, and as technology advances, the traffic characteristics of malicious attacks become ever more complex and better hidden. According to data from the Identity Theft Resource Center, nearly 34.2 million records had been stolen by September 2018. According to the 13th annual infrastructure security report of Arbor Networks, the peak DDoS attack volume in the first half of 2018 reached 1.7 Tbps, an increase of 179% over the first half of 2017, and by 2022 the total number of DDoS attacks worldwide is expected to reach 14.5 million, roughly double the 2017 level. Network managers need to classify and identify network traffic in order to locate abnormal behavior quickly and accurately, cut off the propagation path of a malicious intrusion in time, and minimize the harm and loss to users. Traffic identification can also uncover unknown and disguised webshells, restore the whole attack process along the Kill Chain, and support in-depth analysis and profiling of the attacker, the attack tools, and the attack techniques.
The classification and identification of network traffic runs through every module of security situation awareness and is an essential part of it. Many network traffic classification and identification technologies have been proposed. They can be roughly divided into port-based, deep-packet-inspection-based, statistics-based, and behavior-based traffic identification.
These traffic identification technologies work well for traditional network applications. However, since the exposure of the PRISM surveillance program, the volume of encrypted network traffic worldwide has grown continuously. A 2018 Sandvine report shows that more than 50% of Internet traffic is encrypted, and the share continues to grow. To evade detection by firewalls and antivirus software, most malware also encrypts its traffic to hide its communications. Traffic encryption has become the de facto standard practice for almost all network applications, malware included. When content cannot be interpreted, identification based on encrypted traffic becomes an important means of detecting security threats: it analyzes key information such as network behavior and process behavior, restores the attack process through Kill Chain analysis, and provides threat-handling suggestions to security administrators.
Although traffic identification research has produced many results, most of them target unencrypted traffic, and traditional traffic identification techniques do not carry over to encrypted traffic in practice. With the advent of P2P applications and the widespread use of dynamic port numbers, identifying traffic by port number is no longer effective, and port obfuscation techniques limit it further. Deep packet inspection cannot identify the growing volume of encrypted traffic because the payload features are hidden, and encapsulation techniques such as tunneling limit it further. In addition, because deep packet inspection works by analyzing application-layer data, it raises the problem of invading user privacy. The lack of effective techniques for analyzing and managing encrypted traffic poses huge challenges to network management and security.
At present there are many machine-learning-based methods for identifying encrypted traffic. However, traditional machine learning requires manual feature extraction, and its classification accuracy depends heavily on feature selection, which limits the scalability of such methods and prevents real-time classification. Deep learning is an effective way to remove manual feature extraction: it extracts features automatically from the input data without human intervention and builds models that interpret data in a way loosely modeled on the human brain. Applying it to the identification of encrypted Internet traffic is a new attempt.
Existing deep-learning-based encrypted traffic identification algorithms mainly comprise the multilayer perceptron (MLP), the stacked autoencoder (SAE), and the one-dimensional convolutional neural network (1dCNN). In comparisons reported by many researchers, deep-learning-based algorithms achieve higher identification accuracy than traditional machine learning, and among them 1dCNN achieves the best results.
However, 1dCNN assumes that features are position-independent: it considers only whether a feature is present, not its position or other attributes. We consider the position of specific strings in the traffic and the order of the packets to be features in their own right. Moreover, in encrypted traffic identification the encoded traffic files are not equivalent to image files, so the pooling operations of CNNs are no longer suitable. Whether max pooling or min pooling is used, some information is undoubtedly discarded, which alters the features carried by the encoded strings.
Disclosure of Invention
To address the technical problems in the prior art, the invention provides a service and application classification method and system for encrypted traffic. The invention discloses an encrypted traffic classification model that combines a capsule neural network (CapsNet) with a two-pass segmentation mechanism. The model is named SPCaps and can effectively classify encrypted traffic. The invention handles two classification scenarios: service classification, i.e., classifying encrypted traffic by service type, such as web browsing, streaming media, or instant messaging; and application classification, i.e., classifying encrypted traffic by the application that generated it, such as Skype, BitTorrent, or YouTube.
In the data preprocessing stage, the invention provides a novel two-pass traffic segmentation mechanism and a preprocessing toolset that integrates the EditCap tool, the SplitCap tool, Powershell scripts, and Python scripts, with the aim of diluting the proportion of irrelevant traffic and increasing the weight of effective traffic. In addition, the invention trains the model with the CapsNet algorithm. In encrypted traffic classification, CapsNet makes up for the shortcomings of 1dCNN in the following respects: 1) CapsNet represents features as vectors: the length of a vector represents the probability that the traffic belongs to a class, and its direction represents attributes of the class, including the fixed position of specific strings in the traffic and the order of packets. 2) CapsNet does not use the pooling operation of convolutional neural networks. Pooling reduces the number of connection parameters and refines features, but it also discards necessary information, so dropping it makes CapsNet better suited to encoded files such as traffic. 3) At the same identification accuracy, CapsNet identifies traffic faster than CNN, which makes it better suited to traffic identification in a real-time environment.
In order to achieve the purpose, the invention adopts the specific technical scheme that:
An identification method for encrypted traffic comprises the following steps:
1) First segmentation according to session granularity: deep-learning-based traffic classification requires continuous traffic to be segmented into discrete units at a certain granularity. Network traffic can be segmented at five granularities: TCP connection, flow, session, service, and host. Flows and sessions are the forms most used in current research, so the invention first segments the original traffic to be processed according to session granularity. A session is the set of packets of a bidirectional flow, i.e., packets sharing the same five-tuple (source IP, source port, destination IP, destination port, transport layer protocol), where the source and destination may be interchanged.
2) Encrypted traffic cleaning: in traffic classification, the MAC addresses of the data link layer and the IP addresses of the network layer (source IP, destination IP) cannot serve as classification features. If the traffic is captured in a limited environment, the MAC and IP addresses can bias model training and cause overfitting, so the MAC address and IP address fields are deleted from each packet.
3) Second segmentation according to packet granularity: traffic collected from a real network contains packets unrelated to classification, which directly affects the training and testing of the model. The traffic output by step 2) is therefore segmented further by setting a maximum number of packets per unit. Since most of the sessions being split represent normal communication, this step dilutes the proportion of irrelevant traffic in the original traffic and increases the weight of effective traffic.
4) Standardizing the input form of the encrypted traffic: training a neural network requires inputs of a fixed size, so the traffic files produced by the above steps are unified to a fixed number of bytes. A file larger than the fixed size is truncated to that size, and a file smaller than the fixed size is padded with 0x00 up to that size. The processed traffic is then converted into traffic matrices, and the traffic matrix samples and their labels are packaged into an IDX file, a standard input file format used by many CapsNet and CNN models.
5) CapsNet-based model training: using the IDX traffic file produced by the above steps, a CapsNet with convolution operations and a dynamic routing mechanism learns the spatial features of the encrypted traffic, yielding an efficient identification model with automatic feature selection capability that can classify encrypted traffic by service type and application type.
6) Encrypted traffic identification: the model trained in the above steps is used to identify and classify encrypted traffic. The method supports two scenarios: a) service classification, i.e., identifying the service type to which the encrypted traffic belongs; b) application classification, i.e., identifying the specific application to which the encrypted traffic belongs.
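Steps 1) and 2) can be illustrated with a small standard-library sketch. It is not the SplitCap/EditCap pipeline named later in the text: the packet dictionaries, the function names, and the assumption that each frame is an untagged Ethernet II frame carrying IPv4 (MAC addresses in bytes 0-11, IP addresses in bytes 26-33) are all hypothetical choices made for illustration.

```python
from collections import defaultdict

def session_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Direction-independent five-tuple: both directions of a flow share one key."""
    lo, hi = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (lo, hi, proto)

def split_sessions(packets):
    """Group packet dicts into sessions; each session is ordered by timestamp."""
    sessions = defaultdict(list)
    for pkt in packets:
        key = session_key(pkt["src_ip"], pkt["src_port"],
                          pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        sessions[key].append(pkt)
    for pkts in sessions.values():
        pkts.sort(key=lambda p: p["time"])
    return dict(sessions)

def strip_addresses(frame):
    """Drop MAC addresses (bytes 0-11) and IPv4 addresses (bytes 26-33).

    Assumes an untagged Ethernet II frame with an IPv4 header at offset 14.
    """
    return frame[12:26] + frame[34:]

# Hypothetical example: two directions of one TCP session plus one DNS query.
packets = [
    {"src_ip": "10.0.0.1", "src_port": 5000, "dst_ip": "10.0.0.2",
     "dst_port": 443, "proto": "TCP", "time": 0.0},
    {"src_ip": "10.0.0.2", "src_port": 443, "dst_ip": "10.0.0.1",
     "dst_port": 5000, "proto": "TCP", "time": 0.1},
    {"src_ip": "10.0.0.1", "src_port": 5001, "dst_ip": "10.0.0.3",
     "dst_port": 53, "proto": "UDP", "time": 0.2},
]
sessions = split_sessions(packets)
```

Sorting the two endpoints before building the key is one simple way to make the five-tuple symmetric, so that request and response packets fall into the same session.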
The invention also provides a service and application classification system for encrypted traffic, characterized by comprising a traffic preprocessing module, a model training module, and an encrypted traffic identification module, wherein:
the traffic preprocessing module is used to segment the continuous traffic to be processed into a plurality of session flows according to session granularity; then to segment each session flow according to packet granularity into a plurality of traffic groups, the number of packets in each traffic group not exceeding a set maximum value; and then to unify the size of each traffic group, convert each traffic group into a traffic matrix, and package the traffic matrices and their labels into an IDX traffic file;
the model training module is used to train a CapsNet model with the IDX traffic file to obtain an identification model with automatic feature selection capability;
and the encrypted traffic identification module is used to input the traffic matrix of the encrypted traffic to be identified into the identification model to obtain the service type and application category of the traffic.
Compared with the prior art, the invention has the following positive effects:
1. The invention provides a CapsNet-based encrypted traffic identification model that can take the specific position of fixed strings in the traffic and the order of packets as learned features.
2. The invention provides a two-pass traffic segmentation mechanism that dilutes the proportion of irrelevant traffic and increases the weight of effective traffic, achieving effective noise reduction while fixing the traffic representation.
3. The invention evaluates the SPCaps model on the publicly available ISCX VPN-nonVPN dataset. The experimental results show that SPCaps outperforms state-of-the-art identification methods in both the service and the application identification tasks for encrypted traffic.
Drawings
Fig. 1 is an overall flow chart of the present invention.
Fig. 2 is a schematic diagram of the two-pass traffic segmentation mechanism of the present invention.
Fig. 3 is an architecture diagram of the CapsNet-based model of the present invention.
Fig. 4 is the size distribution of the original traffic in the ISCX VPN-nonVPN dataset at session granularity.
Detailed Description
In order to make the technical solutions in the embodiments of the present invention better understood and make the objects, features, and advantages of the present invention more comprehensible, the technical core of the present invention is described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention designs a service and application classification method for encrypted traffic. The general idea is to segment, clean, and standardize encrypted traffic captured in a real environment with a preprocessing toolset, which dilutes the proportion of irrelevant traffic and increases the weight of effective traffic; then to build a CapsNet-based model that learns the spatial features of the encrypted traffic; and finally to identify encrypted traffic and classify it by service and application.
The overall flow chart of the invention is shown in fig. 1, and the details of the steps of the method are described as follows:
(1) Conversion of raw traffic
In the data preprocessing stage, in order to reduce the noise in the original traffic and normalize its input form, the original traffic is converted in five steps: Pcap-Sessions segmentation, deletion of MAC and IP addresses, Session-Packets segmentation, unification of the input size, and conversion to IDX.
1) Pcap-Sessions segmentation: deep-learning-based traffic identification needs continuous traffic to be segmented into discrete units at a certain granularity. The original traffic $P$ is a set of packets, denoted $P = \{p_1, \ldots, p_{|P|}\}$, where a packet $p_i$ is defined as:
$$p_i = (x_i, b_i, t_i) \qquad (1)$$
where $i = 1, 2, \ldots, |P|$, $b_i \in (0, \infty)$, $t_i \in [0, \infty)$; $x_i$ is the five-tuple of the $i$-th packet (source IP, source port, destination IP, destination port, transport layer protocol), $b_i$ is the byte length of the $i$-th packet, and $t_i$ is the start time of the $i$-th packet. The original traffic is segmented for the first time according to session granularity. A session $S_i$ is the set of packets of a bidirectional flow sharing the same five-tuple, defined as:
$$S_i = \{p_1 = (x_1, b_1, t_1), \ldots, p_n = (x_n, b_n, t_n)\} \qquad (2)$$
where $x_1 = \cdots = x_n$, $t_1 < \cdots < t_n$, and $n$ is the number of packets in $S_i$. This step is implemented with the SplitCap tool.
2) Deletion of MAC and IP addresses: the MAC address and the IP address cannot be used as features during training. On the contrary, their presence easily causes the model to overfit, so they are deleted by discarding the bytes at the corresponding positions in each packet. This step is implemented with the EditCap tool.
3) Session-Packets segmentation: network traffic captured in a real environment contains a large number of small sessions, which are often unrelated to traffic identification, such as SNMP, DNS, and ARP exchanges, and which seriously hinder effective identification. Larger sessions carry the main activity of the communication and contain only a small number of irrelevant packets, so we propose a Session-Packets segmentation method to dilute the weight of irrelevant traffic and increase the weight of effective traffic. The method further segments each session (i.e., each discrete unit) by setting a maximum number of packets, yielding several traffic groups per session, each containing at most the set maximum number of packets. A traffic group obtained by Session-Packets segmentation is defined as:
$$G_{ij} = \{p_1, \ldots, p_m\} \subseteq S_i, \quad m \le C \qquad (3)$$
where $G_{ij}$ is the $j$-th traffic group of the $i$-th session $S_i$, $m$ is the number of packets in $G_{ij}$, and $C$ is the maximum number of packets, defined as:
$$C = \frac{L_{sample} - L_{header}}{L_{packet}} \qquad (4)$$
where $L_{sample}$ is the byte length of the file storing a traffic group, $L_{header}$ is the byte length of that file's header, and $L_{packet}$ is the byte length of a packet. Before being converted into an IDX file, a traffic group is stored as a pcap file, which contains, in addition to the traffic data, a file header identifying the file (many file formats, such as .jpg, carry such a header); this header occupies 112 bytes. The maximum value $C$ is the same for all discrete units, and $C$ is set to 16. The byte length of a traffic group is unified to 784 bytes, the minimum byte length of a packet after the MAC and IP addresses are deleted is 40 bytes, and the file header is fixed at 112 bytes after the above processing, so the theoretical maximum is $(784 - 112)/40 = 16.8$ packets. The number of packets must be an integer, and splitting on an odd number easily disturbs the communication order between packets, so $C$ is set to 16. We want to make full use of a traffic group to predict the whole session: in our view, the more packets a traffic group contains, the more representative it is, so we make $C$ as large as possible. The two-pass segmentation mechanism (Pcap-Sessions segmentation followed by Session-Packets segmentation) is summarized in Fig. 2: the original traffic first undergoes Pcap-Sessions segmentation according to the session form of traffic, and the session traffic then undergoes Session-Packets segmentation by setting the maximum number $C$ of packets. This step is implemented with the EditCap tool.
4) Unification of the input size: a neural network requires inputs of a fixed size, so each traffic group is unified to 784 bytes. If a traffic group is larger than 784 bytes, only the first 784 bytes are kept; if it is smaller than 784 bytes, it is padded up to 784 bytes with a set byte value (e.g., 0x00). This step is implemented with a Powershell script.
5) Conversion to IDX: the 784 bytes of each traffic group are converted into a 28 × 28 traffic matrix, i.e., the one-dimensional sequence of 784 bytes is arranged in order into a 28 × 28 matrix. The traffic matrices and their labels are then packaged into IDX files, the standard input for many CapsNet and CNN models. This step is implemented with a Python script.
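Steps 3) to 5) can be sketched with the Python standard library alone. This is a simplified illustration rather than the patent's EditCap/Powershell/Python toolchain: packets are plain byte strings, a traffic group is just their concatenation (the real pipeline stores a pcap file with its header), and all function names are hypothetical. The IDX header layout follows the MNIST-style convention (two zero bytes, type code 0x08 for unsigned bytes, number of dimensions, then big-endian dimension sizes).

```python
import struct

SAMPLE_LEN, HEADER_LEN, MIN_PACKET_LEN = 784, 112, 40  # values stated in the text

def max_packets(sample_len=SAMPLE_LEN, header_len=HEADER_LEN,
                packet_len=MIN_PACKET_LEN):
    """Bound (784 - 112) / 40 = 16.8, rounded down to an even integer."""
    c = int((sample_len - header_len) / packet_len)
    return c - (c % 2)

def split_groups(session, c):
    """Session-Packets split: consecutive groups of at most c packets."""
    return [session[i:i + c] for i in range(0, len(session), c)]

def unify(data, size=SAMPLE_LEN):
    """Keep the first `size` bytes, or right-pad with 0x00 up to `size`."""
    return data[:size].ljust(size, b"\x00")

def to_matrix(data, side=28):
    """Arrange side*side bytes row by row into a side x side matrix."""
    return [list(data[r * side:(r + 1) * side]) for r in range(side)]

def idx_images(samples, side=28):
    """Serialize fixed-size samples as a 3-D unsigned-byte IDX tensor."""
    header = struct.pack(">BBBBIII", 0, 0, 0x08, 3, len(samples), side, side)
    return header + b"".join(samples)

def idx_labels(labels):
    """Serialize integer labels (0-255) as a 1-D unsigned-byte IDX tensor."""
    return struct.pack(">BBBBI", 0, 0, 0x08, 1, len(labels)) + bytes(labels)

# Hypothetical session: 40 packets of 50 bytes each.
session = [bytes([i]) * 50 for i in range(40)]
groups = split_groups(session, max_packets())          # sizes 16, 16, 8
samples = [unify(b"".join(group)) for group in groups]  # 784 bytes each
matrix = to_matrix(samples[0])                          # 28 x 28
```

Rounding the bound down to an even integer is one reading of the remark that odd splits disturb the communication order; with the stated values it yields C = 16.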
(2) CapsNet-based training model
Based on the CapsNet algorithm, the invention uses the traffic matrices and labels packaged in IDX files as the dataset to build a service classification model and an application classification model for encrypted traffic. The algorithm mainly comprises convolution operations and dynamic routing, and its architecture is shown in Fig. 3.
1) Convolution operation
The model first reads the 28 × 28 traffic matrices produced by the preprocessing above and normalizes them. In the first layer, a convolutional layer with ReLU activation, 256 convolution kernels of size 9 × 9 are applied with stride 1 to each traffic matrix, producing 256 feature maps of size 20 × 20. The second convolutional layer, PrimaryCaps, serves as the input layer of the capsules and builds the vector structure. PrimaryCaps performs 8 convolution operations with different weights on the 256 feature maps; each uses 32 convolution kernels of size 9 × 9 with stride 2, finally producing 6 × 6 × 32 8-dimensional vectors, i.e., activity vectors, each of which is a capsule unit composed of 8 ordinary convolution units.
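The layer sizes quoted above can be checked with a few lines of arithmetic. The sketch assumes unpadded ("valid") convolutions, which the text does not state explicitly but which is the only reading consistent with the 28 to 20 reduction; the helper name is hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1

conv1 = conv_out(28, 9, 1)              # first convolutional layer
primary = conv_out(conv1, 9, 2)         # PrimaryCaps grid width
num_capsules = primary * primary * 32   # 8-dimensional capsules per sample
print(conv1, primary, num_capsules)     # prints: 20 6 1152
```

Under this assumption each of the 32 capsule channels forms a 6 × 6 grid, giving 1152 low-level capsules in total.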
2) Dynamic routing
The third layer of the network, DigitCaps, passes on and updates the capsule inputs in two steps: affine transformation and dynamic routing. In the affine transformation, the activity vector $u_i$ output by the lower PrimaryCaps layer is multiplied by a weight matrix $W_{ij}$ to obtain the prediction vector
$$\hat{u}_{j|i} = W_{ij} u_i$$
The input $s_j$ of a high-level capsule is the weighted sum of the prediction vectors $\hat{u}_{j|i}$, defined as:
$$s_j = \sum_i c_{ij} \hat{u}_{j|i} \qquad (5)$$
where each activity vector $u_i$ has its own weight matrix $W_{ij}$; $W_{ij}$ is initialized with random numbers drawn from a standard normal distribution and updated through the loss function; and $c_{ij}$ is the coupling coefficient determined by iterative dynamic routing.
The dynamic routing mechanism aims to find the best path between a capsule's output and the input of the next layer of capsules. One way to find this best path is to search iteratively for the prediction that best matches the output. The degree of match is measured by the inner product of the high-level capsule's output vector and the affine-transformed prediction vector, and it is added directly to the routing logit $b_{ij}$ from which $c_{ij}$ is computed. In the invention, the number of iterations is set to 3 after repeated parameter tuning. The update formula for $c_{ij}$ in formula (5) is:
$$c_{ij} = \mathrm{softmax}(b_{ij}) \qquad (6)$$
where $b_{ij}$ is the log prior probability that capsule $i$ is coupled to capsule $j$.
The length of a capsule's output vector represents the probability of belonging to a certain class, so its value range should be [0, 1]. This is achieved by a compression (squashing) function, defined as:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|} \qquad (7)$$
where $v_j$ is the output vector of capsule $j$ and $s_j$ is the input vector of capsule $j$.
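Formulas (5) to (7) and the three routing iterations can be sketched in plain Python. This is a minimal illustration of routing-by-agreement as commonly described for CapsNet, not the patent's implementation; the nested-list layout `u_hat[i][j]` (prediction vector from low-level capsule i for high-level capsule j) and the function names are assumptions made here.

```python
import math

def squash(s):
    """Formula (7): scale the vector length into [0, 1), keep the direction."""
    n2 = sum(x * x for x in s)
    if n2 == 0.0:
        return [0.0] * len(s)
    scale = n2 / (1.0 + n2) / math.sqrt(n2)
    return [scale * x for x in s]

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, iterations=3):
    """Dynamic routing over prediction vectors u_hat[i][j]."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]         # routing logits b_ij
    v = []
    for _ in range(iterations):
        c = [softmax(row) for row in b]              # formula (6)
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]              # formula (5)
            v.append(squash(s_j))
        for i in range(n_in):                        # agreement update
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v

# Three low-level capsules agree on class 0 and disagree on class 1.
u_hat = [[[1.0, 0.0], [1.0, 0.0]],
         [[1.0, 0.0], [-1.0, 0.0]],
         [[1.0, 0.0], [0.0, 1.0]]]
v = route(u_hat)
norms = [math.sqrt(sum(x * x for x in vec)) for vec in v]
```

In this toy input the capsule for class 0 ends with the longer output vector, because all three predictions for it point the same way while the predictions for class 1 partly cancel.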
$W_{ij}$ and the other convolution parameters of the whole network are updated through a loss function. We adopt the margin loss as the loss function, defined as:
$$L_c = T_c \max(0, m^+ - \|v_c\|)^2 + \lambda (1 - T_c) \max(0, \|v_c\| - m^-)^2 \qquad (8)$$
where $c$ is a class and $T_c$ is an indicator function: $T_c = 1$ when $c$ is the correct class, and $T_c = 0$ otherwise. $m^+$ is the upper margin on the vector length $\|v_c\|$, and $m^-$ is the lower margin on the vector length $\|v_c\|$. In addition, we scale the reconstruction loss down by 0.0005 so that it does not dominate the margin loss during training.
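The margin loss can be sketched as follows, summed over all classes for one sample. The squared hinge terms and the default values m+ = 0.9, m- = 0.1, and λ = 0.5 follow the commonly used CapsNet formulation; they are assumptions here, since the text does not state the values it used, and the function name is hypothetical.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum of per-class margin losses for one sample.

    lengths[c] is the length of the output vector of class c;
    target is the index of the correct class.
    """
    total = 0.0
    for c, norm in enumerate(lengths):
        t = 1.0 if c == target else 0.0
        present = max(0.0, m_plus - norm) ** 2   # penalize a short correct capsule
        absent = max(0.0, norm - m_minus) ** 2   # penalize a long wrong capsule
        total += t * present + lam * (1.0 - t) * absent
    return total

confident = margin_loss([0.95, 0.05], target=0)  # both terms inside the margins
unsure = margin_loss([0.5, 0.5], target=0)       # both capsules violate a margin
```

A sample whose correct capsule is longer than m+ and whose other capsules are shorter than m- contributes no loss, which is why the first call returns zero.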
When a traffic matrix to be identified passes through the CapsNet, N 16-dimensional vectors are output, where N is the total number of traffic classes. The length of each vector represents the probability that the traffic belongs to the corresponding class, and its direction represents attributes of the traffic, including the position of fixed strings and the order of packets. The N 16-dimensional vectors are then passed through a softmax classifier, which outputs the probability that the traffic matrix belongs to each class. The class with the highest probability is the predicted class of the traffic and is the final output of the model.
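The final prediction step can be sketched as below: vector lengths are computed, passed through a softmax, and the most probable class is returned. The three class labels are taken from the service types mentioned in the text, but the capsule values and the function name are made up for illustration.

```python
import math

def classify(capsules, labels):
    """Return the predicted label and per-class probabilities."""
    lengths = [math.sqrt(sum(x * x for x in vec)) for vec in capsules]
    peak = max(lengths)
    exps = [math.exp(n - peak) for n in lengths]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

labels = ["Chat", "Streaming", "VoIP"]
# Hypothetical 16-dimensional outputs with lengths 0.2, 0.8 and 0.4.
capsules = [[0.05] * 16, [0.2] * 16, [0.1] * 16]
label, probs = classify(capsules, labels)
```

Because softmax is monotonic, the predicted class is always the one whose capsule vector is longest; the softmax only turns the lengths into a distribution that sums to one.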
(3) Identification of encrypted traffic and application and service classification
The encrypted traffic is identified and classified with the model trained in the above steps. That is, traffic to be identified and classified is first segmented and converted into traffic matrices, which are then input into the trained model to obtain the class of the traffic, covering 1) service classification and 2) application classification.
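The segmentation and conversion step can be sketched as below. The limit of 16 packets and the 784-byte, 28 × 28 matrix size are the values given elsewhere in this document. Everything else is an assumption made for illustration: the helper names are hypothetical, groups are formed from consecutive packets, whole packets are concatenated, and short groups are zero-padded. The deletion of MAC and IP addresses is omitted.

```python
import numpy as np

MAX_PACKETS = 16    # maximum number of packets per traffic group
SAMPLE_BYTES = 784  # unified group size, reshaped to 28 x 28

def split_session(packets, max_packets=MAX_PACKETS):
    """Split one session (a list of packet byte strings) into consecutive groups."""
    return [packets[i:i + max_packets]
            for i in range(0, len(packets), max_packets)]

def group_to_matrix(group, size=SAMPLE_BYTES):
    """Concatenate the packets, truncate or zero-pad to `size`, reshape to a square."""
    raw = b"".join(group)[:size]
    raw = raw + b"\x00" * (size - len(raw))
    side = int(size ** 0.5)
    return np.frombuffer(raw, dtype=np.uint8).reshape(side, side)
```

A session of 40 packets would yield three groups of 16, 16 and 8 packets under this scheme, and each group becomes one 28 × 28 matrix of byte values that can be fed to the trained model.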
(4) Comparison of Experimental results
To verify the validity of the present invention, we used the ISCX VPN-nonVPN dataset as raw data. It contains 150 raw traffic files covering 6 classes of regular encrypted traffic (Chat, Streaming, VoIP, etc.) and 6 classes of VPN traffic (VPNChat, VPNStreaming, VPNVoIP, etc.). In addition, 9 raw traffic files contain traffic generated by 5 different applications and captured through the Tor software. Since Tor only supports encrypted links and TCP flows on the internet, its traffic is difficult to track and analyze. We therefore extract these files to perform application classification of Tor traffic. Finally, the effectiveness of the invention is evaluated by comparing its accuracy, precision, recall, and F1 values with those of existing methods.
Specifically, we divided the experiments into three parts: 1) evaluating and comparing the effectiveness of deleting the MAC and IP addresses and of the secondary segmentation mechanism in data preprocessing; 2) evaluating and comparing the effectiveness of SPCaps in the service classification task for encrypted traffic; 3) evaluating and comparing the effectiveness of SPCaps in the application classification task for encrypted traffic.
1) Preprocessing results
We preprocess the ISCX VPN-nonVPN dataset with the raw traffic conversion steps described above. After the Pcap-Sessions segmentation, we count the byte-size distribution of the session traffic used for service classification, as shown in fig. 4.
It can be seen that the size distribution of session traffic is highly unbalanced: more than 50% of the sessions in these 12 traffic classes are smaller than 0.5 KB, and most of them are unrelated to the classification task. In particular, more than 80% of the Chat, Email, File, and Voip sessions are smaller than 0.2 KB. The size distribution of the session traffic thus confirms that the Session-Packets segmentation in the preprocessing step is necessary and reasonable. According to equation (4), we set the maximum number of Packets per Session to 16 in the Session-Packets segmentation step. Finally, for the service classification task on encrypted traffic, the class names, the applications they include, and the total traffic counts are summarized in table 1.
TABLE 1. Sample content for encrypted traffic service classification
Category | Applications | Total
Chat | AIM, Facebook, Hangouts, ICQ, Skype | 11365
Email | Email, Gmail | 12822
File | Ftps, SCP, Sftp, Skype | 19553
P2P | Torrent | 60000
Streaming | Facebook, Hangouts, Netflix, Skype, Spotify, Vimeo, YouTube | 21273
Voip | Facebook, Hangouts, Skype, Voipbuster | 21000
VPNChat | AIM, Facebook, Hangouts, ICQ, Skype | 13710
VPNEmail | Email | 2890
VPNFile | Ftps, Sftp, Skype | 17528
VPNP2P | Bittorrent | 6000
VPNStreaming | Facebook, Netflix, Spotify, Vimeo, YouTube | 12000
VPNVoip | Hangouts, Skype, Voipbuster | 14805
2) Comparison of preprocessing methods
In the raw traffic conversion process, we propose deleting the MAC and IP addresses to avoid overfitting, and we propose Session-Packets segmentation, which segments conventional session traffic a second time. In addition, to demonstrate that CapsNet is better suited to traffic classification than 1dCNN, we run both neural network algorithms in each experiment. We therefore performed six different encrypted traffic service classification tasks on the ISCX VPN-nonVPN dataset; the experimental results are shown in table 2.
Table 2. Results of the preprocessing comparison experiments
Figure BDA0002091181010000091
Figure BDA0002091181010000101
The results show that, for both 1dCNN and CapsNet, our proposed deletion of MAC and IP addresses and our Session-Packets segmentation yield better classification results. In addition, the comparison of the two neural networks shows that CapsNet achieves higher classification accuracy and F1 values than 1dCNN.
3) Comparison of encrypted traffic service classifications
To evaluate and compare the effectiveness of SPCaps in encrypted traffic service classification, we performed experiments on the 12 traffic classes in ISCX VPN-nonVPN. As shown in table 3, the accuracy reaches 99.1%, and the precision and recall of every class exceed 97%.
Table 3. Experimental results of encrypted traffic service classification
Figure BDA0002091181010000102
Next, we compare SPCaps with existing baseline methods on encrypted traffic service classification; the comparison results are shown in table 4. SPCaps achieves the best recall, precision, and F1 value among the compared methods and reaches the standard required for practical application.
Table 4. Comparison of SPCaps and baseline methods for encrypted traffic service classification
Method | Input form | Recall (%) | Precision (%) | F1 value (%)
SPCaps | Session-Packets | 99.3 | 99.3 | 99.3
1dCNN | Session | 90.6 | 88.9 | 89.7
SAE | Deep Packets | 92 | 92 | 92
1dCNN | Deep Packets | 94 | 93 | 93
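The F1 values in table 4 can be cross-checked against the listed precision and recall, assuming F1 is the harmonic mean of the two (the usual definition, though the text does not state it). The helper name `f1_score` is hypothetical and the rows are copied from table 4.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (method, input form, recall, precision, reported F1) as listed in table 4
rows = [
    ("SPCaps", "Session-Packets", 99.3, 99.3, 99.3),
    ("1dCNN", "Session", 90.6, 88.9, 89.7),
    ("SAE", "Deep Packets", 92, 92, 92),
    ("1dCNN", "Deep Packets", 94, 93, 93),
]

for method, form, recall, precision, reported in rows:
    computed = round(f1_score(precision, recall), 1)
    print(f"{method:7s} {form:16s} computed={computed} reported={reported}")
```

Running the loop prints the recomputed value next to the reported one for each row, which makes it easy to see whether the table is internally consistent up to rounding.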
4) Comparison of encrypted traffic application classifications
To evaluate the effectiveness of SPCaps in the application classification task for Tor traffic, we performed experiments on the traffic of 5 different applications captured through Tor in ISCX VPN-nonVPN; the experimental results are shown in table 5. The results show that SPCaps reaches an accuracy of 99.8% in the application classification task for Tor traffic.
Table 5. Experimental results of encrypted traffic application classification
Figure BDA0002091181010000103
Figure BDA0002091181010000111
Next, we compare SPCaps with existing baseline methods on encrypted traffic application classification; the comparison results are shown in table 6. SPCaps reaches an F1 value of 99.5, whereas the baseline methods stay at or below 36, which is a substantial improvement in Tor application classification.
Table 6. Comparison of SPCaps and baseline methods for encrypted traffic application classification
Method | Recall (%) | Precision (%) | F1 value (%)
SPCaps | 99.4 | 99.5 | 99.5
SAE | 57 | 44 | 30
1dCNN | 35 | 40 | 36
The experiments show that SPCaps classifies encrypted traffic effectively and that the experimental results reach the standard required for practical application.
The above embodiments only express implementation modes of the present invention. Their description is specific and detailed, but it should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these all fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the appended claims.

Claims (10)

1. A service and application classification method for encrypted traffic, comprising the following steps:
1) segmenting continuous traffic to be processed into a plurality of session flows according to session granularity;
2) segmenting each processed session flow according to data packet granularity, so that each session flow is segmented into a plurality of traffic groups, wherein the number of data packets in each traffic group does not exceed a set maximum value;
3) unifying the sizes of the traffic groups, converting each traffic group into a traffic matrix, and packaging the traffic matrix and its label into an IDX traffic file;
4) training a CapsNet model by using the IDX traffic file to obtain an identification model with automatic feature selection capability;
5) for encrypted traffic to be identified, segmenting the encrypted traffic, converting it into a traffic matrix, and inputting the traffic matrix into the identification model to obtain the service class and the application class of the traffic to be identified.
2. The method of claim 1, wherein the j-th traffic group in the i-th session flow $S_i$ is $G_{ij}$, where $G_{ij} = \{p_1 = (x_1, b_1, t_1), \ldots, p_i = (x_i, b_i, t_i), \ldots, p_m = (x_m, b_m, t_m)\}$,
Figure FDA0002570885370000011
m is the number of data packets in $G_{ij}$, C is the set maximum number of data packets, the i-th data packet of session flow $S_i$ is $p_i = (x_i, b_i, t_i)$, $x_i$ is the five-tuple of the i-th data packet in $G_{ij}$, $b_i$ is the byte length of the i-th data packet in $G_{ij}$, $t_i$ is the start time of the i-th data packet in $G_{ij}$, and $|S_i|$ is the total number of data packets in session flow $S_i$.
3. The method of claim 2,
Figure FDA0002570885370000012
where $L_{sample}$ denotes the byte length of the file storing the traffic group, $L_{header}$ denotes the header byte length of the file storing the traffic group, and $L_{packet}$ denotes the byte length of a data packet.
4. The method of claim 1, wherein data cleaning is performed on each session flow to delete the MAC addresses and IP addresses, and step 2) is then performed.
5. The method of claim 1, wherein each traffic group is converted into a traffic matrix by converting the one-dimensional traffic encoding sequence of the traffic group into a two-dimensional traffic matrix; each size-unified traffic group is 784 bytes, and the converted traffic matrix is a 28 × 28 traffic matrix.
6. The method of claim 1, wherein the CapsNet model is trained with the IDX traffic file as follows: first, a convolution operation is performed on each traffic matrix by a first convolution layer to generate a plurality of feature matrices; a convolution operation is then performed on the feature matrices to generate a plurality of activity vectors; each activity vector is then multiplied by its corresponding weight matrix to obtain a prediction vector, and the prediction vectors of the lower layer are weighted and summed to serve as the input of the higher-layer capsule.
7. A service and application classification system for encrypted traffic, characterized by comprising a traffic preprocessing module, a model training module, and an encrypted traffic identification module, wherein:
the traffic preprocessing module is used for segmenting continuous traffic to be processed into a plurality of session flows according to session granularity; then segmenting each session flow according to data packet granularity, so that each session flow is segmented into a plurality of traffic groups and the number of data packets in each traffic group does not exceed a set maximum value; then unifying the size of each traffic group, converting each traffic group into a traffic matrix, and packaging the traffic matrix and its label into an IDX traffic file;
the model training module is used for training a CapsNet model by using the IDX traffic file to obtain an identification model with automatic feature selection capability;
and the encrypted traffic identification module is used for inputting the traffic matrix of the encrypted traffic to be identified into the identification model to obtain the service class and the application class of the traffic to be identified.
8. The system of claim 7, wherein the j-th traffic group in the i-th session flow $S_i$ is $G_{ij}$, where $G_{ij} = \{p_1 = (x_1, b_1, t_1), \ldots, p_i = (x_i, b_i, t_i), \ldots, p_m = (x_m, b_m, t_m)\}$,
Figure FDA0002570885370000021
m is the number of data packets in $G_{ij}$, C is the set maximum number of data packets, the i-th data packet of session flow $S_i$ is $p_i = (x_i, b_i, t_i)$, $x_i$ is the five-tuple of the i-th data packet in $G_{ij}$, $b_i$ is the byte length of the i-th data packet in $G_{ij}$, $t_i$ is the start time of the i-th data packet in $G_{ij}$, and $|S_i|$ is the total number of data packets in session flow $S_i$.
9. The system of claim 8,
Figure FDA0002570885370000022
where $L_{sample}$ denotes the byte length of the file storing the traffic group, $L_{header}$ denotes the header byte length of the file storing the traffic group, and $L_{packet}$ denotes the byte length of a data packet.
10. The system of claim 7, wherein the traffic preprocessing module performs data cleaning on each session flow to delete the MAC addresses and IP addresses, and then segments each session flow according to data packet granularity.
CN201910504060.XA 2019-06-12 2019-06-12 Service and application classification method and system for encrypted traffic Active CN110417729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910504060.XA CN110417729B (en) 2019-06-12 2019-06-12 Service and application classification method and system for encrypted traffic

Publications (2)

Publication Number Publication Date
CN110417729A CN110417729A (en) 2019-11-05
CN110417729B CN110417729B (en) 2020-10-27

Family

ID=68358996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910504060.XA Active CN110417729B (en) 2019-06-12 2019-06-12 Service and application classification method and system for encrypted traffic

Country Status (1)

Country Link
CN (1) CN110417729B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant