CN110417729B

CN110417729B - Service and application classification method and system for encrypted traffic

Info

Publication number: CN110417729B
Application number: CN201910504060.XA
Authority: CN
Inventors: 崔苏苏; 卢志刚; 姜波; 徐健锋; 刘松; 崔泽林
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2019-06-12
Filing date: 2019-06-12
Publication date: 2020-10-27
Anticipated expiration: 2039-06-12
Also published as: CN110417729A

Abstract

The invention discloses a method and a system for classifying the services and applications of encrypted traffic. The method comprises the following steps: 1) segmenting the continuous traffic to be processed into a plurality of session flows according to session granularity; 2) segmenting each session flow according to packet granularity into a plurality of traffic groups, wherein the number of packets in each traffic group does not exceed a set maximum value; 3) unifying the sizes of the traffic groups, converting each traffic group into a traffic matrix, and packaging the traffic matrices and their labels into an IDX traffic file; 4) training a CapsNet model with the IDX traffic file to obtain an identification model with automatic feature selection capability; 5) for encrypted traffic to be identified, segmenting the traffic, converting it into traffic matrices, and inputting the matrices into the identification model to obtain the service type and the application category of the traffic. The invention can effectively classify encrypted traffic.

Description

Service and application classification method and system for encrypted traffic

Technical Field

The invention relates to a service and application classification method for encrypted traffic. It proposes a novel two-stage traffic segmentation mechanism and combines it with a capsule neural network (CapsNet) to classify encrypted traffic effectively.

Background

In recent years, with the continuous development of Internet technology and information science, network traffic has grown explosively. According to the Visual Networking Index forecast published by Cisco, IP traffic carried over public and private networks (including managed IP traffic, consumer-generated mobile data traffic, and Internet traffic) averaged 122 EB per month globally in 2017 (1 EB = 2^20 TB), and global IP traffic is forecast to more than triple, to 396 EB per month, by 2022. Meanwhile, as users' demands on the network keep changing, novel services emerge one after another. These new services bring convenience to users, but they also increase the heterogeneity and complexity of the network, which poses unprecedented challenges to network security.

Network security has in recent years become one of the core problems facing the Internet. Malicious network behaviors such as information leakage, illegal intrusion, and DDoS attacks increasingly affect users' use of the Internet, and as technology advances, the traffic characteristics of malicious attacks become ever more complex and concealed. According to 2018 data from the Identity Theft Resource Center, nearly 34.2 million records had been stolen by September 2018. According to the 13th annual infrastructure security report from Arbor Networks, the peak DDoS attack size in the first half of 2018 reached 1.7 Tbps, an increase of 179% over the first half of 2017, and by 2022 the total number of global DDoS attacks is expected to double compared with the second half of 2017, reaching 14.5 million. Network managers need to classify and identify network traffic in order to locate abnormal behavior quickly and accurately, cut off the propagation path of a malicious intrusion in time, and minimize the harm and loss it causes to users. Meanwhile, traffic identification technology can uncover unknown and disguised Webshells, reconstruct the whole attack process along the Kill Chain, and support in-depth analysis and profiling of the attacker, the attack tools, and the attack techniques.

The classification and identification of network traffic is an essential part of network security situational awareness and runs through each of its modules. A large number of network traffic classification and identification technologies have been proposed. They can be roughly divided into port-based, deep-packet-inspection-based, statistics-based, and behavior-based traffic identification technologies.

These network traffic identification technologies work well for traditional network applications. However, since the exposure of the PRISM surveillance program, global encrypted network traffic has been increasing steadily. A 2018 Sandvine report shows that over 50% of Internet traffic is encrypted and that the share will continue to grow. To evade detection by firewalls and antivirus software, most malware uses traffic encryption to hide its communications. Traffic encryption has become the de facto standard practice for almost all network applications, malware included. When content cannot be interpreted, encrypted traffic identification becomes an important means of detecting security threats: it analyzes key information such as network behavior and process behavior, reconstructs the attack process through Kill Chain analysis, and provides threat handling suggestions to security administrators.

Although research on traffic identification has produced many results, most of them target unencrypted traffic, and traditional traffic identification techniques are not suitable for encrypted traffic in practice. With the advent of P2P applications and the widespread use of dynamic port numbers, identifying traffic by port number is no longer effective, and the development of port obfuscation techniques limits it further. The growing volume of encrypted traffic cannot be identified by deep packet inspection because the payload characteristics are hidden, and encapsulation techniques such as tunneling limit deep packet inspection further. In addition, because deep packet inspection identifies traffic by analyzing application layer data, it raises the problem of invading user privacy. The lack of effective techniques for analyzing and managing encrypted traffic poses huge challenges to network management and security.

At present there are many machine-learning-based methods for identifying encrypted traffic, but traditional machine learning requires manual feature extraction, and its classification accuracy depends heavily on feature selection. This limits the scalability of such methods and prevents them from classifying in real time. Deep learning is an effective way to remove manual feature extraction: it extracts features automatically from the input data without human intervention, and builds models that interpret data in a manner that imitates the human brain. Applying it to the identification of encrypted Internet traffic is a new attempt.

According to existing surveys, deep-learning-based encrypted traffic identification algorithms mainly include the multilayer perceptron (MLP), the stacked autoencoder (SAE), and the one-dimensional convolutional neural network (1dCNN). In comparisons reported by many researchers, deep-learning-based algorithms achieve higher identification accuracy than traditional machine learning, and among them 1dCNN achieves the best encrypted traffic identification results.

However, 1dCNN treats features as position-independent: during recognition it considers only whether a feature is present, not where it is or what other attributes it has. We consider the position of specific strings in the traffic and the order of the packets to be features in their own right. Moreover, in encrypted traffic identification the encoded traffic files are not equivalent to image files, so the pooling operations of CNNs are not suitable for them. Whether max pooling or min pooling is used, some information is undoubtedly discarded, and the active features behind the encoded strings are altered.

Disclosure of Invention

In view of the technical problems in the prior art, the invention provides a service and application classification method and system for encrypted traffic. The invention discloses an encrypted traffic classification model, named SPCaps, that combines a two-stage segmentation mechanism with a capsule neural network (CapsNet) and can classify encrypted traffic effectively. The invention covers two classification scenarios for encrypted traffic: service classification, that is, classifying encrypted traffic by service type, such as web browsing, streaming media, and instant messaging; and application classification, that is, classifying encrypted traffic by the application it belongs to, such as Skype, BitTorrent, and YouTube.

In the data preprocessing stage, the invention proposes a novel two-stage traffic segmentation mechanism and develops a preprocessing toolset that integrates the EditCap tool, the SplitCap tool, Powershell scripts, and Python scripts, with the aim of diluting the proportion of irrelevant traffic and increasing the weight of effective traffic. In addition, the invention trains the model with the CapsNet algorithm. In encrypted traffic classification, CapsNet compensates for the shortcomings of 1dCNN, mainly in the following respects: 1) In the invention, the length of a capsule's output vector represents the probability that the traffic belongs to a class, and the direction of the vector represents the attributes of the class, including the fixed position of specific strings in the traffic and the order of the packets. 2) CapsNet does not use the pooling operation of convolutional neural networks. Pooling reduces the connection parameters and refines features, but it also discards some necessary information; by discarding pooling, CapsNet is better suited to encoded files such as traffic. 3) While maintaining identification accuracy, CapsNet identifies traffic faster than CNN, which makes it more suitable for traffic identification in real-time environments.

In order to achieve the purpose, the invention adopts the specific technical scheme that:

an identification method of encrypted traffic comprises the following steps:

1) First segmentation according to session granularity: deep-learning-based traffic classification requires that continuous traffic first be segmented into discrete units at a certain granularity. There are five granularities of network traffic segmentation: TCP connection, flow, session, service, and host. Of these, flows and sessions are the traffic representations most widely used in current research. The invention therefore first segments the original traffic to be processed at session granularity. A session is a group of packets composed of bidirectional flows, that is, packets having the same five-tuple (source IP, source port, destination IP, destination port, transport layer protocol), where the source and destination may be interchanged.

2) Encrypted traffic cleaning: in traffic classification, the MAC addresses of the data link layer and the IP addresses of the network layer (source IP, destination IP) cannot serve as classification features. If the traffic capture environment is limited, the MAC and IP addresses can affect model training to some extent and cause overfitting, so the MAC address and IP address fields are deleted from each packet.

3) Second segmentation according to packet granularity: traffic collected from a real network environment contains some packets that are irrelevant to classification, which directly affects the training and testing of the model. The traffic obtained from step 2) is therefore segmented further by setting a maximum number of packets per traffic group. Since most of the sessions being segmented represent normal communication processes, this step dilutes the proportion of irrelevant traffic in the original traffic and increases the weight of effective traffic.

4) Standardizing the input form of the encrypted traffic: training a neural network requires input of a fixed size, so the traffic files produced by the above steps are unified to a fixed number of bytes. If a traffic file is larger than the fixed size, the excess bytes are truncated; if it is smaller, it is padded with 0x00 up to the fixed size. Finally, the processed traffic is converted into traffic matrices, and the traffic matrix samples and their labels are packed into an IDX file, a standard input file format used by many CapsNet and CNN models.

5) CapsNet-based model training: using the IDX-format traffic files produced by the above steps, the spatial features of the encrypted traffic are learned through the convolution operation and the dynamic routing mechanism of CapsNet, and an efficient identification model with automatic feature selection capability is established, so that encrypted traffic can be effectively classified by service type and by application type.

6) Encrypted traffic identification: the model trained in the above steps is used to identify and classify encrypted traffic. The method achieves effective encrypted traffic classification in the following scenarios: 1) service classification, namely identifying the service type to which the encrypted traffic belongs; 2) application classification, namely identifying the specific application to which the encrypted traffic belongs.

The invention also provides a service and application classification system for encrypted traffic, characterized by comprising a traffic preprocessing module, a model training module, and an encrypted traffic identification module, wherein:

the traffic preprocessing module is used for segmenting the continuous traffic to be processed into a plurality of session flows according to session granularity; then segmenting each session flow according to packet granularity into a plurality of traffic groups, the number of packets in each traffic group not exceeding a set maximum value; then unifying the size of each traffic group, converting each traffic group into a traffic matrix, and packaging the traffic matrices and their labels into an IDX traffic file;

the model training module is used for training a CapsNet model with the IDX traffic file to obtain an identification model with automatic feature selection capability;

and the encrypted flow identification module is used for inputting the flow matrix of the encrypted flow to be identified into the identification model to obtain the service type and the application category of the flow to be identified.

Compared with the prior art, the invention has the following positive effects:

1. The invention provides a CapsNet-based encrypted traffic identification model that can use the specific position of fixed codes in the traffic and the order of the packets as learned features.

2. The invention provides a secondary segmentation mechanism of flow, which is used for diluting the proportion of irrelevant flow and increasing the weight of effective flow, and can realize effective noise reduction of the flow while determining the flow expression.

3. The invention evaluates the SPCaps model on the publicly available ISCX VPN-non VPN dataset, and the experimental results show that SPCaps outperforms the most advanced identification methods in both the encrypted traffic service identification task and the application identification task.

Drawings

FIG. 1 is an overall flow chart of the present invention.

Fig. 2 is a schematic diagram of the flow double segmentation mechanism of the present invention.

FIG. 3 is a diagram of a model architecture based on the CapsNet of the present invention.

Fig. 4 is a size distribution diagram of original traffic in an ISCX VPN-non VPN dataset at session granularity.

Detailed Description

In order to make the technical solutions in the embodiments of the present invention better understood and make the objects, features, and advantages of the present invention more comprehensible, the technical core of the present invention is described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In the invention, a service and application classification method for encrypted traffic is designed. The general idea of the method is to segment, clean and standardize the encrypted flow under the real environment through a preprocessing tool set, dilute the proportion of irrelevant flow and increase the weight of effective flow, further establish a model based on the CapsNet to learn the spatial characteristics of the encrypted flow, and finally realize effective encrypted flow identification and classification of service and application.

The overall flow chart of the invention is shown in fig. 1, and the details of the steps of the method are described as follows:

(1) conversion of raw traffic

In the data preprocessing stage of the invention, in order to reduce the noise in the original traffic and normalize its input form, the original traffic is converted in the following five steps: Pcap-Sessions segmentation, deletion of MAC and IP addresses, Session-Packets segmentation, unification of the input size, and conversion to IDX.

1) Pcap-Sessions segmentation: the deep-learning-based traffic identification method needs to segment continuous traffic into discrete units at a certain granularity. The original traffic $P$ is a set of packets, denoted $P = \{p^1, \ldots, p^{|P|}\}$, where a packet $p^i$ is defined as:

$$p^i = (x^i, b^i, t^i) \tag{1}$$

where $i = 1, 2, \ldots, |P|$, $b^i \in (0, \infty)$, $t^i \in [0, \infty)$; $x^i$ is the five-tuple of the $i$-th packet (source IP, source port, destination IP, destination port, transport layer protocol), $b^i$ is the byte length of the $i$-th packet, and $t^i$ is the start time of the $i$-th packet. The original traffic is first segmented according to session granularity. A session $S_i$ is a set of bidirectional flows sharing the same five-tuple, defined as:

$$S_i = \{p^1 = (x^1, b^1, t^1), \ldots, p^n = (x^n, b^n, t^n)\} \tag{2}$$

where $x^1 = \cdots = x^n$, $t^1 < \cdots < t^n$, and $n$ is the number of packets in $S_i$. This step is implemented with the SplitCap tool.
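The session definition above can be illustrated with a minimal Python sketch. It is not the SplitCap implementation: the `Packet` type, `session_key`, and `split_into_sessions` are hypothetical names introduced here, and the sketch assumes packets have already been parsed into five-tuple, length, and timestamp fields.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical parsed packet p^i = (x^i, b^i, t^i)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str
    length: int       # b^i, byte length
    timestamp: float  # t^i, start time in seconds


def session_key(p: Packet) -> tuple:
    """Direction-independent five-tuple: the two endpoints are sorted so that
    packets travelling A->B and B->A map to the same session."""
    a, b = (p.src_ip, p.src_port), (p.dst_ip, p.dst_port)
    return (min(a, b), max(a, b), p.proto)


def split_into_sessions(packets):
    """Group packets into sessions and order each session by time."""
    sessions = defaultdict(list)
    for p in packets:
        sessions[session_key(p)].append(p)
    return {k: sorted(v, key=lambda q: q.timestamp) for k, v in sessions.items()}


packets = [
    Packet("10.0.0.1", 5000, "10.0.0.2", 443, "TCP", 60, 0.00),
    Packet("10.0.0.2", 443, "10.0.0.1", 5000, "TCP", 1500, 0.01),
    Packet("10.0.0.1", 5001, "10.0.0.3", 53, "UDP", 80, 0.02),
]
sessions = split_into_sessions(packets)
```

Here the first two packets belong to one bidirectional session and the third forms a session of its own.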

2) Deleting the MAC address and IP address: the MAC address and IP address cannot be used as features during training; on the contrary, their presence easily causes the model to overfit. They are therefore deleted by discarding the bytes at the corresponding positions in each packet. This step is implemented with the EditCap tool.
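As an illustration of discarding bytes at fixed positions, the following sketch removes the address fields from a single raw frame. It is independent of the EditCap tool named above and rests on assumptions not stated in the source: the frame is untagged Ethernet II carrying IPv4, and `strip_addresses` is a hypothetical helper name.

```python
ETH_HEADER_LEN = 14                    # untagged Ethernet II header (assumption)
IPV4_SRC_OFFSET = ETH_HEADER_LEN + 12  # source address starts at byte 12 of the IPv4 header


def strip_addresses(frame: bytes) -> bytes:
    """Drop the two MAC addresses and, for IPv4, the source/destination IP addresses."""
    ethertype = frame[12:14]
    if ethertype != b"\x08\x00":       # not IPv4: remove the MAC addresses only
        return frame[12:]
    # Keep EtherType + first 12 IPv4 header bytes, skip 8 address bytes, keep the rest.
    return frame[12:IPV4_SRC_OFFSET] + frame[IPV4_SRC_OFFSET + 8:]


frame = (
    b"\xaa" * 6 + b"\xbb" * 6          # destination MAC, source MAC
    + b"\x08\x00"                      # EtherType: IPv4
    + b"\x01" * 12                     # first 12 bytes of the IPv4 header (dummy values)
    + bytes([192, 168, 0, 1])          # source IP
    + bytes([192, 168, 0, 2])          # destination IP
    + b"payload"
)
cleaned = strip_addresses(frame)
```

The cleaned frame is 20 bytes shorter than the input (12 bytes of MAC addresses and 8 bytes of IP addresses).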

3) Session-Packets segmentation: network traffic captured in a real environment contains a large number of small sessions, which are often irrelevant to traffic identification (for example SNMP, DNS, and ARP data segments) and seriously hinder effective identification. Since the larger sessions carry the main activity of the communication process and contain only a small number of irrelevant packets, we propose a Session-Packets segmentation method that dilutes the proportion of irrelevant traffic and increases the weight of effective traffic. Each session (that is, each discrete unit) is segmented further by setting a maximum number of packets, yielding several traffic groups per session, none of which contains more packets than the set maximum. $G$ denotes a traffic group produced by Session-Packets segmentation and is defined as:

$$G_{ij} = \{p^1 = (x^1, b^1, t^1), \ldots, p^m = (x^m, b^m, t^m)\}, \quad m \le C \tag{3}$$

where $G_{ij}$ is the $j$-th traffic group of the $i$-th session $S_i$, $m$ is the number of packets in $G_{ij}$, and $C$ is the maximum number of packets, defined as:

$$C = \frac{L_{sample} - L_{header}}{L_{packet}} \tag{4}$$

where $L_{sample}$ is the byte length of the file storing the traffic group, $L_{header}$ is the byte length of that file's header, and $L_{packet}$ is the data byte length of a packet. Before conversion to IDX, a traffic group is stored as a pcap file, which contains, in addition to the traffic data, a file header identifying the file information (files such as .txt and .jpg likewise have file headers); this header occupies 112 bytes.

The maximum value $C$ is the same for all discrete units and is set to 16, for the following reason. The byte length of a traffic group is unified to 784 bytes, the minimum byte length of a packet after the MAC and IP addresses are deleted is 40 bytes, and the file header is fixed at 112 bytes, so the theoretical maximum number of packets is (784 - 112) / 40 = 16.8. Because the number of packets must be an integer, and segmenting by an odd number easily disturbs the communication order between packets, $C$ is set to 16. We want the traffic group $G$ to be as useful as possible for predicting the whole session; in our view, the more packets a traffic group contains, the more representative it is, so we make $C$ as large as possible.

The two-stage segmentation mechanism (Pcap-Sessions segmentation followed by Session-Packets segmentation) is summarized in fig. 2: the original traffic first undergoes Pcap-Sessions segmentation according to the session traffic representation, and each session then undergoes Session-Packets segmentation by setting the maximum number of packets $C$. This step is implemented with the EditCap tool.

4) Unifying the input size: a neural network requires input of a fixed size, so each traffic group G is unified to 784 bytes. If a traffic group is larger than 784 bytes, only the first 784 bytes are kept; if it is smaller, it is padded to 784 bytes with a set byte value (e.g., 0x00). This step is implemented with a Powershell script.
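The truncate-or-pad rule is small enough to show directly. The sketch below uses Python rather than the Powershell script mentioned above, and `unify_size` is a hypothetical helper name.

```python
def unify_size(data: bytes, size: int = 784, pad: bytes = b"\x00") -> bytes:
    """Keep the first `size` bytes, then right-pad with `pad` up to `size` bytes."""
    return data[:size].ljust(size, pad)


short = unify_size(b"\x16\x03\x01")   # padded with 0x00 up to 784 bytes
long_ = unify_size(bytes(1000))       # truncated to the first 784 bytes
```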

5) Conversion to IDX: each 784-byte traffic group is converted into a 28 × 28 traffic matrix, that is, the one-dimensional sequence of 784 bytes is arranged in order into a 28 × 28 matrix. These traffic matrices and their labels are then packaged into IDX files, which are the standard input for many CapsNet and CNN models. This step is implemented with a Python script.
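A minimal sketch of this conversion is given below. It assumes the MNIST-style IDX layout (a 4-byte magic number whose third byte 0x08 means unsigned byte and whose fourth byte is the number of dimensions, followed by big-endian 32-bit dimension sizes and the raw data) and assumes that samples and labels go into separate IDX payloads; the source does not state either detail. The function names are hypothetical.

```python
import struct


def to_matrix(sample: bytes, side: int = 28):
    """Arrange side*side bytes, in order, into a side x side matrix (row-major)."""
    if len(sample) != side * side:
        raise ValueError("sample must contain exactly side*side bytes")
    return [list(sample[r * side:(r + 1) * side]) for r in range(side)]


def pack_idx_images(samples, side: int = 28) -> bytes:
    """IDX payload with 3 dimensions: count x side x side, unsigned bytes."""
    header = struct.pack(">BBBBIII", 0, 0, 0x08, 3, len(samples), side, side)
    return header + b"".join(samples)


def pack_idx_labels(labels) -> bytes:
    """IDX payload with 1 dimension: one unsigned byte per label."""
    header = struct.pack(">BBBBI", 0, 0, 0x08, 1, len(labels))
    return header + bytes(labels)


sample = bytes(range(256)) * 3 + bytes(16)   # 784 dummy bytes
matrix = to_matrix(sample)
images = pack_idx_images([sample])
labels = pack_idx_labels([5])                # 5 is an arbitrary class index
```

Because the IDX data section is simply the row-major byte sequence, the 784-byte sample can be written unchanged; the 28 × 28 shape lives only in the header.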

(2) CapsNet-based training model

The invention is based on the CapsNet algorithm, and takes the flow matrix and the label packaged by IDX as a data set to establish a service classification model and an application classification model for the encrypted flow. The algorithm mainly comprises convolution operation and dynamic routing, and the architecture diagram is shown in FIG. 3.

1) Convolution operation

The model first reads the preprocessed 28 × 28 traffic matrices and normalizes them. In the ReLU convolutional layer, 256 convolution kernels of size 9 × 9 are applied to each traffic matrix with a stride of 1, generating 256 feature matrices of size 20 × 20. The second convolutional layer, PrimaryCaps, then serves as the input layer of the capsules and constructs the vector structure. PrimaryCaps performs 8 convolution operations with different weights on the 256 feature matrices; each uses 32 convolution kernels of size 9 × 9 with a stride of 2. The result is 6 × 6 × 32 8-dimensional vectors, that is, activity vectors, each of which is a capsule unit composed of 8 ordinary convolution units.
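The layer sizes quoted above follow from the usual output-size formula for an unpadded convolution, which the short sketch below checks. The absence of padding is an assumption (it is what makes the 20 × 20 figure come out), and `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(input_size: int, kernel_size: int, stride: int) -> int:
    """Output width of a convolution without padding."""
    return (input_size - kernel_size) // stride + 1


conv1 = conv_output_size(28, 9, 1)         # ReLU convolutional layer
primary = conv_output_size(conv1, 9, 2)    # PrimaryCaps layer
num_capsules = primary * primary * 32      # number of 8-dimensional capsules
print(conv1, primary, num_capsules)        # 20 6 1152
```

The PrimaryCaps grid is therefore 6 × 6 with 32 capsule channels, giving 1152 activity vectors in total.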

2) Dynamic routing

The third layer of the network, DigitCaps, delivers and updates the capsule inputs in two steps: affine transformation and dynamic routing. In the affine transformation, the activity vector $u_i$ output by the lower PrimaryCaps layer is multiplied by a weight matrix $W_{ij}$ to obtain the prediction vector $\hat{u}_{j|i}$. The input $s_j$ of a higher-level capsule is the weighted sum of the prediction vectors $\hat{u}_{j|i}$, defined as:

$$s_j = \sum_i c_{ij}\,\hat{u}_{j|i}, \qquad \hat{u}_{j|i} = W_{ij} u_i \tag{5}$$

where each activity vector $u_i$ has its own weight matrix $W_{ij}$; $W_{ij}$ is initialized with random numbers drawn from the standard normal distribution and updated through the loss function, and $c_{ij}$ is the coupling coefficient determined by iterative dynamic routing.

The dynamic routing mechanism aims to find the best path between the outputs of one capsule layer and the inputs of the next. One way of finding this "best path" is to search iteratively for the input that best matches the output: the matching degree is measured by the inner product of the higher-level capsule's output vector and the prediction vector (the input after affine transformation), and it is added directly to $b_{ij}$, from which $c_{ij}$ is computed. In the invention, the number of iterations is set to 3, chosen through repeated parameter tuning. The update formula of $c_{ij}$ in formula (5) is:

$$c_{ij} = \mathrm{softmax}(b_{ij}) \tag{6}$$

where $b_{ij}$ is the log prior probability that capsule $i$ is coupled to capsule $j$.

The length of a capsule's output vector represents the probability of belonging to a certain class, so its value must lie in [0, 1]. This is achieved by a squashing function, defined as:

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2}\,\frac{s_j}{\|s_j\|} \tag{7}$$

where $v_j$ is the output vector of capsule $j$ and $s_j$ is the input vector of capsule $j$.
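The routing procedure described in this section (weighted sum of prediction vectors, squashing, and agreement-based update of the coupling coefficients) can be sketched in plain Python. This follows the routing-by-agreement algorithm of the original CapsNet paper, which is assumed here to match the invention's procedure; the prediction vectors `u_hat` are toy values, and all function names are hypothetical.

```python
import math


def squash(s):
    """Squashing function: keeps the direction of s, maps its length into [0, 1)."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector of lower capsule i for upper capsule j."""
    n_lower, n_upper, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_upper for _ in range(n_lower)]      # routing logits b_ij
    v = []
    for _ in range(iterations):
        c = [softmax(row) for row in b]                # c_ij = softmax(b_ij) over j
        v = []
        for j in range(n_upper):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_lower))
                   for d in range(dim)]                # weighted sum of predictions
            v.append(squash(s_j))
        for i in range(n_lower):
            for j in range(n_upper):                   # agreement: inner product
                b[i][j] += sum(x * y for x, y in zip(u_hat[i][j], v[j]))
    return v


# Three lower capsules, two upper capsules, 2-dimensional toy prediction vectors.
u_hat = [
    [[1.0, 0.0], [0.0, 0.1]],
    [[0.9, 0.1], [0.1, 0.0]],
    [[1.0, 0.1], [0.0, 0.2]],
]
v = dynamic_routing(u_hat)
lengths = [math.hypot(*vec) for vec in v]
```

In this toy input all three lower capsules agree on upper capsule 0, so its output vector ends up much longer than that of capsule 1.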

$W_{ij}$ and the other convolution parameters of the whole network are updated through a loss function. We adopt the margin loss as the loss function, defined as:

$$L_c = T_c \max(0, m^+ - \|v_c\|)^2 + \lambda (1 - T_c) \max(0, \|v_c\| - m^-)^2 \tag{8}$$

where $c$ is the predicted class, $T_c$ is an indicator function that equals 1 when $c$ is predicted correctly and 0 otherwise, $m^+$ is the upper boundary of the vector length $\|v_c\|$, $m^-$ is the lower boundary of the vector length $\|v_c\|$, and $\lambda$ is the weighting coefficient of the second term. In addition, we scale down the reconstruction loss by 0.0005 so that it does not dominate the margin loss during training.
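A small numeric sketch of the margin loss follows. It uses the squared-hinge form of the original CapsNet paper, and the values m⁺ = 0.9, m⁻ = 0.1, and λ = 0.5 are that paper's defaults, assumed here because the source does not state them; `margin_loss` is a hypothetical name.

```python
def margin_loss(lengths, true_class, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum of per-class margin losses, given the capsule output lengths ||v_c||."""
    total = 0.0
    for c, length in enumerate(lengths):
        t_c = 1.0 if c == true_class else 0.0
        total += t_c * max(0.0, m_plus - length) ** 2               # true class too short
        total += lam * (1.0 - t_c) * max(0.0, length - m_minus) ** 2  # other class too long
    return total


confident = margin_loss([0.95, 0.05], true_class=0)   # both margins satisfied
uncertain = margin_loss([0.5, 0.5], true_class=0)     # 0.4**2 + 0.5 * 0.4**2
```

The loss is zero once the true class's capsule is longer than m⁺ and every other capsule is shorter than m⁻.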

The traffic matrix to be identified passes through CapsNet, which outputs N 16-dimensional vectors, where N is the total number of traffic classes. The length of each vector represents the probability that the traffic belongs to the corresponding class, and its direction represents the attributes of the traffic, including the position of fixed strings and the order of the packets. A softmax classifier then converts the N 16-dimensional vectors into the probability that the traffic matrix belongs to each class; the class with the highest probability is the predicted class of the traffic and the final output of the model.
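The final prediction step can be sketched as follows, assuming the softmax is applied to the lengths of the N output vectors (the source does not spell out the softmax input). The capsule outputs below are toy values and `predict` is a hypothetical name.

```python
import math


def predict(capsule_outputs):
    """Return (predicted class index, per-class probabilities) from N output vectors."""
    lengths = [math.sqrt(sum(x * x for x in v)) for v in capsule_outputs]
    m = max(lengths)
    exps = [math.exp(l - m) for l in lengths]       # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# N = 3 classes, each represented by a 16-dimensional vector (toy values).
outputs = [[0.1] * 16, [0.2] * 16, [0.05] * 16]
best, probs = predict(outputs)
```

Since softmax is monotonic, the predicted class is always the one whose capsule output vector is longest.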

(3) Identification of encrypted traffic and application and service classification

The identification and classification of the encrypted traffic are completed by using the model trained in the above steps, that is, for the traffic to be identified and classified, the traffic is firstly divided and converted into a traffic matrix, and then the traffic matrix is input into the trained model to obtain the class of the traffic, including: 1) service classification, 2) application classification.

(4) Comparison of Experimental results

To verify the validity of the invention, we used the ISCX VPN-non VPN dataset as raw data. It contains 150 raw traffic files covering 6 classes of regular encrypted traffic (Chat, Streaming, VoIP, etc.) and 6 classes of VPN traffic (VPNChat, VPNStreaming, VPNVoIP, etc.). In addition, 9 raw traffic files contain traffic generated by 5 different applications and captured through Tor software. Since Tor supports only encrypted links and TCP flows on the Internet, its traffic is difficult to track and analyze; we therefore extract these files to implement application classification of Tor traffic. Finally, the effectiveness of the invention is evaluated by comparison with existing methods on four metrics: accuracy, precision, recall, and F1 value.

Specifically, we divided the experiment into three parts: 1) evaluating and comparing the effectiveness of deleting the MAC and IP addresses and of the two-stage segmentation mechanism in data preprocessing; 2) evaluating and comparing the effectiveness of SPCaps in the encrypted traffic service classification task; 3) evaluating and comparing the effectiveness of SPCaps in the encrypted traffic application classification task.
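For reference, the evaluation metrics can be computed from true and predicted labels as in the sketch below. Accuracy is included on the assumption that it is the fourth metric used in the evaluation; the label lists are made-up toy data, not results from the dataset, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 value for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["Chat", "Chat", "Voip", "Voip", "Email"]
y_pred = ["Chat", "Voip", "Voip", "Voip", "Email"]
acc = accuracy(y_true, y_pred)
precision, recall, f1 = per_class_metrics(y_true, y_pred, "Voip")
```

In this toy example one Chat sample is misclassified as Voip, which lowers the precision of Voip and the recall of Chat while leaving the recall of Voip at 1.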

1) Preprocessing results

We use the above-mentioned original traffic conversion steps to preprocess the ISCX VPN-non VPN dataset, and after performing Pcap-Sessions segmentation, we count the byte size distribution of the session traffic for service classification as shown in fig. 4.

It can be seen that the size distribution of session traffic is highly unbalanced: over 50% of the sessions in these 12 traffic classes are smaller than 0.5 KB, and most of them are irrelevant to the classification task. In particular, more than 80% of the Chat, Email, File, and Voip sessions are smaller than 0.2 KB. The size distribution of the session traffic thus confirms the necessity and reasonableness of Session-Packets segmentation in the preprocessing process. According to equation (4), we set the maximum number of packets per traffic group to 16 in the Session-Packets segmentation step. Finally, for the encrypted traffic service classification task, the category names, the applications included, and the totals are summarized in Table 1.

TABLE 1 Sample content for encrypted traffic service classification

Category	Applications	Total
Chat	AIM, Facebook, Hangouts, ICQ, Skype	11365
Email	Email, Gmail	12822
File	Ftps, SCP, Sftp, Skype	19553
P2P	Torrent	60000
Streaming	Facebook, Hangouts, Netflix, Skype, Spotify, Vimeo, YouTube	21273
Voip	Facebook, Hangouts, Skype, Voipbuster	21000
VPNChat	AIM, Facebook, Hangouts, ICQ, Skype	13710
VPNEmail	Email	2890
VPNFile	Ftps, Sftp, Skype	17528
VPNP2P	Bittorrent	6000
VPNStreaming	Facebook, Netflix, Spotify, Vimeo, YouTube	12000
VPNVoip	Hangouts, Skype, Voipbuster	14805

2) Preprocessing comparison

In the conversion of the original traffic, we propose deleting the MAC and IP addresses to avoid overfitting, and we propose Session-Packets segmentation to segment traditional session traffic a second time. In addition, to demonstrate that CapsNet is better suited to traffic classification than 1dCNN, we ran each experiment with both neural network algorithms. We therefore performed six different classification tasks for encrypted traffic services on the ISCX VPN-nonVPN dataset; the experimental results are shown in Table 2.

Table 2. Results of the preprocessing comparison experiments

The results show that, for both 1dCNN and CapsNet, the proposed deletion of MAC and IP addresses and the Session-Packets segmentation yield better classification performance. The comparison of the two neural networks also shows that CapsNet achieves higher classification accuracy and F1 values than 1dCNN.
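The address-removal step in the preprocessing can be illustrated with a short sketch. It makes several assumptions that are not stated in the text: packets are untagged Ethernet II frames carrying IPv4, and the addresses are overwritten with zeros rather than physically removed, so that the remaining byte offsets stay unchanged. The function name `mask_addresses` is hypothetical.

```python
ETH_HEADER_LEN = 14      # destination MAC (6) + source MAC (6) + EtherType (2)
ETHERTYPE_IPV4 = 0x0800
IPV4_MIN_HEADER_LEN = 20
IPV4_SRC_OFFSET = 12     # source address offset inside the IPv4 header


def mask_addresses(frame: bytes) -> bytes:
    """Return a copy of an Ethernet frame with MAC and IPv4 addresses zeroed.

    Assumption: untagged Ethernet II framing; VLAN tags and IPv6 are not handled.
    """
    out = bytearray(frame)
    if len(out) < ETH_HEADER_LEN:
        return bytes(out)
    out[0:12] = bytes(12)  # destination MAC, then source MAC
    ethertype = int.from_bytes(out[12:14], "big")
    if ethertype == ETHERTYPE_IPV4 and len(out) >= ETH_HEADER_LEN + IPV4_MIN_HEADER_LEN:
        start = ETH_HEADER_LEN + IPV4_SRC_OFFSET
        out[start:start + 8] = bytes(8)  # source IP, then destination IP
    return bytes(out)


# Hypothetical frame: MACs, EtherType, a minimal IPv4 header, and a payload.
frame = (b"\xaa" * 6 + b"\xbb" * 6 + b"\x08\x00"
         + b"\x45" + b"\x00" * 11 + bytes([10, 0, 0, 1]) + bytes([10, 0, 0, 2])
         + b"hello")
masked = mask_addresses(frame)
```

Masking in place is one possible design choice; it keeps every sample the same length as the captured packet, whereas physically deleting the fields would shift all later bytes.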

3) Comparison of encrypted traffic service classification

To evaluate and compare the effectiveness of SPCaps in encrypted traffic service classification, we performed experiments on the 12 traffic categories in ISCX VPN-nonVPN. As shown in Table 3, the accuracy reaches 99.1%, and the precision and recall of every category exceed 97%.

Table 3. Experimental results for encrypted traffic service classification

Next, we compare SPCaps with the existing baseline methods for encrypted traffic service classification; the comparison results are shown in Table 4. The results show that SPCaps achieves better classification performance than the baselines and meets the standard for practical application.

Table 4. Comparison of SPCaps and the baseline methods for encrypted traffic service classification

Method	Input form	Recall (%)	Precision (%)	F1 (%)
SPCaps	Session-Packets	99.3	99.3	99.3
1dCNN	Session	90.6	88.9	89.7
SAE	Deep Packets	92	92	92
1dCNN	Deep Packets	94	93	93
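As a reading aid for Table 4, the F1 value can be related to precision and recall through the usual harmonic mean, F1 = 2PR / (P + R). The sketch below applies that formula to the aggregate values in the table. It assumes the reported F1 values follow this definition; F1 values averaged per class before reporting may differ slightly from what the aggregate precision and recall give.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F1) taken from Table 4, in percent.
table4 = {
    "SPCaps / Session-Packets": (99.3, 99.3, 99.3),
    "1dCNN / Session": (90.6, 88.9, 89.7),
    "SAE / Deep Packets": (92.0, 92.0, 92.0),
    "1dCNN / Deep Packets": (94.0, 93.0, 93.0),
}

# Difference between the recomputed and the reported F1 for each row.
differences = {
    name: abs(f1_score(p, r) - reported)
    for name, (r, p, reported) in table4.items()
}
for name, diff in differences.items():
    print(f"{name}: |recomputed - reported| = {diff:.2f}")
```

For these four rows the recomputed values stay within about half a percentage point of the reported ones, which is consistent with rounding of the published figures.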

4) Comparison of encrypted traffic application classifications

To evaluate the effectiveness of SPCaps in the application classification task for Tor traffic, we performed experiments on the traffic of 5 different applications captured over Tor in ISCX VPN-nonVPN; the experimental results are shown in Table 5. The results show that the accuracy of SPCaps reaches 99.8% in the application classification task for Tor traffic.

Table 5. Experimental results for encrypted traffic application classification

Next, we compare SPCaps with the existing baseline methods for encrypted traffic application classification; the comparison results are shown in Table 6. The results show that SPCaps substantially outperforms the baselines in Tor application classification.

Table 6. Comparison of SPCaps and the baseline methods for encrypted traffic application classification

Method	Recall (%)	Precision (%)	F1 (%)
SPCaps	99.4	99.5	99.5
SAE	57	44	30
1dCNN	35	40	36

The experiments show that SPCaps achieves effective encrypted traffic classification, and the experimental results meet the standard for practical application.

The above embodiments merely express implementation modes of the present invention. Their description is specific, but it should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention. Therefore, the protection scope of the present invention shall be subject to the appended claims.

Claims

1. A service and application classification method for encrypted traffic comprises the following steps:

1) segmenting continuous flow to be processed into a plurality of session flows according to session granularity;

2) segmenting each processed session flow according to the data packet granularity, so that each session flow is divided into a plurality of flow groups, wherein the number of data packets in each flow group does not exceed a set maximum value;

3) unifying the sizes of the flow groups, converting each flow group into a flow matrix, and packaging the flow matrix and a label thereof into an IDX flow file;

4) training a CapsNet model by using the IDX flow file to obtain an identification model with automatic feature selection capability;

5) and for an encrypted flow to be identified, dividing the encrypted flow, converting the encrypted flow into a flow matrix, and inputting the flow matrix into the identification model to obtain the service type and the application class of the flow to be identified.

2. The method of claim 1, wherein the j-th traffic group in the i-th session traffic $S_i$ is $G_{ij}$, where $G_{ij} = \{p^1 = (x^1, b^1, t^1), \ldots, p^i = (x^i, b^i, t^i), \ldots, p^m = (x^m, b^m, t^m)\}$,

m is the number of data packets in $G_{ij}$, C is the set maximum number of data packets, $p^i = (x^i, b^i, t^i)$ is the i-th data packet of session traffic $S_i$, $x^i$ is the five-tuple of the i-th data packet in $G_{ij}$, $b^i$ is the byte length of the i-th data packet in $G_{ij}$, $t^i$ is the start time of the i-th data packet in $G_{ij}$, and $|S_i|$ is the total number of data packets in session traffic $S_i$.

3. The method of claim 2,

wherein $L_{sample}$ represents the byte length of the file storing a traffic group, $L_{header}$ represents the header byte length of the file storing the traffic group, and $L_{packet}$ represents the byte length of a data packet.

4. The method of claim 1, wherein data cleaning is performed on each session flow to delete the MAC addresses and IP addresses, and step 2) is then performed.

5. The method of claim 1, wherein converting the traffic groups into the traffic matrix is by: converting the one-dimensional flow coding sequence of the flow group into a two-dimensional flow matrix; the traffic group of uniform size is 784 bytes, and the converted traffic matrix is a 28 × 28 traffic matrix.
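The conversion of a traffic group into a 28 × 28 traffic matrix can be sketched as follows. The sketch assumes that sizes are unified by truncating longer byte sequences and zero-padding shorter ones, which the claim does not specify; the function name `to_traffic_matrix` is hypothetical.

```python
from typing import List

SIDE = 28
TARGET_LEN = SIDE * SIDE  # 784 bytes


def to_traffic_matrix(data: bytes) -> List[List[int]]:
    """Unify a byte sequence to 784 bytes and reshape it row by row to 28 x 28.

    Assumption: truncate when too long, pad with 0x00 when too short.
    """
    fixed = data[:TARGET_LEN].ljust(TARGET_LEN, b"\x00")
    return [list(fixed[r * SIDE:(r + 1) * SIDE]) for r in range(SIDE)]


# Hypothetical inputs: one very short group and one longer than 784 bytes.
short_matrix = to_traffic_matrix(b"\x01\x02\x03")
long_matrix = to_traffic_matrix(bytes(range(256)) * 4)
```

Each matrix entry is a byte value in the range 0 to 255, so the result has the same shape as a 28 × 28 grayscale image.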

6. The method of claim 1, wherein the method of training the CapsNet model using the IDX traffic file is: firstly, performing convolution operation on each flow matrix by utilizing a first convolution layer to generate a plurality of characteristic matrixes; then carrying out convolution operation on the feature matrix to generate a plurality of activity vectors; then, each activity vector is multiplied by the corresponding weight matrix to obtain a prediction vector, and the prediction vectors of the lower layer are weighted and summed to be used as the input of the high-layer capsule.
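The last operation in this claim, multiplying each activity vector by its weight matrix and summing the weighted prediction vectors, can be illustrated in a few lines. This is a simplified sketch and not the claimed training procedure: the coupling coefficients are fixed by hand here, whereas a CapsNet normally derives them by dynamic routing, and the squash nonlinearity applied afterwards is omitted. All names and numeric values are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mat_vec(w: Matrix, u: Vector) -> Vector:
    """Multiply a weight matrix by an activity vector to get a prediction vector."""
    return [sum(w_kj * u_j for w_kj, u_j in zip(row, u)) for row in w]


def capsule_input(activity: List[Vector], weights: List[Matrix],
                  coupling: List[float]) -> Vector:
    """Weighted sum of the lower-layer prediction vectors for one higher-layer capsule."""
    predictions = [mat_vec(w, u) for w, u in zip(weights, activity)]
    dim = len(predictions[0])
    return [sum(c * p[k] for c, p in zip(coupling, predictions)) for k in range(dim)]


# Two hypothetical lower-layer capsules with 2-dimensional activity vectors,
# each mapped to a 3-dimensional prediction vector.
activity = [[1.0, 0.0], [0.0, 2.0]]
weights = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
]
coupling = [0.5, 0.5]  # fixed here; normally produced by routing
s = capsule_input(activity, weights, coupling)
```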

7. A service and application classification system for encrypted traffic, characterized by comprising a traffic preprocessing module, a model training module, and an encrypted traffic identification module; wherein:

8. The system of claim 7, wherein the j-th traffic group in the i-th session traffic $S_i$ is $G_{ij}$, where $G_{ij} = \{p^1 = (x^1, b^1, t^1), \ldots, p^i = (x^i, b^i, t^i), \ldots, p^m = (x^m, b^m, t^m)\}$,

9. The system of claim 8,

10. The system of claim 7, wherein the traffic preprocessing module performs data cleaning on each session traffic to remove the MAC addresses and IP addresses, and then segments each session traffic according to the data packet granularity.