CN117155595A - Malicious encryption traffic detection method and model based on visual attention network - Google Patents


Info

Publication number
CN117155595A
CN117155595A (application CN202310530512.8A)
Authority
CN
China
Prior art keywords
traffic
processing
network
convolution
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310530512.8A
Other languages
Chinese (zh)
Inventor
汤艳君
薛秋爽
王世航
王子晨
王子昂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Criminal Police University
Original Assignee
China Criminal Police University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Criminal Police University
Priority to CN202310530512.8A
Publication of CN117155595A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L 63/1425 Traffic logging, e.g. anomaly detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • H04L 43/00 Arrangements for monitoring or testing data switching networks
    • H04L 43/04 Processing captured monitoring data, e.g. for logfile generation
    • H04L 43/045 Processing captured monitoring data, e.g. for logfile generation for graphical visualisation of monitoring data
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Abstract

The application relates to the technical field of network security, in particular to a malicious encrypted traffic detection method and model based on a visual attention network. The method comprises the following steps: preprocessing an experimental dataset to obtain a preprocessed dataset; building a visual attention network; training the visual attention network on the preprocessed dataset to obtain a malicious encrypted traffic detection model; acquiring network traffic to be detected and preprocessing it to obtain preprocessed network traffic; and inputting the preprocessed network traffic into the malicious encrypted traffic detection model to obtain a malicious/normal traffic classification result. By converting traffic into a two-dimensional image that visually reflects its structure, the traffic can be presented intuitively; the converted image is fed into the detection model for feature extraction and traffic classification, so that local information is extracted effectively and problems such as long-range dependency are addressed.

Description

Malicious encryption traffic detection method and model based on visual attention network
Technical Field
The application relates to the technical field of network security, in particular to a malicious encryption traffic detection method and model based on a visual attention network.
Background
With the frequent occurrence of network privacy and security incidents, public concern about the confidentiality and security of data transmission is growing, and encrypted transmission is increasingly preferred over unencrypted transmission. Encrypted transmission plays a vital role in protecting national and public privacy, but it also makes it easier for attackers to use encrypted traffic to bypass existing network security products, transmit information, and engage in criminal activity, so that malicious traffic becomes hard to detect once encrypted. How to quickly detect and respond to unknown threats in encrypted traffic is therefore a key problem in network security situation awareness; it also provides a basis for determining whether traffic is malicious in cybercrime cases handled by public security departments, laying a foundation for conviction and sentencing.
Current detection methods for malicious encrypted traffic fall into four categories: port-number-based methods, deep-packet-inspection-based methods, machine-learning-based methods, and deep-learning-based methods. Port-number-based methods identify the protocol class from the source and destination port numbers of TCP or UDP; for example, HTTP traffic typically uses port 80 and SSH traffic port 22. However, this approach only infers the protocol type, and with the spread of dynamic port numbers, tunneling techniques, and proprietary encryption methods, the protocol can be hidden, so port-number-based detection now performs poorly. Deep-packet-inspection methods take the application-layer payload as the inspection object and analyze field characteristics to achieve fine-grained traffic analysis. They require knowledge of the protocol and field semantics, demand frequent maintenance of the feature library, are applicable only to unencrypted traffic, and incur high computational cost.
Consequently, a new generation of machine-learning-based methods emerged that rely on statistical or time-series features and can detect both encrypted and unencrypted traffic. In addition, deep learning, as an end-to-end approach, can learn the nonlinear relationship between raw input and output without a separate feature-selection step. However, performance varies with the model structure and construction method.
In the prior art, deep-learning-based detection of malicious traffic has achieved results, but problems remain: encrypted traffic cannot be detected, traffic is not presented intuitively, local image information is ignored, and long-range dependencies are not captured.
Disclosure of Invention
The application provides a malicious encrypted traffic detection method and model based on a visual attention network, which address the problems of existing deep-learning-based malicious traffic detection techniques: inability to detect encrypted traffic, non-intuitive traffic presentation, neglect of local image information, and failure to capture long-range dependencies.
The first technical solution of the application is a malicious encrypted traffic detection method based on a visual attention network, comprising the following steps:
S1: determining an experimental dataset comprising malicious traffic and normal traffic, and preprocessing it to obtain a preprocessed dataset comprising a number of grayscale images of a preset size;
S2: building a visual attention network comprising a self-attention layer and a feedforward neural network layer connected in sequence;
the self-attention layer includes an LKA module, which comprises a self-attention mechanism and several different convolution structures, connected in sequence, obtained by decomposing a large-kernel convolution;
S3: training the visual attention network on the preprocessed dataset to obtain a malicious encrypted traffic detection model;
S4: acquiring network traffic to be detected and preprocessing it to obtain preprocessed network traffic; inputting the preprocessed network traffic into the malicious encrypted traffic detection model to obtain a malicious/normal traffic classification result for the traffic under test.
The second technical solution of the application is a malicious encrypted traffic detection model based on a visual attention network, comprising a preprocessing module and a VAN module;
the preprocessing module is used to acquire the network traffic to be detected and preprocess it to obtain preprocessed network traffic;
the VAN module contains a malicious encrypted traffic detection model and uses it to detect the preprocessed network traffic, yielding a malicious/normal traffic classification result for the traffic under test.
Beneficial effects:
The application converts encrypted traffic into a two-dimensional image that visually reflects the traffic, so that encrypted traffic can be detected and displayed intuitively; the converted image is input into the malicious encrypted traffic detection model for feature extraction and traffic classification. In the LKA module of the visual attention network, the large convolution kernel is decomposed and combined with an attention mechanism, so that local information is extracted effectively, long-range dependencies are captured, important features are highlighted, and feature discrimination is enhanced, improving the detection results of the model.
In summary, converting traffic into images and building the detection model around the LKA module solves the problems of existing deep-learning-based malicious traffic detection: inability to detect encrypted traffic, non-intuitive presentation, neglect of local image information, and failure to capture long-range dependencies.
Drawings
In order to more clearly illustrate the technical solution of the present application, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a flow chart of a malicious encrypted traffic detection method based on a visual attention network in an embodiment of the application;
FIG. 2 is a logic diagram of a malicious encrypted traffic detection method based on a visual attention network according to an embodiment of the present application;
FIG. 3 is an exemplary diagram of grayscale images obtained by preprocessing in an embodiment of the present application;
FIG. 4 is a schematic diagram of a combined convolution structure according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a processing structure according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a self-attention mechanism in a self-attention layer according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a control structure in an ablation experiment;
FIG. 8 is a graph comparing experimental results of a VAN model and a control structure in an ablation experiment;
FIG. 9 is a diagram of an example confusion matrix for each model in a reference model comparison experiment;
FIG. 10 is a comparison of the F1-value evaluation results of each model on the Neris and Virut classes in the reference model comparison experiment;
FIG. 11 is a schematic diagram of a malicious encrypted traffic detection model based on a visual attention network according to an embodiment of the application;
In the figures: 1, preprocessing module; 2, VAN module.
Detailed Description
Reference will now be made in detail to the embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The embodiments described below do not represent all embodiments consistent with the application; they are merely examples of systems and methods consistent with aspects of the application as set forth in the claims.
(1) Embodiment 1: a malicious encrypted traffic detection method based on a visual attention network, comprising the following steps:
the application firstly provides a malicious encryption traffic detection method based on a visual attention network, as shown in fig. 1 and 2, fig. 1 is a flow diagram of the malicious encryption traffic detection method based on the visual attention network in the embodiment of the application, and fig. 2 is a logic diagram of the malicious encryption traffic detection method based on the visual attention network in the embodiment of the application, including:
s1: determining an experimental data set comprising malicious traffic and normal traffic and preprocessing the experimental data set to obtain a preprocessed data set comprising a plurality of gray-scale pictures of preset sizes.
Wherein, step S1 includes:
S11: the USTC-TFC2016 dataset, comprising 10 classes of malicious traffic and 10 classes of normal traffic, is determined as the experimental dataset.
Specifically, the data used are part of USTC-TFC2016. The dataset contains two parts: 10 classes of malicious traffic from the CTU dataset, collected from real environments by researchers at CTU University between 2011 and 2015, and 10 classes of normal application traffic collected with IXIA BPS professional simulation equipment. The data format of USTC-TFC2016 is PCAP, and the experiments take only application-layer session data as the study object. Table 1 shows the dataset used: normal traffic on the left, malicious traffic on the right, and the data column gives the number of samples generated after preprocessing.
Table 1. Traffic types and corresponding sample counts of the USTC-TFC2016 dataset
The packet header of network traffic reflects attributes of the traffic source. According to the information contained in the header, traffic can be divided at five granularities: by TCP connection, where the beginning and end of each connection are identified from TCP flags in the header such as SYN, FIN, and RST; by flow, where a flow is a group of packets sharing the same five-tuple (source IP address, source port, destination IP address, destination port, and protocol), with flow timeouts and resets also treated as flow termination; by session (bidirectional flow), which can be viewed as two flows in opposite directions, i.e., the source and destination parts of the five-tuple may be interchanged; by service, which generally refers to all traffic generated by the same IP and port; and by host, i.e., classification according to the bidirectional traffic of a host, including traffic it generates and traffic it receives.
By dividing the original traffic of USTC-TFC2016 in units of sessions, the interactive traffic between the two communicating parties is preserved, so that more interaction information is carried. During processing, the five-tuple information in each packet must be identified; with this information, packets can be assigned to their corresponding session, realizing session division.
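For illustration, the following sketch shows session division by a normalized five-tuple; it assumes packets have already been parsed into dictionaries with hypothetical keys (the application itself performs this step with the SplitCap tool, described below):

```python
from collections import defaultdict

def session_key(pkt):
    """Normalize a five-tuple so both directions map to the same session."""
    src = (pkt["src_ip"], pkt["src_port"])
    dst = (pkt["dst_ip"], pkt["dst_port"])
    # Order the endpoints canonically so A->B and B->A share one key.
    a, b = sorted([src, dst])
    return (a, b, pkt["protocol"])

def split_into_sessions(packets):
    """Group parsed packets into bidirectional sessions."""
    sessions = defaultdict(list)
    for pkt in packets:
        sessions[session_key(pkt)].append(pkt)
    return sessions
```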
S12: preprocessing including flow segmentation processing, flow cleaning processing, address confusion processing, unified size processing, picture generation processing and marking data processing is sequentially carried out on the experimental data set, and a preprocessing data set including 20 gray-scale pictures with preset sizes is obtained.
Wherein, step S12 includes:
s121: and carrying out flow segmentation processing on the original PCAP flow files through a split Cap tool sequentially on the experimental data set comprising 20 original PCAP flow files.
Specifically, an original Pcap flow file in a data set is segmented by using a split cap tool, quintuple information of the flow is used as a unique identifier, only application layer data in the data set is researched, and the application layer data in the data set is segmented into small units by taking a session as a unit.
S122: and removing repeated files and removing traffic cleaning without application layer data according to the files subjected to the traffic segmentation processing.
Specifically, the repeated file generated after the flow segmentation and the flow units without application layer data are cleaned, only the flow small units with complete communication session marks are reserved, and the flow small units with complete information are saved, so that the subsequent model learns the characteristics to distinguish malicious flow from non-malicious flow.
S123: address obfuscation processing including uniformly deleting IP addresses and using randomly generated IP addresses is performed on the file after the flow cleaning processing.
Specifically, the Mac address of the data link layer and the IP address of the IP layer are processed, and these pieces of information with specific environment identifiers may interfere with the traffic detection, resulting in a result that is fit too and loses the authenticity of the detection, so that the IP address is deleted uniformly and randomly mixed by using the randomly generated IP address, so that the address does not affect the detection result.
S124: and carrying out unified size processing including traffic cutting or traffic filling for the file subjected to the address confusion processing, wherein the unified size processing is used for generating a preset number of bytes.
Specifically, the segmented flow is different in information and data, the number and the size of data packets are different, and the deep learning model requires that the input data size is of a fixed length, so that the flow is cut or filled to be of the same length for subsequent processing. Referring to the original paper data preprocessing mode of the USTC-TFC data set, the traffic is segmented into 784 bytes in size. The traffic is truncated for more than 784 bytes and is satisfied with 784 bytes in length for subsequent fills of 0x00 less than 784 bytes.
S125: and performing picture generation processing for generating gray pictures with preset sizes by converting the gray pictures for the files subjected to the uniform size processing.
Specifically, the flow after uniform size is converted into a gray scale of 28×28 and stored for subsequent visual analysis. As shown in fig. 3, fig. 3 is an exemplary diagram of various gray-scale patterns obtained by preprocessing in the embodiment of the present application.
S126: and carrying out marking data processing classified under various label folders on the file subjected to the picture generation processing to obtain a preprocessing data set comprising 20 gray-scale pictures with preset sizes.
Specifically, 20 gray-scale images are stored under various label folders.
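A minimal sketch of steps S124 and S125, assuming a session's application-layer bytes are available as a Python bytes object (numpy and Pillow are used here only for illustration):

```python
import numpy as np
from PIL import Image

SESSION_LEN = 784  # 28 * 28 bytes, as in the USTC-TFC2016 preprocessing

def to_grayscale_image(session_bytes: bytes, out_path: str) -> None:
    """Truncate/pad a session to 784 bytes and save it as a 28x28 grayscale image."""
    data = session_bytes[:SESSION_LEN].ljust(SESSION_LEN, b"\x00")  # pad with 0x00
    pixels = np.frombuffer(data, dtype=np.uint8).reshape(28, 28)
    Image.fromarray(pixels, mode="L").save(out_path)
```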
S2: a visual attention network is built that includes a self-attention layer and a feedforward neural network layer connected in sequence.
The self-attention layer includes an LKA module, the LKA module including: a plurality of different convolution structures which are sequentially connected and are obtained by large-kernel convolution decomposition and a self-attention mechanism.
Wherein, step S2 includes:
S21: the K×K large-kernel convolution is decomposed to obtain a combined convolution structure comprising, connected in sequence, a ⌈K/d⌉×⌈K/d⌉ depth-wise dilated convolution with dilation rate d, a (2d−1)×(2d−1) depth-wise convolution, and a 1×1 convolution.
Wherein, step S21 includes:
S211: the decomposition structure of the large-kernel convolution is determined based on the output formula of the LKA module, the parameter count p(K,d), and the floating-point operations F(K,d).
The output formula of the LKA module is as follows:
Attention = Conv1×1(DW-D-Conv(DW-Conv(F)));
Output = Attention ⊗ F;
where F denotes the input feature map, F ∈ R^(C×H×W); Attention denotes the attention map, Attention ∈ R^(C×H×W); and ⊗ denotes the element-wise product.
The parameter count p(K,d) is calculated as:
p(K,d) = C × [(2d−1)² + ⌈K/d⌉² + C];
and the floating-point operations F(K,d) as:
F(K,d) = p(K,d) × H × W;
where d denotes the dilation rate and K the kernel size of the large-kernel convolution.
S212: based on the decomposition structure, the K×K large-kernel convolution is decomposed to obtain a combined convolution structure comprising, connected in sequence, a ⌈K/d⌉×⌈K/d⌉ depth-wise dilated convolution with dilation rate d, a (2d−1)×(2d−1) depth-wise convolution, and a 1×1 convolution.
Specifically, as shown in fig. 4, which is a schematic diagram of the combined convolution structure in an embodiment of the present application, the large-kernel convolution is divided into three parts: a spatial local convolution (the depth-wise convolution), a spatial long-range convolution (the depth-wise dilated convolution), and a channel convolution (the 1×1 convolution).
Parameters and floating-point operations (FLOPs) are used to evaluate the decomposition; to simplify the notation, bias is omitted in the calculation. Assuming the input and output feature maps have the same size H×W×C, the formulas for p(K,d) and F(K,d) above show that the savings ratio for FLOPs and for parameters is the same.
To determine the specific decomposition structure, these formulas are combined with the output formula of the LKA module. In the embodiment of the application, K is set to 21; for K = 21, p(K,d) is minimized at d = 3, yielding a decomposition of the large-kernel convolution into a 5×5 depth-wise convolution, a 7×7 depth-wise dilated convolution with dilation rate 3, and a 1×1 convolution.
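The minimization can be checked numerically; the following sketch evaluates p(K,d) for K = 21 (C is set to 64 purely for illustration):

```python
import math

def p(K: int, d: int, C: int = 64) -> int:
    """Parameter count of the decomposed large-kernel convolution (bias omitted):
    (2d-1)^2 depth-wise conv + ceil(K/d)^2 depth-wise dilated conv + 1x1 conv."""
    return C * ((2 * d - 1) ** 2 + math.ceil(K / d) ** 2 + C)

K = 21
for d in range(1, 8):
    print(d, p(K, d))
# d = 3 gives the smallest count: (2*3-1)^2 + ceil(21/3)^2 = 25 + 49 = 74 per
# channel (plus the channel-mixing term C), versus 130 at d = 2 and 85 at d = 4.
```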
S22: and (5) building a visual attention network.
The visual attention network comprises a plurality of processing stages, each processing stage comprises a plurality of processing structures which are sequentially stacked, and each processing structure comprises a self-attention layer and a feedforward neural network layer which are sequentially connected in sequence.
The self-attention layer includes an LKA module, the LKA module including: the convolution structure and the self-attention mechanism are combined.
Wherein, step S22 includes:
s221: and building an attention unit comprising a BN layer, a self-attention layer and an addition residual error connection layer which are sequentially connected.
And constructing a feedforward processing unit comprising a BN layer, a feedforward neural network layer and an addition residual error connecting layer which are sequentially connected.
The self-attention layer includes: 1×1 convolution, GELU activation function, LKA module, and 1×1 convolution.
The feedforward neural network layer is a multilayer perceptron, and the multilayer perceptron comprises: 1x1 convolution, 3 x 3 convolution with degree of expansion d, GELU activation function, and 1x1 convolution.
S222: a attention unit and a feedforward processing unit are sequentially integrated into a processing structure.
S223: and setting up a first processing stage, a second processing stage, a third processing stage and a fourth processing stage in sequence according to network depth setting rules comprising 3,5 and 2 in sequence of the number of processing structures.
S224: and sequentially connecting the first processing stage, the second processing stage, the third processing stage and the fourth processing stage, and setting a sampling module before the first processing stage to obtain a visual attention network.
Wherein in the first processing stage, the second processing stage, the third processing stage and the fourth processing stage, the resolution of the output space is in turnAnd->And the ratio between the spatial dimension and the embedded vector dimension in the self-attention mechanism in the LKA module is 8,4, and 4 in order, and the embedded vector dimension in the self-attention mechanism in the LKA module is 64,128,160, and 256 in order.
Specifically, the visual attention network (Visual Attention Network, VAN) is a neural network built on LKA (Large Kernel Attention).
The VAN has a simple hierarchical structure, i.e., a series of four stages: the first, second, third, and fourth processing stages in the embodiment of the application.
As the processing stages progress, the output spatial resolution decreases: H/4×W/4, H/8×W/8, H/16×W/16, and H/32×W/32, respectively, where H and W denote the height and width of the input image. As the resolution decreases, the number of output channels increases, as shown in Table 2.
Table 2. Parameter settings of the processing stages in the VAN
In Table 2, mlp_ratios denotes the ratio between the spatial dimension and the embedding-vector dimension at each processing stage; C denotes the embedding-vector dimension; L denotes the depth of each processing stage, i.e., the number of processing structures.
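A configuration sketch of these stage settings (hypothetical variable names):

```python
# VAN stage configuration as described in this embodiment (assumed names).
van_config = {
    "embed_dims": [64, 128, 160, 256],   # C per stage
    "mlp_ratios": [8, 8, 4, 4],          # spatial dim / embedding dim ratio
    "depths":     [3, 3, 5, 2],          # L: processing structures per stage
    "down_strides": [4, 2, 2, 2],        # yields H/4, H/8, H/16, H/32 resolutions
}
```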
As shown in fig. 5, which is a schematic diagram of a processing structure according to an embodiment of the present application, in each processing stage of the VAN the number of processing structures is determined by the depth, i.e., each stage stacks the processing structure L times.
Each processing structure comprises two units: an attention unit and a feedforward processing unit.
The attention unit comprises a BN layer, a self-attention layer, and an additive residual connection layer, connected in sequence; the self-attention layer includes the LKA module.
The feedforward processing unit comprises a BN layer, a feedforward neural network layer, and an additive residual connection layer, connected in sequence.
In both the attention unit and the feedforward processing unit, batch normalization (BN) is applied first, and an additive residual connection is applied last.
In the self-attention layer of the attention unit, the LKA module combines the advantages of the self-attention mechanism and the large-kernel convolution, decomposing the large-kernel convolution to capture long-range relationships. After decomposition, the importance of each point is estimated by the attention mechanism and an attention map is generated, so the output formula of the LKA is:
Attention = Conv1×1(DW-D-Conv(DW-Conv(F)));
Output = Attention ⊗ F;
where F denotes the input feature map, F ∈ R^(C×H×W); Attention denotes the attention map, whose values represent the importance of each feature, Attention ∈ R^(C×H×W); and ⊗ denotes the element-wise product.
The self-attention mechanism maps the input sequence into a key matrix K, a query matrix Q, and a value matrix V, as shown in fig. 6, then computes similarity scores between them to obtain the attention weight of each part, and uses these weights to form a weighted sum as the final representation. It is designed for one-dimensional sequences, and its calculation formula is:
Attention(Q, K, V) = softmax(QKᵀ/√d_k)V.
It is emphasized that, unlike common attention methods, LKA requires no additional normalization function such as sigmoid or softmax: the key property of an attention method is to adaptively adjust the output based on the input features, not to normalize the attention map.
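As a concrete illustration of the structures above, the following PyTorch sketch reimplements the LKA module and the surrounding self-attention layer under the K = 21, d = 3 decomposition; it is an illustrative reconstruction, not the patented implementation itself:

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """Large Kernel Attention: a 21x21 convolution decomposed as described above."""
    def __init__(self, dim: int):
        super().__init__()
        # 5x5 depth-wise convolution (spatial local).
        self.dw_conv = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        # 7x7 depth-wise dilated convolution, dilation 3 (spatial long-range).
        self.dw_d_conv = nn.Conv2d(dim, dim, 7, padding=9, groups=dim, dilation=3)
        # 1x1 convolution (channel mixing).
        self.pw_conv = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw_conv(self.dw_d_conv(self.dw_conv(x)))
        return x * attn  # element-wise product with the attention map

class SelfAttentionLayer(nn.Module):
    """1x1 conv -> GELU -> LKA -> 1x1 conv, as in the attention unit."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj_in = nn.Conv2d(dim, dim, 1)
        self.act = nn.GELU()
        self.lka = LKA(dim)
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj_out(self.lka(self.act(self.proj_in(x))))
```

The paddings (2 for the 5×5 kernel, 9 for the 7×7 kernel with dilation 3, whose effective receptive field is 19×19) preserve the spatial size, so the attention map matches the input feature map element for element.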
S3: the visual attention network is trained on the preprocessed dataset to obtain the malicious encrypted traffic detection model.
The preprocessed dataset comprises: a training dataset and a validation dataset.
And, step S3 includes:
S31: the visual attention network is trained on the training dataset with a cross-entropy loss function and the AdamW optimization algorithm; during training, an early-stopping mechanism applied to the validation dataset judges when training ends, yielding the malicious encrypted traffic detection model.
Specifically, the malicious encrypted traffic detection model in the embodiment of the application is trained with PyTorch on a GPU; the experimental environment is shown in Table 3.
Table 3. Training environment of the malicious encrypted traffic detection model
The parameters of the malicious encrypted traffic detection model are set as follows:
during data preprocessing, traffic is unified into 784-byte session units; in the detection model, the input image size is 28×28 and four stages are traversed, with per-stage embedding-vector dimensions [64, 128, 160, 256], per-stage ratios between the spatial dimension and the embedding-vector dimension [8, 8, 4, 4], and per-stage network depths [3, 3, 5, 2]. In the LKA module, the large-kernel convolution is decomposed into a 5×5 depth-wise convolution, a 7×7 depth-wise dilated convolution with dilation rate 3, and a 1×1 convolution.
The model uses cross entropy as the loss function and the AdamW optimization algorithm, with batch size 50 and learning rate 1×10⁻⁴. An early-stopping mechanism is set: whenever the validation loss of an epoch is smaller than that of the previous best epoch, the pretrained model of that epoch is saved; if the validation loss of ten consecutive epochs is never smaller than that of the currently saved pretrained model, training ends.
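A condensed sketch of this training procedure (illustrative only; VANModel, train_loader, and val_loader are assumed to be defined elsewhere):

```python
import torch
import torch.nn as nn

# `VANModel`, `train_loader`, and `val_loader` are assumed to exist.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = VANModel(num_classes=20).to(device)
criterion = nn.CrossEntropyLoss()                            # cross-entropy loss
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # AdamW, lr 1e-4

best_val_loss, bad_epochs, patience = float("inf"), 0, 10
while bad_epochs < patience:
    model.train()
    for images, labels in train_loader:                      # batches of 50
        optimizer.zero_grad()
        loss = criterion(model(images.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()

    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for images, labels in val_loader:
            val_loss += criterion(model(images.to(device)),
                                  labels.to(device)).item()

    if val_loss < best_val_loss:          # improved: save this epoch's model
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "pretrained_van.pt")
    else:                                 # ten stale epochs end the training
        bad_epochs += 1
```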
S4: and acquiring the network traffic to be detected and preprocessing the network traffic to be detected to obtain preprocessed network traffic. Inputting the preprocessed network traffic to a malicious encrypted traffic detection model to obtain a classification result of malicious traffic/normal traffic corresponding to the network traffic to be detected.
Wherein, step S4 includes:
s41: and acquiring the network traffic to be detected and preprocessing the network traffic to be detected to obtain preprocessed network traffic.
S42: inputting the preprocessed network traffic to the malicious encrypted traffic detection model.
The sampling module performs downsampling processing on the preprocessed network traffic to obtain an input sequence corresponding to the preprocessed network traffic.
The first processing stage, the second processing stage, the third processing stage and the fourth processing stage are sequentially processed for the input sequence based on pipeline communication processing principles.
The LKA module in the first processing stage, the second processing stage and the third processing stage performs feature extraction on the input sequence through a combined convolution structure to obtain a long-distance feature map, the LKA module further performs processing on the long-distance feature map through a self-attention mechanism to obtain attention map, the multi-layer perceptron performs mapping processing on the attention map through multi-layer current transformation to obtain a high-dimensional output sequence compared with the input sequence, and the high-dimensional output sequence is updated to be the input sequence capable of being input into the next processing stage.
And the multi-layer perceptron in the fourth processing stage outputs a high-dimensional output sequence, and linear function processing is carried out on the high-dimensional output sequence output by the multi-layer perceptron in the fourth processing stage, so that a classification result of malicious traffic/normal traffic corresponding to the network traffic to be detected is obtained.
Specifically, as shown in fig. 5, after the preprocessed network traffic enters the malicious encrypted traffic detection model, the model first downsamples the input, with the stride controlling the sampling rate. After downsampling, all other layers of a processing stage keep the same output size, i.e., the same spatial resolution and channel number. Then the L stacked groups of batch normalization, 1×1 convolution, GELU activation, LKA, and feedforward neural network (FFN) extract features; the values of L and the input/output dimensions are detailed in Table 2 above.
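An end-to-end inference sketch under the same assumptions as the previous snippets (a trained model object and 784-byte session bytes):

```python
import numpy as np
import torch

def classify_session(model, session_bytes: bytes, class_names) -> str:
    """Preprocess one session and return the predicted traffic class."""
    data = session_bytes[:784].ljust(784, b"\x00")            # truncate/pad
    pixels = np.frombuffer(data, dtype=np.uint8).reshape(1, 1, 28, 28)
    x = torch.from_numpy(pixels.astype(np.float32) / 255.0)   # scale to [0, 1]
    model.eval()
    with torch.no_grad():
        logits = model(x)
    return class_names[int(logits.argmax(dim=1))]
```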
(2) Comparative experiments on the malicious encrypted traffic detection model:
In the accuracy experiments, the embodiment of the application evaluates the performance of the malicious encrypted traffic detection model with the following indexes: average accuracy (AA), average precision (AP), average recall (AR), and average F1 value (AF). With per-class accuracy Acc = (TP + TN)/(TP + TN + FP + FN), precision P = TP/(TP + FP), recall R = TP/(TP + FN), and F1 = 2PR/(P + R), each index is the mean of the per-class values over all M classes;
where M denotes the number of malicious and normal traffic classes in the training dataset;
TP denotes the number of samples of class A traffic predicted as class A;
FP denotes the number of samples of non-class-A traffic predicted as class A;
TN denotes the number of samples of non-class-A traffic predicted as non-class-A;
FN denotes the number of samples of class A traffic predicted as non-class-A;
class A traffic is malicious traffic.
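A numpy sketch of these macro-averaged indexes computed from a confusion matrix (illustrative; function and variable names are assumptions):

```python
import numpy as np

def macro_metrics(cm: np.ndarray) -> dict:
    """cm[i, j] = number of samples of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / cm.sum()
    return {"AA": accuracy.mean(), "AP": precision.mean(),
            "AR": recall.mean(), "AF": f1.mean()}
```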
2.1 Ablation experiments:
The most important part of the VAN is the LKA module; the three convolutions obtained by decomposing the LKA are the spatial local convolution, the spatial long-range convolution, and the channel convolution, combined with the attention mechanism. Different decomposition schemes form networks of different depths and different LKA modules, and decomposing too much or too little can degrade model performance and leave feature extraction incomplete.
The embodiment of the application therefore sets up ablation experiments: the depth-wise convolution (DW-Conv), the depth-wise dilated convolution (DW-D-Conv), the channel convolution (1×1 Conv), and the attention mechanism are removed in turn, a sigmoid function is added on top of the LKA module, and six groups of experiments are run on the public dataset USTC-TFC2016. Average precision (AP) and average F1 value (AF) are used as evaluation indexes to assess the validity of the VAN model, with all parameters kept identical apart from the modified modules.
(1) DW-Conv: the DW-Conv part is removed, and the large-kernel convolution is decomposed directly into a 7×7 convolution with dilation rate 3 and a 1×1 channel convolution; the attention map is then regenerated and the traffic classified. Fig. 7 shows the control structures of the ablation experiment; fig. 7(a) shows the structure after DW-Conv is removed.
(2) DW-D-Conv: the DW-D-Conv part is removed, the large-kernel convolution is decomposed into a 5×5 depth-wise convolution and a 1×1 channel convolution, and the attention map is regenerated to classify the traffic, as shown in fig. 7(b).
(3) Attention mechanism: the attention mechanism is removed, and the traffic is classified directly from the convolution result without generating an attention map, as shown in fig. 7(c).
(4) 1×1 convolution: the 1×1 convolution part is removed, the large-kernel convolution is decomposed into a 5×5 depth-wise convolution and a 7×7 convolution with dilation rate 3, and the attention map is then generated to classify the traffic, as shown in fig. 7(d).
(5) Sigmoid function: the sigmoid function is a common normalization function; when added to the LKA module it normalizes the attention map, as shown in fig. 7(e).
The experimental results show that the original LKA performs best and that every part is critical: the average precision and average F1 value reach 96.27% and 95.65%, respectively. On the AR index the original module is not obviously better than the model without DW-D-Conv, but its advantage shows on the other three indexes.
In summary, the original LKA module outperforms each variant with a component removed, so every part of the LKA is essential for performance: together they effectively capture long-range dependencies and spatial adaptability, making classification more accurate. Compared with adding a sigmoid function, the original LKA also performs better, so normalization is not necessary for the LKA module. Fig. 8 compares the experimental results of the VAN model and the control structures; each group of bars in fig. 8 represents, from left to right, the VAN structure, the VAN without DW-Conv, the VAN without DW-D-Conv, the VAN without the 1×1 convolution, the VAN without the attention mechanism, and the VAN with the sigmoid function added.
2.2 Reference model comparison experiments:
To verify the effectiveness of the LKA-based VAN model, training runs under the early-stopping mechanism (training ends when the validation loss has not fallen below the current minimum for 10 consecutive epochs), and five models (ResNet, GoogLeNet, DPN, VGG, 1D-CNN) are selected as reference models for comparison.
These five models are typical representatives of the computer-vision field, excel in image recognition tasks, and have been applied to malicious encrypted traffic detection. ResNet was proposed by He et al. in 2016 and increases network depth by introducing residual connections; GoogLeNet was proposed by Google in 2014 and convolves the preprocessed image with multiple convolution kernels of different sizes, using pooling and convolution between them to improve network performance; DPN was proposed in 2017 and alleviates vanishing gradients and excessive computational complexity at larger depths through layered dense connections of the preprocessed images; VGG was proposed by Oxford University in 2014 and replaces larger convolution kernels with several smaller ones, with VGG-11 selected here based on the input image size. According to the data in Table 4, the VAN model performs best, reaching 95.53%, 96.27%, 95.56%, and 95.65% on AA, AP, AR, and AF, respectively. Compared with the reference models, AA improves by 0.74%, 3.01%, 2.66%, 2.98%, and 3.88%; AP by 0.71%, 2.03%, 2.23%, 2.41%, and 4.49%; AR by 0.81%, 3.45%, 3.02%, 3.31%, and 4.15%; and AF by 0.8%, 3.51%, 3.07%, 3.48%, and 4.2%.
Table 4. Experimental results of each model on AA, AP, AR, and AF

Model        AA(%)    AP(%)    AR(%)    AF(%)
VAN          95.53    96.27    95.56    95.65
ResNet       94.49    95.56    94.75    94.85
GoogLeNet    92.52    94.24    92.11    92.14
DPN          92.87    94.04    92.54    92.58
VGG          92.55    93.86    92.25    92.17
1D-CNN       91.65    91.78    91.41    91.45
To verify the discrimination of the proposed VAN model in encrypted traffic detection, the confusion matrices were examined; most models confuse the Neris and Virut classes to some extent. The VAN model predicts 63 Neris samples as Virut and 52 Virut samples as Neris, while ResNet, the best-performing reference model, predicts 51 Neris samples as Virut and 114 Virut samples as Neris; the VAN thus reduces the confusion by 50 samples. Fig. 9 shows example confusion matrices for each model in the reference comparison experiment: fig. 9(a) VAN, fig. 9(b) ResNet, fig. 9(c) GoogLeNet, fig. 9(d) DPN, fig. 9(e) VGG, and fig. 9(f) 1D-CNN.
The VAN model attains the highest F1 values on the Neris and Virut classes, improving by 3.77% and 6.07% over ResNet, the best-performing reference model, and by 0.27% and 0.97% over VGG, the reference model that best separates Neris and Virut. From the predicted and true labels of the confusion matrices in fig. 9, when the other models predict the Tinba class of malicious encrypted traffic they wrongly predict some of its samples as other classes, whereas the VAN model classifies that class without error; when ResNet, the best reference model, predicts the Miuref class it errs toward four other classes, while the VAN model mispredicts only 2 samples, as the Htbot class, showing stronger class discrimination for malicious encrypted traffic detection. Fig. 10 compares the F1-value evaluation results of each model on Neris and Virut: the left group of bars represents Neris and the right group Virut; within each group the bars represent, from left to right, VAN, ResNet, GoogLeNet, DPN, VGG, and 1D-CNN.
(3) Embodiment 3: a malicious encrypted traffic detection model based on a visual attention network:
the application further provides a malicious encrypted traffic detection model based on the visual attention network, as shown in fig. 11, fig. 11 is a schematic structural diagram of the malicious encrypted traffic detection model based on the visual attention network in the embodiment of the application, and the system comprises: a preprocessing module 1 and a VAN module 2;
the preprocessing module 1 is used for acquiring the network traffic to be detected and preprocessing the network traffic to be detected to obtain preprocessed network traffic;
the VAN module 2 is internally provided with a malicious encrypted traffic detection model, which is used for detecting the preprocessed network traffic through the malicious encrypted traffic detection model, so as to obtain a classification result of malicious traffic/normal traffic corresponding to the network traffic to be detected.
Specifically, the main function of the preprocessing module 1 is to preprocess the data in the original PCAP file, dividing it by session and extracting the useful information.
As described above, the embodiment of the application detects malicious encrypted traffic based on the VAN model. The LKA module generates an attention map by decomposing a large convolution kernel into a depth-wise convolution, a depth-wise dilated convolution, and a channel convolution, absorbing the advantages of convolution and of the self-attention mechanism: local structural information, long-range dependency, and adaptability, while avoiding drawbacks such as ignoring adaptability in the channel dimension. Applying the VAN structure to malicious encrypted traffic detection makes the model focus on the features of traffic converted into images and achieves high-performance classification. The experimental results show that when the model is applied to malicious-encrypted-traffic image recognition, average accuracy (AA), average precision (AP), average recall (AR), and average F1 value (AF) all increase relative to the other models, and fine-grained discrimination between different classes of malicious encrypted traffic is clearly enhanced. The model extracts more effective malicious-traffic features from the image form of the traffic and improves all indexes of malicious encrypted traffic detection.
The embodiments of the present application have been described in detail above, but they are merely preferred embodiments and should not be construed as limiting the scope of the present application. All equivalent changes and modifications made within the scope of the present application shall fall within its protection scope.

Claims (10)

1. A method for detecting malicious encrypted traffic based on a visual attention network, comprising:
S1: determining an experimental dataset comprising malicious traffic and normal traffic, and preprocessing it to obtain a preprocessed dataset comprising a number of grayscale images of a preset size;
S2: building a visual attention network comprising a self-attention layer and a feedforward neural network layer connected in sequence;
the self-attention layer includes an LKA module, which comprises a self-attention mechanism and several different convolution structures, connected in sequence, obtained by decomposing a large-kernel convolution;
S3: training the visual attention network on the preprocessed dataset to obtain a malicious encrypted traffic detection model;
S4: acquiring network traffic to be detected and preprocessing it to obtain preprocessed network traffic; inputting the preprocessed network traffic into the malicious encrypted traffic detection model to obtain a malicious/normal traffic classification result for the traffic under test.
2. The method for detecting malicious encrypted traffic based on a visual attention network according to claim 1, wherein said step S1 comprises:
S11: determining the USTC-TFC2016 dataset, comprising 10 classes of malicious traffic and 10 classes of normal traffic, as the experimental dataset;
S12: sequentially performing preprocessing comprising traffic segmentation, traffic cleaning, address obfuscation, size unification, image generation, and data labeling on the experimental dataset, to obtain a preprocessed dataset comprising grayscale images of a preset size for the 20 traffic classes.
3. The method for detecting malicious encrypted traffic based on a visual attention network according to claim 2, wherein said step S12 comprises:
S121: performing traffic segmentation with the SplitCap tool on the experimental dataset, which comprises 20 original PCAP traffic files;
S122: performing traffic cleaning on the segmented files, removing duplicate files and traffic without application-layer data;
S123: performing address obfuscation, comprising uniformly deleting IP addresses and using randomly generated IP addresses, on the cleaned files;
S124: performing size unification, comprising traffic truncation or traffic padding to a preset number of bytes, on the address-obfuscated files;
S125: performing image generation, converting the size-unified files into grayscale images of a preset size;
S126: performing data labeling, classifying the generated images under the respective label folders, to obtain a preprocessed dataset comprising grayscale images of a preset size for the 20 traffic classes.
4. The method for detecting malicious encrypted traffic based on a visual attention network according to claim 1, wherein said step S2 comprises:
S21: decomposing the K×K large-kernel convolution to obtain a combined convolution structure comprising, connected in sequence, a ⌈K/d⌉×⌈K/d⌉ depth-wise dilated convolution with dilation rate d, a (2d−1)×(2d−1) depth-wise convolution, and a 1×1 convolution;
S22: building the visual attention network;
the visual attention network comprises a plurality of processing stages, each processing stage comprises a plurality of sequentially stacked processing structures, and each processing structure comprises a self-attention layer and a feedforward neural network layer connected in sequence;
the self-attention layer includes an LKA module, which comprises the combined convolution structure and the self-attention mechanism.
5. The method for detecting malicious encrypted traffic based on a visual attention network according to claim 4, wherein said step S21 comprises:
S211: determining the decomposition structure of the large-kernel convolution based on the output formula of the LKA module, the parameter count p(K,d), and the floating-point operations F(K,d);
the output formula of the LKA module is as follows:
Attention = Conv1×1(DW-D-Conv(DW-Conv(F)));
Output = Attention ⊗ F;
where F denotes the input feature map, F ∈ R^(C×H×W); Attention denotes the attention map, Attention ∈ R^(C×H×W); and ⊗ denotes the element-wise product;
the parameter count p(K,d) is calculated as:
p(K,d) = C × [(2d−1)² + ⌈K/d⌉² + C];
and the floating-point operations F(K,d) as:
F(K,d) = p(K,d) × H × W;
where d denotes the dilation rate and K the kernel size of the large-kernel convolution;
S212: based on the decomposition structure, decomposing the K×K large-kernel convolution to obtain the combined convolution structure comprising, connected in sequence, the ⌈K/d⌉×⌈K/d⌉ depth-wise dilated convolution with dilation rate d, the (2d−1)×(2d−1) depth-wise convolution, and the 1×1 convolution.
6. The method for detecting malicious encrypted traffic based on a visual attention network according to claim 4, wherein said step S22 comprises:
s221: building an attention unit comprising a BN layer, a self-attention layer and an addition residual error connection layer which are sequentially connected;
constructing a feedforward processing unit comprising a BN layer, a feedforward neural network layer and an addition residual error connecting layer which are sequentially connected;
the self-attention layer includes: 1×1 convolution, GELU activation function, LKA module, and 1×1 convolution;
the feedforward neural network layer is a multilayer perceptron, and the multilayer perceptron comprises: 1x1 convolution, 3 x 3 convolution with degree of expansion d, GELU activation function, and 1x1 convolution;
S222: combining one attention unit followed by one feed-forward processing unit into one processing structure (sketched below, after this claim);
S223: building a first, a second, a third, and a fourth processing stage in sequence, the number of processing structures per stage being set to 3, 3, 5, and 2 respectively;
S224: connecting the first, second, third, and fourth processing stages in sequence, and placing a sampling module before the first processing stage, to obtain the visual attention network.
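A sketch of one processing structure from claim 6, reusing the `LKA` class from the sketch after claim 4. The hidden width of the perceptron (×4) and the depth-wise form of the 3×3 dilated convolution are assumptions, as are the class names:

```python
import torch.nn as nn

class AttentionUnit(nn.Module):
    """S221: BN -> (1x1 conv, GELU, LKA, 1x1 conv) -> additive residual."""
    def __init__(self, dim: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(dim)
        self.proj_in = nn.Conv2d(dim, dim, 1)
        self.act = nn.GELU()
        self.lka = LKA(dim)                    # from the earlier sketch
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        return x + self.proj_out(self.lka(self.act(self.proj_in(self.bn(x)))))

class FeedForwardUnit(nn.Module):
    """S221: BN -> (1x1 conv, 3x3 conv with dilation d, GELU, 1x1 conv) -> residual."""
    def __init__(self, dim: int, d: int = 1, hidden=None):
        super().__init__()
        hidden = hidden or dim * 4             # assumed expansion factor
        self.bn = nn.BatchNorm2d(dim)
        self.fc1 = nn.Conv2d(dim, hidden, 1)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=d, dilation=d, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):
        return x + self.fc2(self.act(self.dw(self.fc1(self.bn(x)))))

# S222: one processing structure = one attention unit then one feed-forward unit.
def processing_structure(dim: int) -> nn.Sequential:
    return nn.Sequential(AttentionUnit(dim), FeedForwardUnit(dim))
```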
7. The visual attention network based malicious encrypted traffic detection method of claim 6, wherein across the first, second, third, and fourth processing stages the spatial resolution of the output decreases in sequence, the ratios between the spatial dimension and the embedding vector dimension in the self-attention mechanism of the LKA module are 8, 8, 4, and 4 in sequence, and the embedding vector dimensions in the self-attention mechanism of the LKA module are 64, 128, 160, and 256 in sequence (collected in the configuration sketch below).
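The stage configuration of claims 6-7, collected as constants. Note that the second depth (3) and the second ratio (8) are reconstructions of apparent omissions in the published text, borrowed from the VAN-B0-style configuration, and should be treated as assumptions:

```python
STAGE_DEPTHS = (3, 3, 5, 2)       # processing structures per stage (second 3 assumed)
EMBED_DIMS = (64, 128, 160, 256)  # embedding dimensions per stage (claim 7)
MLP_RATIOS = (8, 8, 4, 4)         # ratios per stage (second 8 assumed)
```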
8. The method for detecting malicious encrypted traffic based on a visual attention network according to claim 6, wherein said step S4 comprises:
S41: acquiring the network traffic to be detected and preprocessing it to obtain preprocessed network traffic;
S42: inputting the preprocessed network traffic into the malicious encrypted traffic detection model;
the sampling module performs downsampling on the preprocessed network traffic to obtain an input sequence corresponding to the preprocessed network traffic;
based on a pipeline processing principle, the first, second, third, and fourth processing stages process the input sequence in sequence;
the LKA modules in the first, second, and third processing stages extract features from the input sequence through the combined convolution structure to obtain a long-range feature map; each LKA module further processes the long-range feature map through the self-attention mechanism to obtain an attention map; the multi-layer perceptron maps the attention map through multi-layer nonlinear transformations to obtain a high-dimensional output sequence; and the high-dimensional output sequence becomes the input sequence of the next processing stage;
the multi-layer perceptron in the fourth processing stage outputs a high-dimensional output sequence, and linear-function processing is applied to it to obtain the malicious-traffic/normal-traffic classification result corresponding to the network traffic to be detected (an end-to-end forward-pass sketch follows this claim).
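Putting the pieces together, a forward-pass sketch of claim 8, reusing `processing_structure` and the stage constants from the sketches above. The use of a stride-2 convolution as the sampling module before each stage, the global average pooling before the linear head, and all names are assumptions, not fixed by the patent:

```python
import torch
import torch.nn as nn

class VANDetector(nn.Module):
    """Four pipeline stages ending in a linear malicious/normal classifier."""
    def __init__(self, in_ch: int = 1, num_classes: int = 2):
        super().__init__()
        stages, ch = [], in_ch
        for dim, depth in zip(EMBED_DIMS, STAGE_DEPTHS):
            stages.append(nn.Sequential(
                # assumed sampling module: a stride-2 convolution per stage
                nn.Conv2d(ch, dim, 3, stride=2, padding=1),
                nn.Sequential(*[processing_structure(dim) for _ in range(depth)]),
            ))
            ch = dim
        self.stages = nn.Sequential(*stages)
        self.head = nn.Linear(EMBED_DIMS[-1], num_classes)  # linear-function processing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.stages(x)            # each stage halves the spatial resolution
        pooled = feats.mean(dim=(2, 3))   # assumed global average pooling
        return self.head(pooled)          # logits: malicious vs. normal

# Usage with a 1-channel 28x28 grayscale traffic image:
# logits = VANDetector()(torch.randn(1, 1, 28, 28))
```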
9. The visual attention network-based malicious encrypted traffic detection method according to claim 1, wherein said preprocessed data set comprises: a training data set and a validation data set;
and said step S3 comprises:
S31: training the visual attention network on the training data set with a cross-entropy loss function and the AdamW optimization algorithm, and, during training, deciding whether to stop via an early-stopping mechanism evaluated on the validation data set, to obtain the malicious encrypted traffic detection model (a training-loop sketch follows this claim).
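A sketch of step S31 under assumed hyper-parameters (learning rate, weight decay, batch size, epoch budget, and early-stopping patience are not given in the patent); `train_set` and `val_set` stand for any PyTorch datasets yielding (image, label) pairs:

```python
import torch
from torch.utils.data import DataLoader, Dataset

def train(model: torch.nn.Module, train_set: Dataset, val_set: Dataset,
          epochs: int = 100, patience: int = 5) -> torch.nn.Module:
    """Cross-entropy + AdamW training with early stopping on validation loss."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
    loss_fn = torch.nn.CrossEntropyLoss()
    train_dl = DataLoader(train_set, batch_size=64, shuffle=True)
    val_dl = DataLoader(val_set, batch_size=64)

    best_val, wait = float("inf"), 0
    for _ in range(epochs):
        model.train()
        for x, y in train_dl:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(x), y).item() for x, y in val_dl)
        if val < best_val:      # improvement: reset the early-stop counter
            best_val, wait = val, 0
        else:                   # no improvement: count toward early stop
            wait += 1
            if wait >= patience:
                break
    return model
```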
10. A malicious encrypted traffic detection model based on a visual attention network, comprising: a preprocessing module and a VAN module;
the preprocessing module is used for acquiring the network traffic to be detected and preprocessing it to obtain preprocessed network traffic;
and the VAN module is provided with a malicious encrypted traffic detection model, and is used for detecting the preprocessed network traffic through the malicious encrypted traffic detection model to obtain the malicious-traffic/normal-traffic classification result corresponding to the network traffic to be detected.
CN202310530512.8A 2023-05-12 2023-05-12 Malicious encryption traffic detection method and model based on visual attention network Pending CN117155595A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310530512.8A CN117155595A (en) 2023-05-12 2023-05-12 Malicious encryption traffic detection method and model based on visual attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310530512.8A CN117155595A (en) 2023-05-12 2023-05-12 Malicious encryption traffic detection method and model based on visual attention network

Publications (1)

Publication Number Publication Date
CN117155595A true CN117155595A (en) 2023-12-01

Family

ID=88908770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310530512.8A Pending CN117155595A (en) 2023-05-12 2023-05-12 Malicious encryption traffic detection method and model based on visual attention network

Country Status (1)

Country Link
CN (1) CN117155595A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210176258A1 (en) * 2019-12-10 2021-06-10 Shanghai Jiaotong University Large-scale malware classification system
CN113989583A (en) * 2021-09-03 2022-01-28 中电积至(海南)信息技术有限公司 Method and system for detecting malicious traffic of internet
CN115713672A (en) * 2022-11-12 2023-02-24 东南大学 Target detection method based on two-way parallel attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HAN Yiyuan: "Deep-Learning-Based Generation, Detection and Recognition of Xixia Ancient Book Characters", Master's Thesis, Ningxia University, pages 27-38 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117675351A (en) * 2023-12-06 2024-03-08 中国电子产业工程有限公司 Abnormal flow detection method and system based on BERT model

Similar Documents

Publication Publication Date Title
Naeem et al. Malware detection in industrial internet of things based on hybrid image visualization and deep learning model
WO2022041394A1 (en) Method and apparatus for identifying network encrypted traffic
CN112738015B (en) Multi-step attack detection method based on interpretable convolutional neural network CNN and graph detection
Yang et al. TLS/SSL encrypted traffic classification with autoencoder and convolutional neural network
CN113806746B (en) Malicious code detection method based on improved CNN (CNN) network
CN112804253B (en) Network flow classification detection method, system and storage medium
CN113162908A (en) Encrypted flow detection method and system based on deep learning
CN113364787A (en) Botnet flow detection method based on parallel neural network
CN117155595A (en) Malicious encryption traffic detection method and model based on visual attention network
CN117082118A (en) Network connection method based on data derivation and port prediction
CN113901448A (en) Intrusion detection method based on convolutional neural network and lightweight gradient elevator
CN114172688A (en) Encrypted traffic network threat key node automatic extraction method based on GCN-DL
CN112261063A (en) Network malicious traffic detection method combined with deep hierarchical network
CN113194094A (en) Abnormal flow detection method based on neural network
CN114362988B (en) Network traffic identification method and device
Jiang et al. Research progress and challenges on application-driven adversarial examples: A survey
Zhang et al. Detection of android malware based on deep forest and feature enhancement
Dehdar et al. Image steganalysis using modified graph clustering based ant colony optimization and Random Forest
CN116506210A (en) Network intrusion detection method and system based on flow characteristic fusion
CN1612135A (en) Invasion detection (protection) product and firewall product protocol identifying technology
CN115695002A (en) Traffic intrusion detection method, apparatus, device, storage medium, and program product
Thomas et al. Comparative analysis of dimensionality reduction techniques on datasets for zero-day attack vulnerability
KR102526935B1 (en) Network intrusion detection system and network intrusion detection method
Li et al. Detection and forensics of encryption behavior of storage file and network transmission data
CN116170237B (en) Intrusion detection method fusing GNN and ACGAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination