CN116155572A

CN116155572A - Encryption traffic network intrusion detection method based on ensemble learning

Info

Publication number: CN116155572A
Application number: CN202310036438.4A
Authority: CN
Inventors: 朱鸿宇; 袁亚丽; 胡文韬; 程光
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2023-01-09
Filing date: 2023-01-09
Publication date: 2023-05-23

Abstract

The invention discloses an encrypted-traffic network intrusion detection method based on ensemble learning, applied to the real-time identification and classification of encrypted malicious traffic in a network. It comprises: encrypted-traffic preprocessing and feature extraction; a clustering analysis module for the time-series features of encrypted traffic; a support vector machine classification module for the statistical features of encrypted traffic; a deep learning anomaly detection module for the encrypted source traffic; and an integration strategy over all traffic analysis modules. The method can cope with the unknown malicious traffic categories that continually appear in the network, raise an alarm before the malicious traffic takes effect, and to a certain extent resist packet-padding evasion attacks, thereby protecting the integrity and availability of the network infrastructure and the security of user data.

Description

Encryption traffic network intrusion detection method based on ensemble learning

Technical Field

The invention relates to a network intrusion detection technique for identifying encrypted malicious traffic based on ensemble learning, and belongs to the technical field of information security.

Background

Network intrusion detection identifies malicious traffic through traffic analysis in order to discover intrusion behavior, and is an important research topic in network security. Detection methods based on machine learning and deep learning are widely used because of their high accuracy and low dependence on expert analysis. However, with advances in encryption technology and users' growing awareness of privacy protection, more and more network traffic is encrypted; traditional protocol analysis, port analysis and packet-header analysis are gradually becoming ineffective, and intrusion detection based on encrypted traffic analysis has become a new research hotspot.

When encrypted traffic is analyzed, two kinds of features become important bases for classification: time-series features derived from the lengths of the individual packets in a flow, and per-flow statistical features such as the flow duration and the numbers of forward and reverse packets. Time-series features have proven effective in practice, and sometimes encrypted traffic can be classified accurately using them alone [4]; efficient time-series clustering algorithms such as symbolic pattern clustering (Symbolic Pattern Forest, SPF) [1] have been proposed in academia. However, time-series features are often affected by packet padding: an attacker may pad the traffic to mask its length features, in which case the performance of a classifier built on time-series features drops dramatically, while statistical features, which are less affected by packet padding, remain more effective. The CS++ SVM classifier [2] may be a good algorithm for processing the statistical features. In addition, the source traffic contains all the information in the flow; if it can be used sensibly by a deep learning method, such as the anomaly detection algorithm Anomaly Transformer [3], it can also contribute substantially to intrusion detection. How to extract the most value from these features is a challenge.

Further, even if accurate classifiers are constructed, they may still be difficult to deploy in the real world. Unknown traffic, i.e. traffic generated by zero-day applications and zero-day attacks, emerges endlessly on the network; conventional traffic classification methods cannot cope with it and seriously misjudge it. Therefore, the unknown-traffic problem must also be handled properly in network intrusion detection.

References

[1] Xiaosheng Li, Jessica Lin and Liang Zhao. Linear Time Complexity Time Series Clustering with Symbolic Pattern Forest. In IJCAI, pages 2930-2936, 2019.

[2] Alistair Shilton, Sutharshan Rajasegarar and Marimuthu Palaniswami. Multiclass Anomaly Detector: the CS++ Support Vector Machine. Journal of Machine Learning Research, 21(213):1-39, 2020.

[3] Jiehui Xu, Haixu Wu, Jianmin Wang and Mingsheng Long. Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy. In ICLR, 2022.

[4] M. Shen, Y. Liu, L. Zhu, X. Du and J. Hu. Fine-Grained Webpage Fingerprinting Using Only Packet Length Information of Encrypted Traffic. IEEE Transactions on Information Forensics and Security, vol. 16, pp. 2046-2059, 2021. doi:10.1109/TIFS.2020.3046876.

Disclosure of Invention

The invention provides an encrypted-traffic network intrusion detection method based on ensemble learning, aimed at the problem of detecting malicious traffic, including unknown attacks.

To solve the above problems, the invention discloses a new algorithm for network intrusion detection and provides an ensemble-learning-based network intrusion detection method for identifying encrypted malicious traffic, comprising the following steps:

S1, traffic feature extraction: extracting time-series features and statistical features from the encrypted network traffic and saving the source traffic, to be used as input to the detectors of the next stage;

S2, clustering of time-series features: feeding the time-series features extracted in S1 into a symbolic pattern clustering algorithm (Symbolic Pattern Forest, SPF, a time-series clustering algorithm of linear time complexity that divides the data using randomly selected symbolic patterns without using a distance metric; the ensemble size needed for good results does not depend on the size of the input sequences, and the algorithm is fast, low in cost and high in accuracy; see reference [1] for details) to obtain a traffic clustering result $R_1$;

S3, classification of statistical features: feeding the statistical features extracted in S1 into the CS++ SVM (a support vector machine that performs combined classification and anomaly detection: it accurately classifies the multiple known classes in the data while simultaneously identifying newly appearing anomalous classes, tightly bounding the known classes while maximizing the unknown space, and its model can be trained without regard to the structure of the data stream; see reference [2] for details) to obtain a traffic classification result $R_2$;

S4, source traffic anomaly detection: feeding the source traffic saved in S1 into the Anomaly Transformer (a time-series anomaly detection algorithm that improves on the traditional approach of extracting features point by point with a sequential model; it exploits the observation that anomalous points are strongly associated with their local neighborhood but weakly associated with the whole sequence, whereas normal points are strongly associated with the whole sequence, which greatly improves detection accuracy; see reference [3] for details) to obtain a traffic classification result $R_3$;

S5, result integration: considering $R_1$, $R_2$ and $R_3$ together and applying a weighted majority voting method to obtain the final malicious traffic identification result $R$.

As an improvement of the present invention, the step S1 further includes:

S11, intercepting the encrypted source traffic at a gateway, performing traffic cleaning, and filtering by the five-tuple <source IP, destination IP, source port, destination port, transport layer protocol>, so as to process the source traffic into a set of encrypted network flows that can be input to a machine learning model;

S12, performing flow segmentation on the encrypted traffic obtained from traffic cleaning, and extracting the first third of the packets of each encrypted flow to characterize the original flow; for each encrypted flow, recording the packet lengths in order and concatenating them into a time-series feature sequence, and computing the flow duration, the numbers of forward and reverse packets, the mean and standard deviation of the packet header length, the maximum inter-arrival times of the forward and reverse flows, and the like, as a statistical feature vector; and saving an image of the characterizing flow as the original feature set.

As an improvement of the invention, in step S2 the time-series feature sequences obtained in step S1 are fed into the symbolic pattern clustering algorithm of reference [1] to obtain traffic clusters $\{C_1, C_2, \ldots, C_m\}$; each traffic cluster contains a number of encrypted traffic samples, and the clustering result carries no label information.

As an improvement of the invention, in step S3 the statistical feature set obtained in step S1 is fed into the pre-trained CS++ SVM classifier of reference [2] to obtain a traffic classification result set $\{B, M_1, M_2, \ldots, M_n, U\}$, wherein $B$ denotes the benign encrypted traffic set obtained by the CS++ SVM classifier; $\{M_1, M_2, \ldots, M_n\}$ denotes the malicious encrypted traffic sets of each fine-grained category obtained by the CS++ SVM classifier, such as a DDoS traffic set, an SSH brute-force traffic set, an infiltration traffic set and the like; and $U$ denotes the unknown encrypted traffic set identified by the CS++ SVM classifier, i.e. zero-day traffic that does not belong to any known category and represents a potential threat.

As an improvement of the present invention, in step S4 the original characterizing encrypted traffic set saved in S1 is fed into the pre-trained Anomaly Transformer anomaly detector of reference [3] to obtain a recognition result set $\{B_2, A\}$, wherein $B_2$ denotes the benign traffic set obtained by the anomaly detector and $A$ denotes the anomalous traffic set obtained by the anomaly detector.

As an improvement of the present invention, the step S5 further includes:

S51, determining the proportion $\alpha$ of packet-padded flows in the traffic, either by sampling the network traffic in real time or from experience, a packet-padded flow being one in which packets have been padded to mask their length features, either by users to protect privacy or by adversaries to hide an attack; and assigning decision weights to the symbolic pattern clustering algorithm, the CS++ SVM classifier and the Anomaly Transformer anomaly detector respectively;

S52, applying the CS++ SVM classifier to the traffic clusters obtained in S2, assigning each traffic cluster a single label by majority vote, and thereby converting the traffic clusters into a traffic classification set (for example, if most samples in a cluster are labeled benign by the CS++ SVM classifier, all samples in that cluster are provisionally considered benign);

S53, integrating the results obtained in S2, S3 and S4 and first determining whether the traffic is benign or malicious: performing weighted majority voting with the weights determined in S51 to divide the encrypted traffic set into a benign set and a malicious set;

S54, further dividing the malicious traffic set using the integration result: taking the intersection of each malicious category from the CS++ SVM classification with the corresponding category from the labeled clusters, i.e. when both the clustering result and the CS++ SVM classification result consider that a malicious flow belongs to a category, the flow is determined to belong to that fine-grained malicious category; all traffic in the malicious traffic set that is not assigned a fine-grained malicious category is grouped with the unknown traffic sets as abnormal traffic and retained for further analysis.
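As a check on the weighting used in S51, take the padding proportion $\alpha$ and the decision weights $(1-\alpha)/(3-\alpha)$, $1/(3-\alpha)$, $1/(3-\alpha)$ stated in the detailed description and in claim 6. The names $w_{\mathrm{SPF}}$, $w_{\mathrm{SVM}}$ and $w_{\mathrm{AT}}$ are notation introduced here for illustration only. The three weights always sum to one, and only the clustering weight shrinks as padding becomes more common:

\[
w_{\mathrm{SPF}} + w_{\mathrm{SVM}} + w_{\mathrm{AT}}
= \frac{1-\alpha}{3-\alpha} + \frac{1}{3-\alpha} + \frac{1}{3-\alpha}
= \frac{3-\alpha}{3-\alpha} = 1 .
\]

For $\alpha = 0$ (no padded flows) each learner has weight $1/3$; for $\alpha = 1$ (all flows padded) the clustering weight is $0$ and the other two weights are $1/2$ each.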

Beneficial effects: compared with the prior art, the invention provides an encrypted-traffic network intrusion detection method based on heterogeneous ensemble learning. It extracts only the first third of the packets of each flow as a representation, which reduces processing complexity and allows an alarm to be raised before malicious traffic causes harm; it does not depend on features of unencrypted traffic and can cope with the growing volume of encrypted traffic in the network; it can detect unknown zero-day threats; and it uses heterogeneous features and a heterogeneous-learner integration strategy to maximize diversity, adjusting the integration strategy in time according to the proportion of packet-padded flows in the network. The scheme combines coarse-grained and fine-grained classification, takes into account the prevalence of zero-day attack traffic in the network, and to a certain extent avoids the high false positive rate of intrusion detection systems; time-series data streams are processed automatically, at high speed, with low hardware requirements and low processing cost, which makes practical deployment convenient; the integrated model handles combined classification and anomaly detection simultaneously, regardless of the structure of the data; and because three models vote under the integration strategy rather than a single model deciding, the method has good stability.

Drawings

FIG. 1 is a flow chart of the method steps of the present invention;

FIG. 2 is a working framework diagram of the present invention.

Detailed Description

The present invention is further illustrated by the following detailed description, which is to be taken in conjunction with the accompanying drawings, and should be understood as being merely illustrative of the present invention and not limiting the scope of the invention.

Embodiment: as shown in FIG. 1, an ensemble-learning-based network intrusion detection method for identifying encrypted malicious traffic comprises the following steps:

Traffic feature extraction:

In the era of the Internet of Everything, the data storage capacity of networks has increased, data generation and collection technologies have advanced greatly, and large amounts of time-series data are available. Although we can monitor a wide range of physical objects and environments in fine spatial and temporal detail, we also face the problem of processing large, complex and evolving data streams.

Feature extraction requires careful consideration and fine handling. Is the retained flow length sufficient, without being so long as to be redundant? Which components are needed to extract the timing features? Are the global statistical features free of bias? These questions require careful planning and extensive reference to other work. Here, the first third of the packets of an encrypted flow is considered sufficient to characterize the original flow, and the length-related statistical features we use are generally accepted.

We first obtain encrypted network traffic from server clusters, personal computers (PCs) and other network monitoring points, and send it to the intrusion detection system (IDS) for preprocessing: timing features and statistical features are extracted and the source traffic is saved, to be used as input to the next-stage detectors.

Step (1), preprocessing the encrypted traffic and extracting features, comprises the following sub-steps:

(1.1) intercepting the encrypted source traffic at a gateway, performing traffic cleaning, and filtering by the five-tuple <source IP, destination IP, source port, destination port, transport layer protocol>, so as to process the source traffic into a set of encrypted network flows that can be input to a machine learning model;

(1.2) performing flow segmentation on the encrypted traffic obtained from traffic cleaning, and extracting the first n packets of each encrypted flow (n is a hyperparameter determined by the specific application) to characterize the original flow; for each encrypted flow, recording the packet lengths in order and concatenating them into a time-series feature sequence, and computing the flow duration, the numbers of forward and reverse packets, the mean and standard deviation of the packet header length, the maximum inter-arrival times of the forward and reverse flows, and the like, as a statistical feature vector; and saving an image of the characterizing flow as the original feature set.
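Sub-steps (1.1) and (1.2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Packet` record, the function names and the feature keys are hypothetical, packets are assumed to be already parsed, the first packet of a flow is assumed to define the forward direction, and the mean and standard deviation are taken over whole-packet lengths as a stand-in for the header-length statistics named in the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Packet:  # hypothetical record of one captured packet
    ts: float       # capture timestamp in seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str      # transport-layer protocol, e.g. "TCP"
    length: int     # packet length in bytes


def flow_key(p):
    """Direction-independent five-tuple, so both directions map to one flow."""
    a, b = (p.src_ip, p.src_port), (p.dst_ip, p.dst_port)
    return (min(a, b), max(a, b), p.proto)


def max_gap(pkts):
    """Largest inter-arrival time in a packet list (0.0 if fewer than two)."""
    return max((b.ts - a.ts for a, b in zip(pkts, pkts[1:])), default=0.0)


def extract_features(packets, n):
    """Group packets into flows and describe each flow by its first n packets."""
    flows = defaultdict(list)
    for p in sorted(packets, key=lambda p: p.ts):
        flows[flow_key(p)].append(p)

    features = {}
    for key, pkts in flows.items():
        head = pkts[:n]
        # Assumption: the sender of the first packet defines "forward".
        initiator = (head[0].src_ip, head[0].src_port)
        fwd = [p for p in head if (p.src_ip, p.src_port) == initiator]
        bwd = [p for p in head if (p.src_ip, p.src_port) != initiator]
        lengths = [p.length for p in head]
        features[key] = {
            "length_sequence": lengths,          # time-series feature
            "stats": {                           # statistical feature vector
                "duration": head[-1].ts - head[0].ts,
                "fwd_packets": len(fwd),
                "bwd_packets": len(bwd),
                "mean_length": mean(lengths),
                "std_length": pstdev(lengths),
                "max_fwd_gap": max_gap(fwd),
                "max_bwd_gap": max_gap(bwd),
            },
        }
    return features
```

The `length_sequence` entry would feed the clustering stage and the `stats` entry the support vector machine stage; the saved source traffic for the anomaly detector is not modeled here.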

Clustering of time-series features:

Automatically interpreting the time-series data stream obtained in the previous step is a great challenge. The large data volume makes manual inspection impractical, and the rapidly changing content of the data stream requires a data mining method with low time complexity and low hardware requirements, so that processor performance and processing cost are balanced and the solution can be deployed more widely.

In 2019, a symbolic pattern clustering algorithm of linear time complexity (Symbolic Pattern Forest, SPF) was proposed [1]. It is an effective answer to the above challenges: it divides the data using randomly selected symbolic patterns without using a distance metric, the ensemble size needed for good results does not depend on the size of the input sequences, and it is fast, low in cost and high in accuracy.

Step (2): the time-series features extracted in step (1) are fed into the symbolic pattern clustering algorithm of reference [1] to obtain a traffic clustering result $R_1$.

In symbols: the time-series feature sequences obtained in step (1) are fed into the symbolic pattern clustering algorithm of [1] to obtain traffic clusters $\{C_1, C_2, \ldots, C_m\}$; each traffic cluster contains a number of encrypted traffic samples, and the clustering result carries no label information.
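The idea of dividing flows by randomly selected symbolic patterns, without a distance metric, can be illustrated with a deliberately simplified sketch. It is not the SPF algorithm of [1]: it performs a single split rather than building and combining an ensemble of partitions, and the fixed length bins are a hypothetical stand-in for the symbolization used in [1]. All function names and parameters are assumptions made for illustration.

```python
import random


def symbolize(lengths, bins=(100, 500, 1000)):
    """Turn a packet-length sequence into a word over the alphabet 'abcd'.

    The bin edges are hypothetical values chosen only for this example.
    """
    return "".join("abcd"[sum(length > edge for edge in bins)] for length in lengths)


def split_by_random_pattern(words, pattern_len=2, seed=0):
    """Split flows by whether a randomly chosen symbolic pattern occurs in them.

    words maps a flow id to its symbolic word. Returns the chosen pattern,
    the ids whose word contains it, and the ids whose word does not.
    """
    rng = random.Random(seed)
    candidates = sorted({w[i:i + pattern_len]
                         for w in words.values()
                         for i in range(len(w) - pattern_len + 1)})
    if not candidates:  # every word is shorter than the pattern length
        return None, set(), set(words)
    pattern = rng.choice(candidates)
    containing = {fid for fid, w in words.items() if pattern in w}
    return pattern, containing, set(words) - containing
```

Like the clusters $\{C_1, C_2, \ldots, C_m\}$ described above, the two groups returned here carry no label information; labels are attached only later, in the integration step.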

Classification of statistical features:

Because new, unknown events may occur at any time, one cannot consider only the known classes and the artificially controlled class proportions of a laboratory training set; the real situation must be treated with caution. Classifiers trained on known patterns, such as conventional binary and multi-class support vector machines, either fail to identify these events or label them incorrectly.

In 2020, a new SVM algorithm, the CS++ SVM, was proposed [2]. It accurately classifies the multiple known classes in the data while simultaneously identifying newly appearing anomalous classes, tightly bounding the known classes while maximizing the unknown space, and its model can be trained without regard to the structure of the data stream, thereby effectively solving the combined classification/anomaly detection problem.

Step (3): the statistical features extracted in step (1) are fed into the CS++ support vector machine classifier of reference [2] to obtain a traffic classification result $R_2$.

In symbols:

the statistical feature set obtained in step (1) is fed into the pre-trained CS++ SVM classifier of [2] to obtain a traffic classification result set $\{B, M_1, M_2, \ldots, M_n, U\}$, where $B$ denotes the benign encrypted traffic set obtained by the CS++ SVM classifier; $\{M_1, M_2, \ldots, M_n\}$ denotes the malicious encrypted traffic sets of each fine-grained category obtained by the CS++ SVM classifier, such as a DDoS traffic set, an SSH brute-force traffic set, an infiltration traffic set and the like; and $U$ denotes the unknown encrypted traffic set identified by the CS++ SVM classifier, i.e. zero-day traffic that does not belong to any known category and represents a potential threat.
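The shape of the result set $\{B, M_1, \ldots, M_n, U\}$ can be made concrete with a small sketch. The classifier itself is not implemented here; the sketch assumes that some pre-trained classifier has already produced one label per flow, and the label strings "benign" and "unknown" as well as the function name are hypothetical.

```python
def build_result_set(predictions, benign="benign", unknown="unknown"):
    """Arrange per-flow labels into the sets B, {M_i} and U.

    predictions maps a flow id to a label: the benign label, the unknown
    label, or the name of a fine-grained malicious category.
    """
    B, M, U = set(), {}, set()
    for flow_id, label in predictions.items():
        if label == benign:
            B.add(flow_id)
        elif label == unknown:
            U.add(flow_id)
        else:
            M.setdefault(label, set()).add(flow_id)  # one set per category
    return B, M, U
```

Keeping $U$ separate from the malicious categories is what later allows unknown traffic to be retained for analysis instead of being forced into a known class.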

Source traffic anomaly detection:

Anomalous data usually make up a very small proportion of the traffic, and their distribution over time is unpredictable and highly imbalanced, so they are difficult to detect effectively and place high demands on the sensitivity of the detection method.

In 2022, in response to this dilemma, the Anomaly Transformer anomaly detector was proposed [3]. It abandons the traditional approach of extracting features point by point with a sequential model, and exploits the observation that anomalous points are strongly associated with their local neighborhood but weakly associated with the whole sequence, whereas normal points are strongly associated with the whole sequence; using this property greatly improves the accuracy of the detection model.

Step (4): the source traffic saved in step (1) is fed into the Anomaly Transformer detector of reference [3] to obtain a traffic classification result $R_3$.

In symbols:

the original characterizing encrypted traffic set saved in (1) is fed into the pre-trained Anomaly Transformer anomaly detector of [3] to obtain a recognition result set $\{B_2, A\}$, where $B_2$ denotes the benign traffic set obtained by the anomaly detector and $A$ denotes the anomalous traffic set obtained by the anomaly detector.
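How per-flow detector output could be turned into the two sets $\{B_2, A\}$ is sketched below. The detector is not implemented; the sketch assumes it has already produced a list of per-point anomaly scores for each flow, and the rule of flagging a flow when any point exceeds a threshold, the threshold itself and the function name are all assumptions made for illustration.

```python
def split_by_anomaly_score(point_scores, threshold):
    """Split flows into a benign set B2 and an anomalous set A.

    point_scores maps a flow id to its per-point anomaly scores.
    Assumption: a flow is anomalous if any of its points scores above
    the threshold; a flow with no scores is treated as benign.
    """
    benign, anomalous = set(), set()
    for flow_id, scores in point_scores.items():
        if max(scores, default=0.0) > threshold:
            anomalous.add(flow_id)
        else:
            benign.add(flow_id)
    return benign, anomalous
```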

Result integration:

Considering the traffic classification results $R_1$, $R_2$ and $R_3$ together, we use a weighted majority voting method to obtain the final malicious traffic identification result $R$. The weights are assigned manually according to prior experience; if the specific effect is unclear, equal weights can be used first and then adjusted dynamically with expert experience according to the classification results, which enhances the robustness of the model.

After the voting result is obtained, the traffic is first divided at coarse granularity into benign and malicious; fine-grained identification is then performed on the malicious traffic, and unknown traffic in the malicious set is retained for further analysis rather than labeled directly as is conventionally done, which enhances the fault tolerance of the model.

Step (5) specifically comprises the following sub-steps:

(5.1) determining the proportion $\alpha$ of packet-padded flows in the traffic, either by sampling the network traffic in real time or from experience, a packet-padded flow being one in which packets have been padded to mask their length features, either by users to protect privacy or by adversaries to hide attack behavior; the symbolic pattern clustering algorithm, the CS++ SVM classifier and the Anomaly Transformer anomaly detector are assigned the decision weights $(1-\alpha)/(3-\alpha)$, $1/(3-\alpha)$ and $1/(3-\alpha)$ respectively;

(5.2) applying the CS++ SVM classifier to the traffic clusters obtained in (2) and assigning each traffic cluster a single label by majority vote, thereby converting the traffic clusters $\{C_1, C_2, \ldots, C_m\}$ into a traffic classification set $\{B_1', M_1', M_2', \ldots, M_n', U_1'\}$ (for example, if most samples in $C_1$ are labeled benign by the CS++ SVM classifier, all samples in $C_1$ are provisionally considered benign);

(5.3) integrating the results obtained in (2), (3) and (4) and first determining whether the traffic is benign or malicious: weighted majority voting is performed with the weights determined in (5.1), and the encrypted traffic set is divided into a benign set $B$ and a malicious set $M$. The malicious traffic set $M$ is then further divided using the integration result: the intersection of $M_i$ and $M_i'$ is taken ($i = 1, 2, \ldots, n$), i.e. when both the clustering result and the CS++ SVM classification result consider that a malicious flow belongs to category $M_i$, the flow is determined to belong to the fine-grained malicious category $M_i$. All traffic in the malicious traffic set that is not assigned a fine-grained malicious category is grouped with the unknown traffic sets $U$ and $U'$ as abnormal traffic and retained for further expert analysis.
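Sub-steps (5.1) to (5.3) can be sketched together as follows. This is one possible reading under stated assumptions, not the patented implementation: the label strings and function names are hypothetical; any non-benign label is counted as a malicious vote; a flow is treated as malicious when the weighted malicious vote reaches 0.5 (so ties go to malicious); ties in the per-cluster majority vote are broken arbitrarily; and flows labeled unknown by either the CS++ SVM or the cluster labeling are always placed in the abnormal set.

```python
from collections import Counter

BENIGN, UNKNOWN = "benign", "unknown"  # hypothetical label names


def decision_weights(alpha):
    """(5.1) weights for the clustering, CS++ SVM and anomaly-detector votes."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is a proportion and must lie in [0, 1]")
    denom = 3.0 - alpha
    return (1.0 - alpha) / denom, 1.0 / denom, 1.0 / denom


def label_clusters(clusters, svm_label):
    """(5.2) give every flow the majority CS++ SVM label of its cluster."""
    cluster_label = {}
    for members in clusters:
        if not members:
            continue
        majority = Counter(svm_label[f] for f in members).most_common(1)[0][0]
        for f in members:
            cluster_label[f] = majority
    return cluster_label


def integrate(clusters, svm_label, anomalous, alpha, threshold=0.5):
    """(5.3) weighted vote, then fine-grained split of the malicious set.

    clusters: iterable of sets of flow ids (unlabeled clusters C_1..C_m)
    svm_label: flow id -> label from the CS++ SVM stage
    anomalous: set of flow ids flagged by the anomaly detector (set A)
    Returns (benign set, {category: set of flows}, abnormal set).
    """
    w_cluster, w_svm, w_anomaly = decision_weights(alpha)
    cluster_label = label_clusters(clusters, svm_label)
    benign, fine_grained, abnormal = set(), {}, set()
    for f, from_cluster in cluster_label.items():
        from_svm = svm_label[f]
        score = (w_cluster * (from_cluster != BENIGN)
                 + w_svm * (from_svm != BENIGN)
                 + w_anomaly * (f in anomalous))
        malicious = score >= threshold
        agree = from_svm == from_cluster and from_svm not in (BENIGN, UNKNOWN)
        if malicious and agree:
            # Both learners name the same category: intersection of M_i and M_i'.
            fine_grained.setdefault(from_svm, set()).add(f)
        elif malicious or UNKNOWN in (from_svm, from_cluster):
            abnormal.add(f)  # kept for expert analysis
        else:
            benign.add(f)
    return benign, fine_grained, abnormal
```

With $\alpha = 0.5$, for example, `decision_weights` gives weights of 0.2, 0.4 and 0.4, so the learner that relies on packet lengths counts for less once half the flows are padded.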

The technical means disclosed in the scheme of the invention are not limited to those disclosed in the above embodiment, but also include technical schemes formed by any combination of the above technical features.

Claims

1. An encrypted traffic network intrusion detection method based on ensemble learning, characterized by comprising the following steps:

S2, clustering of time-series features: feeding the time-series features extracted in S1 into a symbolic pattern clustering algorithm to obtain a traffic clustering result $R_1$;

S3, classification of statistical features: feeding the statistical features extracted in S1 into the CS++ SVM to obtain a traffic classification result $R_2$;

S4, source traffic anomaly detection: feeding the source traffic saved in S1 into the Anomaly Transformer to obtain a traffic classification result $R_3$;

2. The encrypted traffic network intrusion detection method based on ensemble learning according to claim 1, characterized in that step S1 further comprises:

3. The encrypted traffic network intrusion detection method based on ensemble learning according to claim 1, characterized in that: in step S2, the time-series feature sequences obtained in step S1 are fed into the symbolic pattern clustering algorithm to obtain traffic clusters $\{C_1, C_2, \ldots, C_m\}$; each traffic cluster contains a number of encrypted traffic samples, and the clustering result carries no label information.

4. The encrypted traffic network intrusion detection method based on ensemble learning according to claim 1, characterized in that: in step S3, the statistical feature set obtained in step S1 is fed into a pre-trained CS++ SVM classifier to obtain a traffic classification result set $\{B, M_1, M_2, \ldots, M_n, U\}$, where $B$ denotes the benign encrypted traffic set obtained by the CS++ SVM classifier; $\{M_1, M_2, \ldots, M_n\}$ denotes the malicious encrypted traffic sets of each fine-grained category obtained by the CS++ SVM classifier, such as a DDoS traffic set, an SSH brute-force traffic set, an infiltration traffic set and the like; and $U$ denotes the unknown encrypted traffic set identified by the CS++ SVM classifier, i.e. zero-day traffic that does not belong to any known category and represents a potential threat.

5. The encrypted traffic network intrusion detection method based on ensemble learning according to claim 1, characterized in that: in step S4, the original characterizing encrypted traffic set saved in step S1 is fed into a pre-trained Anomaly Transformer anomaly detector to obtain a recognition result set $\{B_2, A\}$, where $B_2$ denotes the benign traffic set obtained by the anomaly detector and $A$ denotes the anomalous traffic set obtained by the anomaly detector.

6. The encrypted traffic network intrusion detection method based on ensemble learning according to claim 1, characterized in that step S5 further comprises:

S51, determining the proportion $\alpha$ of packet-padded flows in the traffic, either by sampling the network traffic in real time or from experience, a packet-padded flow being one in which packets have been padded to mask their length features, either by users to protect privacy or by adversaries to hide attack behavior; the symbolic pattern clustering algorithm, the CS++ SVM classifier and the Anomaly Transformer anomaly detector are assigned the decision weights $(1-\alpha)/(3-\alpha)$, $1/(3-\alpha)$ and $1/(3-\alpha)$ respectively;

S52, applying the CS++ SVM classifier to the traffic clusters obtained in S2 and assigning each traffic cluster a single label by majority vote, thereby converting the traffic clusters $\{C_1, C_2, \ldots, C_m\}$ into a traffic classification set $\{B_1', M_1', M_2', \ldots, M_n', U_1'\}$;

S53, integrating the results obtained in S2, S3 and S4 and first determining whether the traffic is benign or malicious: weighted majority voting is performed with the weights determined in step S51, and the encrypted traffic set is divided into a benign set $B$ and a malicious set $M$;

S54, further dividing the malicious traffic set $M$ using the integration result: the intersection of $M_i$ and $M_i'$ is taken ($i = 1, 2, \ldots, n$), i.e. when both the clustering result and the CS++ SVM classification result consider that a malicious flow belongs to category $M_i$, the flow is determined to belong to the fine-grained malicious category $M_i$; all traffic in the malicious traffic set that is not assigned a fine-grained malicious category is grouped with the unknown traffic sets $U$ and $U'$ as abnormal traffic and retained, as suspected zero-day attack traffic, for further security event analysis and academic research.