CN114710326A

CN114710326A - Intrusion detection method based on GC-Forest

Info

Publication number: CN114710326A
Application number: CN202210263762.5A
Authority: CN
Inventors: 赵金雄; 王国华; 张驯; 骆怡; 马宏忠; 狄磊
Original assignee: STATE GRID GANSU ELECTRIC POWER RESEARCH INSTITUTE
Current assignee: STATE GRID GANSU ELECTRIC POWER RESEARCH INSTITUTE
Priority date: 2022-03-15
Filing date: 2022-03-15
Publication date: 2022-07-05

Abstract

The invention discloses an intrusion detection method based on GC-Forest. First, principal component analysis is used to perform feature selection and dimension reduction on the acquired network data to obtain sample data. Next, the sample data is scanned with a plurality of windows of different sizes, the scanned data is passed through two random forests to form enhanced feature data, and the enhanced feature data is reconstructed together with the original data to form new feature data. Finally, the cascade forest is trained with the reconstructed new feature data. The GC-Forest adopted in this method has high classification accuracy and a simpler network structure, with fewer hyper-parameters and a faster training speed than CNN, which makes the algorithm better suited to parallel computation. The method effectively addresses the complexity and time consumption of CNN-based intrusion detection and has clear advantages over CNN in training time and intrusion detection rate.

Description

Intrusion detection method based on GC-Forest

Technical Field

The invention belongs to the technical field of network intrusion detection, and particularly relates to an intrusion detection method based on GC-Forest.

Background

Intrusion detection technology is an important component of network security. It detects intrusion behaviors by collecting and analyzing information on a network, and is an important means of maintaining network security. With the popularization of networks and the increase in network speed, attacks are becoming more frequent and attack methods are continuously updated, so that traditional intelligent detection technology can no longer achieve the expected effect. In recent years, owing to the excellent performance of deep learning in classification, regression and other tasks, intrusion detection algorithms based on deep learning have been proposed continuously. Most traditional deep models are fully connected networks, whose many parameters make training time-consuming and prone to overfitting. Intrusion detection methods based on the convolutional neural network (CNN) offer significant improvements over traditional learning models such as SVMs, decision trees and k-nearest neighbors. CNN shows great potential in the field of intrusion detection: it has fewer connections and hyper-parameters and better generalization ability, can extract deeper and finer features, and is easier to train. However, CNN first requires a large number of labels in practical applications, which greatly increases the workload. In addition, convolution operations require high-dimensional convolution kernels, which are computationally complex and time-consuming. Another disadvantage is that there are still many hyper-parameters (such as the number of nodes, the number of layers and the learning rate); although there are fewer than in conventional deep methods, tuning them still takes a lot of time.

Ensemble learning combines different algorithms: conventional intelligent algorithms or deep learning algorithms are designed as a plurality of weak classifiers, and better performance is then obtained by coordinating the classification strategy of the classifier group. However, learning that relies on a set of traditional network models cannot mine deeper information to reach higher performance, which is the bottleneck limiting its performance. On this basis, the invention provides a novel intrusion detection method based on GC-Forest.

Disclosure of Invention

In view of the deficiencies pointed out in the background, the invention provides an intrusion detection method based on GC-Forest. GC-Forest combines the advantages of CNN representation learning with the robustness of traditional ensemble learning, and aims to solve the complexity and time consumption of the existing CNN intrusion detection methods described in the background.

In order to achieve the purpose, the invention adopts the technical scheme that:

an intrusion detection method based on GC-Forest includes the following steps:

(1) performing feature selection and dimension reduction on the acquired network data by using principal component analysis to obtain sample data;

(2) the sample data is used as the original feature data for multi-granularity scanning: a plurality of windows of different sizes are used to scan the original feature data respectively, the scanned data is then passed through two random forests (RF) to form enhanced feature data, and the enhanced feature data and the original feature data are reconstructed to form new feature data;

(3) training a cascade forest (CDF) with the reconstructed new feature data.

The algorithm of the cascade forest is as follows: a multi-level network architecture is constructed using a cascade structure, in which the output vector of the first level is regarded as enhanced features; the enhanced features are concatenated with the original features and used as the input of the next level, and this step is repeated level by level.

Preferably, each level of the cascade forest uses a completely-random tree forest (C-RTF) and a random forest (RF).

Compared with the defects and shortcomings of the prior art, the invention has the following beneficial effects:

In the intrusion detection method based on GC-Forest provided by the invention, GC-Forest combines the advantages of CNN representation learning with the robustness of traditional ensemble learning. It has high classification accuracy and a simpler network structure, with fewer hyper-parameters and a faster training speed than CNN, which makes the algorithm better suited to parallel computation. It thereby addresses the complexity and time consumption of CNN intrusion detection methods while being more accurate. Experiments on the NSL-KDD dataset show that the algorithm provided by the invention has clear advantages over CNN in training time and intrusion detection rate, and in particular performs better than CNN on small datasets.

Drawings

FIG. 1 is a block diagram of a GC-Forest provided by the present invention.

FIG. 2 is a graph showing the importance of the features provided in Example 1 of the present invention.

FIG. 3 is a graph of test accuracy as a function of CDF level provided in Example 1 of the present invention.

FIG. 4 shows the accuracy scores on the Test set under different window settings provided in Example 1 of the present invention.

FIG. 5 shows the accuracy scores on the Test21 set under different window settings provided in Example 1 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

An intrusion detection method based on GC-Forest comprises the following steps:

(1) Principal component analysis (PCA) is an unsupervised linear transformation technique that is widely used in different fields, most notably for dimensionality reduction. In short, PCA finds the directions of maximum variance in high-dimensional data and projects the data onto a new subspace whose dimension is the same as or lower than that of the original data. The PCA algorithm is shown in Table 1.

TABLE 1 PCA Algorithm

Feature selection and dimension reduction are performed on the acquired network data using principal component analysis to obtain sample data.
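A minimal sketch of the PCA step, offered as a generic textbook version rather than the exact procedure of Table 1. The function name `pca` and its parameters are hypothetical, and power iteration with deflation is used here only as a dependency-free stand-in for a full eigendecomposition of the covariance matrix.

```python
import math
import random


def pca(rows, k, iters=500, seed=0):
    """Project rows onto the k directions of largest variance.

    Illustrative only: power iteration with deflation stands in for a
    full eigendecomposition of the covariance matrix.
    """
    n, d = len(rows), len(rows[0])
    # 1. Center every feature on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # 2. Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    axes, variances = [], []
    for _ in range(k):
        # 3. Power iteration: repeatedly apply cov and renormalize.
        v = [rng.random() + 0.1 for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in the remaining directions
                break
            v = [x / norm for x in w]
        # Variance captured along v (the eigenvalue estimate).
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        axes.append(v)
        variances.append(lam)
        # 4. Deflation: remove the component just found.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    # 5. Project the centered data onto the principal axes.
    projected = [[sum(c[j] * a[j] for j in range(d)) for a in axes]
                 for c in centered]
    return projected, axes, variances


# Toy data lying on the line y = 2x: one component captures all variance.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
projected, axes, variances = pca(rows, k=1)
```

In practice a library routine would normally replace the hand-written eigen solver; the sketch only makes the center, covariance, eigenvector and projection steps explicit.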

(2) The sample data after PCA dimensionality reduction is used as the original feature data of GC-Forest and passes in turn through the multi-granularity scanning algorithm and the cascade forest (CDF) algorithm.

The structure of GC-Forest is shown in FIG. 1 and comprises two parts: multi-granularity scanning and the cascade forest (CDF).

The multi-granularity scanning algorithm is as follows: a plurality of windows of different sizes are used to scan the original feature data respectively, the scanned data is passed through two random forests (RF) to form enhanced feature data, and the enhanced feature data and the original feature data are reconstructed to form new feature data.

The length of the input sample vector X after PCA extraction is len(X). Assuming that windows of widths (a, b, c) with step sizes (s_a, s_b, s_c) are used to scan the data, and that the number of target classes is C, the length of the new feature data X_mgs is given by the formula:
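A minimal sketch of how such a length can be counted, assuming the usual multi-granularity scanning scheme: a window of width w and step s yields (len(X) - w) / s + 1 sub-vectors, and each of the two forests turns every sub-vector into a C-dimensional class vector. The function names are hypothetical, and whether len(X) itself is added to the total is left to an explicit flag because the text does not settle it.

```python
def sliding_windows(x, width, stride=1):
    """Return every sub-vector of x seen by a window of the given width."""
    return [x[i:i + width] for i in range(0, len(x) - width + 1, stride)]


def windows_per_sample(length, width, stride=1):
    """Number of window positions on a vector of the given length."""
    if width > length:
        raise ValueError("window is wider than the input vector")
    return (length - width) // stride + 1


def scanned_feature_length(length, widths, strides, n_classes,
                           n_forests=2, include_original=False):
    """Length of the scanned feature vector (assumed counting scheme)."""
    total = sum(n_forests * n_classes * windows_per_sample(length, w, s)
                for w, s in zip(widths, strides))
    # Optionally append the original features (an assumption, see lead-in).
    return total + (length if include_original else 0)


# 50 PCA features, windows (10, 20, 30), step 1, two classes, two forests.
n_scanned = scanned_feature_length(50, (10, 20, 30), (1, 1, 1), n_classes=2)
```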

(4) Cascade forest (CDF) algorithm: after multi-granularity scanning, the features are reconstructed and the new feature data is fed into the CDF. The CDF is an ensemble learning method built from random forests and is competitive with CNN. The CDF uses a cascade structure to construct a multi-level network architecture: the output vector of the first level is regarded as enhanced features, which are concatenated with the original features and used as the input of the next level. The CDF uses a simpler network structure than CNN, which facilitates understanding and analysis, and its fewer hyper-parameters make the algorithm better suited to parallel computation. Furthermore, the CDF shows better performance than CNN on small datasets.

Each level of the CDF uses two forest algorithms: a completely-random tree forest (C-RTF), in which each tree randomly selects a single feature for the split at each node, and a random forest (RF), in which each tree randomly selects √F candidate features (F is the number of features in a sample) and chooses the one with the best Gini coefficient as the split. The Gini coefficient is defined as follows:

$$\mathrm{Gini} = 1 - \sum_{i=1}^{C} p_i^{2} \tag{2}$$

wherein p_i is the probability that a sample belongs to class i, and C is the number of classes.
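As a small worked example of the Gini coefficient, the sketch below estimates p_i from the class frequencies of the labels that reach a node. The function name `gini` is hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini coefficient 1 - sum(p_i^2) of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure = gini(["attack", "attack", "attack"])               # single class
mixed = gini(["attack", "normal", "attack", "normal"])    # evenly split
```

A pure node scores 0 and an evenly split two-class node scores 0.5, so a split is better when the child nodes have lower values.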

After the data is fed into the CDF, the first level generates M×C features, where M is the number of forests in each level. The generated features are concatenated with the original features and passed to the next level. The final classification result of the last CDF level is given by formulas (3) and (4):

$$\mathrm{Fin}(c) = \mathrm{Max}\{\mathrm{Ave}[c_{M \times C}]\} \tag{3}$$

$$c_{M \times C} = [c_{11}, c_{12}, \ldots, c_{1C};\ \ldots;\ c_{M1}, c_{M2}, \ldots, c_{MC}] \tag{4}$$
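The concatenation between levels and the final decision of formulas (3) and (4) can be sketched as follows. This is an illustration under stated assumptions: the class vectors are taken as given (no forests are trained here), and the function names are hypothetical.

```python
def concat_level_output(original, class_vectors):
    """Input of the next level: original features plus M x C class outputs."""
    augmented = list(original)
    for vec in class_vectors:
        augmented.extend(vec)
    return augmented


def final_class(class_vectors):
    """Average the M class vectors per class, then pick the largest mean."""
    m = len(class_vectors)
    c = len(class_vectors[0])
    averaged = [sum(vec[j] for vec in class_vectors) / m for j in range(c)]
    best = max(range(c), key=lambda j: averaged[j])
    return best, averaged


# Two forests (M = 2) and two classes (C = 2): index 0 normal, 1 attack.
outputs = [[0.6, 0.4], [0.2, 0.8]]
next_input = concat_level_output([0.1, 0.5, 0.9], outputs)
label, averaged = final_class(outputs)
```

Here the averaged vector favors class index 1, so the sample would be labeled as an attack under this toy encoding.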

(5) Finally, the new features are classified using GC-Forest.

Example 1: experimental testing

Experimental platform: Intel i7-7700, Ubuntu 18.04, Python 3.6.5.

(1) NSL-KDD dataset

GC-Forest was evaluated on the NSL-KDD dataset. The NSL-KDD dataset consists of four sets: KDDTrain+, KDDTest+, KDDTrain+_20Percent and KDDTest-21, abbreviated as Train, Test, Train20 and Test21. The Train set contains 22 different attacks and the Test set contains 38 attack patterns; the attacks that do not appear in Train are used to evaluate the ability to detect unknown attacks. The Train20 set is a random 20% sample of the Train set. The Test21 set is a subset of the Test set that is difficult for traditional models to classify correctly. The four sets contain 125973, 22544, 25192 and 11850 samples, respectively. Each record consists of 41 features, such as the type of network connection, protocol, duration, content and traffic.

(2) Data preprocessing:

First, the three symbolic features (protocol type, service and flag) are encoded into numeric features with a one-hot encoder, forming a vector of length 122, and all features are then normalized. PCA is then used to reduce the dimensionality and select the most important features. As shown in FIG. 2, the cumulative importance of the first 50 features is close to 1, so 50 dimensions are selected as the target number of PCA components.
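A minimal sketch of this preprocessing, with two assumptions that the text does not state: min-max scaling is used as the normalization, and the category counts in the comment (3 protocol types, 70 services, 11 flags) are the commonly reported NSL-KDD values. The helper names are hypothetical.

```python
def one_hot(value, categories):
    """Encode one symbolic value as a 0/1 vector over the known categories."""
    return [1.0 if value == cat else 0.0 for cat in categories]


def min_max_normalize(column):
    """Scale a numeric column to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


protocols = ["tcp", "udp", "icmp"]
encoded = one_hot("udp", protocols)
scaled = min_max_normalize([0.0, 5.0, 10.0])

# Assumed breakdown of the length-122 vector: 38 numeric features plus
# one-hot blocks of 3 (protocol type), 70 (service) and 11 (flag) columns.
vector_length = 38 + 3 + 70 + 11
```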

(3) Test evaluation criteria

In the experiments, the results of GC-Forest were compared with 9 other methods to demonstrate the effectiveness of the proposed algorithm, especially in comparison with CNN. To compare the algorithms comprehensively, accuracy (ACC), precision (PRE), recall (REC), F1 score and training time are used as performance indicators. Accuracy is the percentage of all samples that are correctly classified, precision reflects the attack detection capability, recall reflects the false-alarm capability, and the F1 score reflects the stability of the system. All reported results are averages over 10 runs.

The calculation formulas are (5), (6), (7) and (8), respectively:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$

$$PRE = \frac{TP}{TP + FP} \tag{6}$$

$$REC = \frac{TP}{TP + FN} \tag{7}$$

$$F1 = \frac{2 \times PRE \times REC}{PRE + REC} \tag{8}$$

wherein TP is the number of attack samples correctly predicted by the model, FP is the number of samples predicted as the attack class that actually belong to the normal class, TN is the number of samples correctly predicted as the normal class, and FN is the number of attack samples predicted as the normal class.
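The four indicators can be computed from the confusion counts as in the sketch below, which uses the standard definitions of accuracy, precision, recall and F1. The function name and the example counts are hypothetical and are not taken from the experiments.

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    return {"ACC": acc, "PRE": pre, "REC": rec, "F1": f1}


# Made-up counts for illustration: 40 attacks caught, 5 missed,
# 45 normal records passed, 10 normal records flagged.
scores = detection_metrics(tp=40, fp=10, tn=45, fn=5)
```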

(4) Experimental parameter settings

In terms of parameter setting, GC-Forest has significant advantages over CNN. Table 2 shows the parameters of GC-Forest.

TABLE 2 parameter settings for GC-Forest

As can be seen from Table 2, each forest has 50 decision trees, and the Gini function in equation (2) is selected as the evaluation function. For both accuracy and diversity, two different forest models are chosen: the trees of the random forest (RF) are generated with a maximum of 7 features, and the trees of the completely-random tree forest (C-RTF) are generated with 1 feature.

The number of input features is 50, chosen using a feature selection method combined with grid search. The combined importance of the first 50 features reaches 99.30%. The number and size of the scanning windows are important parameters of the model, so several sizes and combinations of sizes are tried in the experiments to find the optimal number and size.
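One way to pick the number of components from the per-component importance is to keep the smallest count whose cumulative importance reaches a target. The sketch below assumes the importances are variance ratios that sum to 1; the function name and the toy ratios are hypothetical.

```python
def components_for_threshold(importances, threshold):
    """Smallest number of components whose cumulative importance >= threshold."""
    total = 0.0
    ranked = sorted(importances, reverse=True)
    for k, value in enumerate(ranked, start=1):
        total += value
        if total >= threshold:
            return k
    return len(ranked)


# Toy ratios: the first three components together reach 95%.
k = components_for_threshold([0.5, 0.3, 0.15, 0.05], threshold=0.9)
```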

The relationship between the accuracy score and the CDF level is shown in FIG. 3. Window (10, 20) and window (10) give better scores than the other settings, and the curves converge near level 8.

FIGS. 4 and 5 compare the accuracy scores on the Test set and the Test21 set, respectively, for different window settings. In FIG. 4, the windows (10, 30) give the best score; the accuracy shows a periodic, step-like variation as the window size and number vary, with better results when the window size is chosen to be 10. In FIG. 5, window (30) gives the highest score, and the accuracy gradually increases as the window size is varied.

The accuracy of PCA-GC-Forest is compared with that of 9 other methods in Table 3; the accuracy data of the other 9 methods are taken from published articles. As can be seen from Table 3, PCA-GC-Forest obtains the best scores on both the Test and Test21 sets, demonstrating the superiority of the method. The two best methods are PCA-GC-Forest and NPCNN, which illustrates the advantage of feature learning in intrusion detection. As an ensemble of RFs, GC-Forest achieves an accuracy 5.48 percentage points higher than RF on the Test set, because the ensemble has good learning ability.

TABLE 3 results of accuracy tests of ten methods

Method	Test ACC (%)	Test21 ACC (%)
J48	81.05	63.97
NB	76.56	55.77
NBTree	82.02	66.16
RandomForest	80.67	63.26
RandomTree	81.59	58.51
MLP	77.41	57.34
SVM	69.52	42.29
CNN	78.76	60.02
NPCNN	82.59	69.20
PCA-GC-Forest	86.15	75.26

The test results of PCA-GC-Forest and CNN trained on the Train and Train20 sets, respectively, are shown in Table 4. The CNN uses the VGG-16 model. For both the Train and Train20 sets, the test accuracy of PCA-GC-Forest is higher than that of CNN, indicating the superiority of PCA-GC-Forest in overall detection. In addition, its better precision indicates that PCA-GC-Forest is less likely to miss attacks, and its higher F1 score shows that PCA-GC-Forest is more powerful than CNN. Comparing the Train and Train20 scores, the scores of PCA-GC-Forest decrease less than those of CNN, except for the precision and recall of the normal class, which shows that PCA-GC-Forest performs better on small datasets.

TABLE 4 test set results for GC-Forest and CNN

The training times of GC-Forest and CNN are shown in Table 5, where the CDF of GC-Forest has 7 levels and the CNN has 7 convolutional layers.

TABLE 5 training times for GC-Forest and CNN

Method	Train (s)	Train20 (s)
GC-Forest	391.84	96.20
CNN	599.81	164.77

It can be seen that the training time for GC-Forest is shorter than that for CNN, indicating that GC-Forest is more efficient than CNN.
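The size of the gap can be read off Table 5 with a short calculation. The sketch below only restates the table values; the variable and function names are hypothetical.

```python
# Training times in seconds, copied from Table 5.
train_times = {
    "GC-Forest": {"Train": 391.84, "Train20": 96.20},
    "CNN": {"Train": 599.81, "Train20": 164.77},
}


def relative_saving(fast, slow):
    """Fraction of the slower training time that the faster method saves."""
    return 1.0 - fast / slow


savings = {
    dataset: relative_saving(train_times["GC-Forest"][dataset],
                             train_times["CNN"][dataset])
    for dataset in ("Train", "Train20")
}

for dataset, saving in savings.items():
    print(f"{dataset}: GC-Forest needs {saving:.1%} less training time than CNN")
```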

The experimental result fully shows the superiority of the PCA-GC-Forest intrusion detection method, and the accuracy rate and the training speed are both very satisfactory.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. An intrusion detection method based on GC-Forest, characterized by comprising the following steps:

(1) performing feature selection and dimension reduction on the acquired network data by using principal component analysis to obtain sample data;

(2) using the sample data as the original feature data for multi-granularity scanning, scanning the original feature data respectively with a plurality of windows of different sizes, passing the scanned data through two random forests to form enhanced feature data, and reconstructing the enhanced feature data and the original feature data to form new feature data;

(3) training the cascade forest with the reconstructed new feature data.

2. The GC-Forest based intrusion detection method according to claim 1, wherein the algorithm of the cascade forest is as follows: a multi-level network architecture is constructed using a cascade structure, in which the output vector of the first level is regarded as enhanced features, and the enhanced features are concatenated with the original features and used as the input of the next level.

3. The GC-Forest based intrusion detection method according to claim 1, wherein each level of the cascade forest uses a completely-random tree forest and a random forest.