CN112910866B

CN112910866B - Feature selection method for network intrusion detection

Info

Publication number: CN112910866B
Application number: CN202110076965.9A
Authority: CN
Inventors: 李珊珊; 韦世红; 李兆玉; 赖雪梅
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2021-01-20
Filing date: 2021-01-20
Publication date: 2022-07-29
Anticipated expiration: 2041-01-20
Also published as: CN112910866A

Abstract

The invention relates to feature selection technology in the field of network security, and in particular to a feature selection method for network intrusion detection. The method comprises: acquiring a network intrusion data set and preliminarily screening it by correlation degree to obtain a first network intrusion feature subset; obtaining a second network intrusion feature subset from the preliminarily screened subset by using an improved recursive feature elimination algorithm based on a random forest; and taking the second network intrusion feature subset as one part of the initial population of a genetic algorithm, randomly generating the other part, and iterating to obtain a third network intrusion feature subset. To address feature selection for high-dimensional network intrusion data, the data set is first screened with a correlation measure, the improved recursive feature elimination algorithm based on a random forest classifier then yields a good feature subset, and the genetic algorithm, seeded with that subset, obtains the feature subset with the best classification effect.

Description

Feature selection method for network intrusion detection

Technical Field

The invention relates to a feature selection technology in the field of network security, in particular to a feature selection method for network intrusion detection.

Background

With the rapid development of network technology, network environments carry massive amounts of data, and network security has become critically important. An insecure network environment can lead to privacy leaks, resource theft and other problems, causing considerable losses in people's work and daily life. Network intrusion detection has therefore become a research hotspot. Network intrusion detection analyzes network information to determine whether behaviors that violate the security policy, or signs of attack, exist in the network.

Feature selection on network intrusion data is a crucial step in network intrusion detection and directly affects subsequent detection performance. Feature selection can effectively reduce data dimensionality and computational complexity and improve classifier accuracy. Its aim is to remove from the feature set those features that have low relevance to the class labels or are highly redundant, and to find a representative feature subset that is as small as possible and gives the best result while preserving the classification effect as far as possible. According to how the feature set is combined with the learning algorithm, common feature selection methods fall into three categories: filter, wrapper and embedded.

Commonly used feature selection methods include the mutual information method, the chi-square test, and combinations of swarm intelligence algorithms with classification algorithms. These achieve good classification performance, but they do not address the removal of weakly relevant features and redundant features at the same time. Moreover, as the dimensionality and scale of network intrusion data keep growing, the overhead of intrusion detection algorithms increases, and redundant and irrelevant attributes degrade the detection effect. How to select a feature subset with the highest accuracy and the fewest features therefore remains a research difficulty.

Disclosure of Invention

Based on the problems in the prior art, in order to select a network intrusion feature subset with a better classification effect, the invention provides a feature selection method for network intrusion detection, which comprises the following steps:

S1, carrying out preliminary screening on the network intrusion data set by using the correlation degree to obtain a first network intrusion feature subset;

S2, obtaining a second network intrusion feature subset from the preliminarily screened subset by adopting an improved recursive feature elimination algorithm based on a random forest;

S3, taking the second network intrusion feature subset as one part of the initial population features of the genetic algorithm, randomly generating the other part of the initial population features, and iterating to obtain a third network intrusion feature subset.

The invention has the beneficial effects that:

To address feature selection for high-dimensional data, and in view of the high dimensionality and large scale of network intrusion data, the invention first uses mutual information as the correlation measure to preliminarily screen the network intrusion data set. It then uses an improved recursive feature elimination algorithm based on a random forest classifier to obtain a good feature subset, uses that subset as one part of the initial population of a genetic algorithm while generating the other part randomly, and uses the genetic algorithm to obtain the feature subset with the best classification effect.

Drawings

Fig. 1 is a flowchart of a feature selection method for network intrusion detection according to an embodiment of the present invention;

FIG. 2 is a flow chart of the preliminary screening of feature sets in an embodiment of the present invention;

FIG. 3 is a flow chart of an improved algorithm for recursive feature elimination based on a random forest according to an embodiment of the present invention;

FIG. 4 is a flow chart of random forest generation in an embodiment of the present invention;

FIG. 5 is a flow chart of feature selection based on a genetic algorithm in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As computer network technology develops, network intrusion techniques change constantly, and traditional simple network defenses can no longer solve today's network intrusion problems. Network intrusion detection is in essence a classification problem, involving techniques such as feature selection, classification model selection and parameter optimization. The invention builds on feature selection technology to improve the classification accuracy of intrusion detection on network intrusion data.

FIG. 1 is a flowchart of a feature selection method for network intrusion detection according to an embodiment of the present invention. As shown in FIG. 1, the feature selection process includes:

S1, carrying out preliminary screening on the network intrusion data set by using the correlation degree to obtain a first network intrusion feature subset;

First, the invention uses the NSL-KDD data set for testing and analysis. In feature selection, choosing a data set is the first step in studying and evaluating a method, and the quality of the data set directly determines the evaluation results of the various methods. The feature selection results on the data set adopted here reflect characteristics common to most data sets; therefore, although the final feature selection accuracy differs slightly from one data set to another, the feature selection method of the invention can generally improve selection accuracy.

The NSL-KDD data set adopted in the embodiment of the invention is a cleaned version of the KDD99 data set from which a large amount of redundant, duplicated data has been removed, making it more suitable for intrusion detection experiments. The NSL-KDD data set still has some problems, but it remains an effective benchmark that helps researchers compare different intrusion detection methods. The training set and test set of NSL-KDD are set up reasonably, so the evaluation results of different research works are consistent and comparable. The data set comprises a training set and a test set; the training set contains 22 attack types and the test set contains 17 attack types, which fall into four main categories: DoS, Probe, U2R and R2L. The data set contains 41 features and 1 class label, of which three features, protocol_type, service and flag, are categorical.

Next, the network intrusion data set is preprocessed. The embodiment of the invention processes discrete data with one-hot encoding, converting character-type features into numerical features. Numerical standardization is then carried out, including data standardization and numerical normalization, so that every feature lies in the same order of magnitude and contributes equally to the intrusion detection classification result.
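The preprocessing described above can be illustrated with a minimal Python sketch. It assumes min-max scaling to [0, 1] as the normalization step, which the text does not specify; the helper names `one_hot` and `min_max_scale` and the toy values are hypothetical and are not taken from the embodiment.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded


def min_max_scale(column):
    """Rescale a numeric column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Toy columns; the values are illustrative only.
protocol_type = ["tcp", "udp", "icmp", "tcp"]
src_bytes = [0, 500, 1000, 250]

categories, protocol_encoded = one_hot(protocol_type)
src_bytes_scaled = min_max_scale(src_bytes)
```

After this step every column is numeric and lies in the same range, so no single feature dominates the classifier simply because of its scale.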

Of course, the preprocessing is mainly performed on each feature in the data set.

FIG. 2 is a flowchart of the preliminary screening of the feature set in an embodiment of the present invention. As shown in FIG. 2, the preliminary screening of the network intrusion data set by using the correlation degree includes:

S11, calculating the mutual information between each feature in the network intrusion data set and the category label set;

Suppose $f_i$ is the $i$-th feature in the feature set $F$ of the network intrusion data set, with $f_i = \{f_{i1}, f_{i2}, \dots, f_{il}\}$, where $l$ is the number of values taken by feature $f_i$, that is, each feature $f_i$ comprises $l$ items of sub-information. Let $c_k$ denote the $k$-th class in the category label set $C = \{c_1, c_2, \dots, c_s\}$ of the network intrusion data set. The mutual information between feature $f_i$ and the label set $C$ is then defined as:
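For reference, the standard definition of mutual information between a discrete feature and a discrete label set, written with the symbols above, is

$$I(f_i; C) = \sum_{j=1}^{l} \sum_{k=1}^{s} p(f_{ij}, c_k) \log \frac{p(f_{ij}, c_k)}{p(f_{ij})\, p(c_k)}$$

The following Python sketch estimates this quantity from empirical frequencies. It assumes the standard definition above, base-2 logarithms and already-discretized feature values; the function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Discrete mutual information I(f; C) in bits, from empirical frequencies."""
    n = len(feature)
    joint = Counter(zip(feature, labels))
    count_f = Counter(feature)
    count_c = Counter(labels)
    mi = 0.0
    for (f, c), count in joint.items():
        p_joint = count / n
        p_f = count_f[f] / n
        p_c = count_c[c] / n
        mi += p_joint * math.log2(p_joint / (p_f * p_c))
    return mi


labels = ["normal", "normal", "attack", "attack"]
informative = [0, 0, 1, 1]    # determines the label completely
uninformative = [0, 1, 0, 1]  # statistically independent of the label

mi_informative = mutual_information(informative, labels)
mi_uninformative = mutual_information(uninformative, labels)
```

A feature that fully determines a balanced binary label carries one bit of information about it, while an independent feature carries none.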

S12, calculating the correlation degree between the feature set and the category label set by using the mutual information, wherein the formula is as follows:

where m represents the dimension of the feature.

And S13, based on the correlation degree between the feature set and the category label set, carrying out primary screening on the feature set according to the order of the correlation degree.

The features ranked highest by correlation degree are taken as the first network intrusion feature subset $S_1$.

Extensive testing shows that keeping the top 80% to 90% of the features by correlation degree works best; in the preferred embodiment of the present invention, the top 85% of the features by correlation degree are selected as the first network intrusion feature subset $S_1$.
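The ranking-and-cut step S13 might be sketched as follows. The scores are made-up placeholders for the correlation degrees, and rounding the number of kept features down (with a minimum of one) is an assumption, since the text does not say how a fractional count is handled.

```python
import math


def select_top_fraction(scores, fraction=0.85):
    """Rank features by score (descending) and keep the top `fraction`."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, math.floor(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical correlation degrees for ten features f1..f10.
scores = {f"f{i}": i / 10 for i in range(1, 11)}
selected = select_top_fraction(scores)
```

With ten features and a fraction of 0.85, eight features are kept and the two lowest-scoring ones are discarded.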

The method first uses mutual information as the correlation measure to preliminarily screen the original network intrusion data set. It then uses a random forest as the classifier and introduces feature redundancy as a measure of redundant features: while the variable with the minimum importance is deleted, the feature with the maximum redundancy is deleted as well, yielding a feature subset with strong correlation and low redundancy.

Specifically, FIG. 3 is a flowchart of the improved recursive feature elimination algorithm based on a random forest in the embodiment of the present invention. As shown in FIG. 3, the process includes:

S21, calculating the importance of each feature in the feature set by using a random forest classifier;

FIG. 4 is a flowchart of random forest generation according to an embodiment of the present invention. As shown in FIG. 4, the random forest is generated as follows:

randomly drawing a portion of the samples with replacement;

randomly drawing a portion of the features as candidate features;

determining the test feature among the candidate features by using the Gini index;

generating a node of the decision tree;

if the node can become a leaf node and the decision tree stops growing, storing the decision tree; otherwise, branching;

when the number of decision trees meets the requirement, generating the random forest; otherwise, returning to the sample extraction step and continuing the loop.
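The step of determining the test feature by the Gini index can be sketched as below. The sketch assumes numeric features and binary threshold splits of the form "value <= threshold", which is one common choice and is not stated in the text; all function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(rows, labels, feature, threshold):
    """Weighted Gini impurity after splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_feature(rows, labels, candidates):
    """Return (impurity, feature, threshold) with the lowest weighted impurity."""
    best = None
    for feature in candidates:
        for threshold in sorted({x[feature] for x in rows}):
            score = split_gini(rows, labels, feature, threshold)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


rows = [[0.1, 5], [0.2, 1], [0.8, 4], [0.9, 2]]
labels = ["normal", "normal", "attack", "attack"]
best = best_feature(rows, labels, candidates=[0, 1])
```

Here feature 0 separates the two classes perfectly at threshold 0.2, so it is chosen as the test feature for the node.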

The method for calculating the feature importance by utilizing the random forest comprises the following steps:

For each decision tree, calculate the out-of-bag data error $\mu_1$.

Randomly add noise interference to feature $f_i$ of the out-of-bag data samples, and calculate the out-of-bag data error again as $\mu_2$.

Let $N$ be the number of decision trees in the random forest; the importance of feature $f_i$ is then
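A minimal sketch of this out-of-bag importance estimate is given below. It assumes that the importance of $f_i$ is the average of $\mu_2 - \mu_1$ over the $N$ trees, which is the usual permutation-importance form and is not spelled out in the text, and that the noise is injected by shuffling the feature's values among the out-of-bag samples. The tree representation, a `(predict, oob_indices)` pair with a non-empty index list, is hypothetical.

```python
import random


def error_rate(predict, rows, labels):
    """Fraction of rows that `predict` classifies incorrectly."""
    wrong = sum(1 for x, y in zip(rows, labels) if predict(x) != y)
    return wrong / len(rows)


def oob_importance(trees, rows, labels, feature, rng):
    """Mean increase in out-of-bag error after perturbing one feature.

    `trees` is a list of (predict, oob_indices) pairs.
    """
    total = 0.0
    for predict, oob in trees:
        oob_rows = [rows[i] for i in oob]
        oob_labels = [labels[i] for i in oob]
        mu1 = error_rate(predict, oob_rows, oob_labels)
        # Assumption: "noise" means shuffling this feature across OOB samples.
        shuffled = [x[feature] for x in oob_rows]
        rng.shuffle(shuffled)
        noisy = [x[:feature] + [v] + x[feature + 1:]
                 for x, v in zip(oob_rows, shuffled)]
        mu2 = error_rate(predict, noisy, oob_labels)
        total += mu2 - mu1
    return total / len(trees)


def stump(x):
    """Toy tree that predicts the class directly from feature 0."""
    return x[0]


rows = [[0, 7], [0, 3], [1, 9], [1, 1]]
labels = [0, 0, 1, 1]
trees = [(stump, [0, 1, 2, 3]), (stump, [1, 2])]
rng = random.Random(0)
importance_f0 = oob_importance(trees, rows, labels, 0, rng)
importance_f1 = oob_importance(trees, rows, labels, 1, rng)
```

Because the toy trees never look at feature 1, perturbing it cannot change the error, so its importance is zero; perturbing feature 0 can only leave the error unchanged or raise it.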

S211, letting the size of the training set be N, and for each tree in the random forest, randomly drawing N training samples from the training set with replacement to form that tree's training set;

S212, letting the feature dimension of each sample be M, randomly selecting a subset of m features (m < M) from the M features as input to the decision tree, and computing the optimal split each time the tree splits;

S213, growing each tree fully without pruning;

S214, taking the vote of all trees as the final result of the random forest classification.
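The sampling and voting parts of steps S211 to S214 can be sketched with the standard library alone. Tree construction itself is omitted; the helper names are hypothetical, and drawing 6 of 41 features (roughly the square root of 41) is an assumed setting rather than one given in the text.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n sample indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def random_feature_subset(total_features, m, rng):
    """Choose m distinct feature indices out of total_features."""
    return sorted(rng.sample(range(total_features), m))


def majority_vote(predictions):
    """Return the most common class among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_indices(10, rng)
subset = random_feature_subset(41, 6, rng)
verdict = majority_vote(["attack", "normal", "attack"])
```

The samples that a tree never draws form its out-of-bag set, which is what the importance calculation above relies on.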

S22, sorting the features according to the importance;

S23, iteratively deleting the feature with the minimum importance;

S24, calculating the redundancy of each feature;

Assume the feature set is $A = \{f_1, f_2, \dots, f_k\}$. The redundancy of any feature $f_i$ in $A$ with respect to the other features in the set is defined as:

S25, while deleting the feature with the minimum importance, also deleting the feature with the maximum redundancy;

S26, taking the remaining feature set as the second network intrusion feature subset $S_2$, which is a feature subset with strong correlation and low redundancy.
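The elimination loop of steps S21 to S26 might look like the sketch below. Importance and redundancy are passed in as callbacks, so the sketch does not commit to any particular redundancy formula. Stopping at a fixed `target_size` and removing exactly one feature of each kind per pass are assumptions, and all names and scores are hypothetical.

```python
def improved_rfe(features, importance_fn, redundancy_fn, target_size):
    """Recursive elimination that drops the least important feature and
    then the most redundant feature on each pass.

    importance_fn(remaining) -> {feature: importance}
    redundancy_fn(remaining) -> {feature: redundancy}
    """
    remaining = list(features)
    while len(remaining) > target_size:
        importance = importance_fn(remaining)
        weakest = min(remaining, key=importance.get)
        remaining.remove(weakest)
        if len(remaining) <= target_size:
            break
        redundancy = redundancy_fn(remaining)
        most_redundant = max(remaining, key=redundancy.get)
        remaining.remove(most_redundant)
    return remaining


# Made-up scores; in practice these would be recomputed on each pass.
IMPORTANCE = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.5, "e": 0.4}
REDUNDANCY = {"a": 0.1, "b": 0.9, "c": 0.2, "d": 0.3, "e": 0.2}

subset = improved_rfe(
    ["a", "b", "c", "d", "e"],
    importance_fn=lambda remaining: {f: IMPORTANCE[f] for f in remaining},
    redundancy_fn=lambda remaining: {f: REDUNDANCY[f] for f in remaining},
    target_size=3,
)
```

In this toy run feature `c` is removed for low importance and feature `b` is removed for high redundancy even though it is important, which is the behavior that distinguishes this variant from plain recursive feature elimination.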

Existing feature selection methods based on recursive feature elimination generally use a classifier to rank feature importance and then delete the least important features one at a time. The method of the invention improves on this by considering both the importance of each feature and the redundancy between features; screening the feature subset from these two angles can effectively improve the accuracy obtained with the feature subset.

FIG. 5 is a flow chart of feature selection based on genetic algorithm in the embodiment of the present invention, and the process of executing the genetic algorithm of the present invention includes:

S31, using the second network intrusion feature subset $S_2$ as one part of the initial population, and randomly generating the other part of the initial population;

S32, encoding each individual in the initial population in binary;

S33, calculating the fitness value of each individual according to the fitness function;

S34, using the tournament algorithm as the selection operator, selecting the individuals with the highest fitness values to pass on to the next generation;

S35, performing crossover and mutation to generate the next-generation population;

S36, repeating steps S33 to S35 until the maximum number of iterations is reached or the fitness value of the current population reaches a preset threshold, which is set to 0.9999 in the embodiment of the present invention; the genetic algorithm then terminates and outputs the decoded third network intrusion feature subset $S_3$.

In some preferred embodiments, using the tournament algorithm as the selection operator comprises:

S341, determining the number N of individuals selected each time;

S342, randomly selecting N individuals from the population, and selecting the one with the best fitness value to enter the next-generation population;

S343, repeating this process until the new population reaches the size of the original population.
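Steps S31 to S36 and S341 to S343 can be put together in a compact sketch. Only the 0.9999 threshold comes from the embodiment; the half-and-half split between seeded and random individuals, single-point crossover, bit-flip mutation, the parameter values and the toy fitness function are all assumptions made for illustration. The sketch also assumes an even population size and masks of at least two bits.

```python
import random


def tournament_select(population, fitness, k, rng):
    """Pick k individuals at random and return the fittest of them."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)


def run_ga(seed_mask, fitness, pop_size=20, k=3, max_iter=50,
           threshold=0.9999, mutation_rate=0.05, rng=None):
    """Seeded genetic algorithm over binary feature masks (1 = keep feature)."""
    rng = rng or random.Random()
    n = len(seed_mask)
    # S31: part of the population is the seeded subset, the rest is random.
    population = [list(seed_mask) for _ in range(pop_size // 2)]
    population += [[rng.randint(0, 1) for _ in range(n)]
                   for _ in range(pop_size - len(population))]
    best = max(population, key=fitness)
    for _ in range(max_iter):
        if fitness(best) >= threshold:
            break
        # S341-S343: run tournaments until the new population is full.
        parents = [tournament_select(population, fitness, k, rng)
                   for _ in range(pop_size)]
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            cut = rng.randrange(1, n)  # single-point crossover
            children += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        # Bit-flip mutation on every child.
        population = [[bit ^ 1 if rng.random() < mutation_rate else bit
                       for bit in child] for child in children]
        best = max(population + [best], key=fitness)
    return best


# Hypothetical fitness: agreement with a made-up "ideal" mask. In practice
# the fitness would come from a classifier run on the selected features.
ideal = [1, 0, 1, 1, 0, 0, 1, 0]


def fitness(mask):
    return sum(a == b for a, b in zip(mask, ideal)) / len(ideal)


seed = [1, 0, 1, 0, 0, 0, 1, 1]  # stands in for the subset from step S2
best = run_ga(seed, fitness, rng=random.Random(42))
selected = [i for i, bit in enumerate(best) if bit]  # decode mask to indices
```

Because the best individual found so far is always retained, the returned mask is never worse than the seeded subset, which reflects the motivation for seeding the initial population.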

In the original genetic algorithm, the initial population is generated by random selection from the original feature set. In the present invention, the feature subset obtained by the recursive feature elimination algorithm forms one part of the initial population of the genetic algorithm, and the other part is randomly generated.

In the description of the present invention, it is to be understood that the terms "coaxial", "bottom", "one end", "top", "middle", "other end", "upper", "one side", "top", "inner", "outer", "front", "center", "both ends", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplicity of description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention.

In the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "disposed," "connected," "fixed," "rotated," and the like are to be construed broadly, e.g., as being fixedly connected, detachably connected, or integrated; can be mechanically or electrically connected; the terms may be directly connected or indirectly connected through an intermediate, and may be communication between two elements or interaction relationship between two elements, unless otherwise specifically limited, and the specific meaning of the terms in the present invention will be understood by those skilled in the art according to specific situations.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A feature selection method for network intrusion detection, the feature selection method comprising the steps of:

S1, collecting a network intrusion data set, and preliminarily screening the network intrusion data set by using the correlation degree to obtain a first network intrusion feature subset;

S2, obtaining a second network intrusion feature subset from the preliminarily screened subsets by adopting a random forest-based recursive feature elimination improvement algorithm, wherein the step S2 comprises the following steps:

S21, calculating the importance of each feature in the first network intrusion feature subset by using a random forest classifier;

S22, sorting the features according to the importance;

S23, iteratively deleting the features with the minimum importance;

S24, calculating the redundancy of each feature;

S26, taking the remaining feature set as a second network intrusion feature subset, wherein the second network intrusion feature subset is a feature subset with strong correlation and low redundancy;

S3, taking the second network intrusion feature subset as one part of the initial population features of a genetic algorithm, randomly generating the other part of the initial population features, and iterating to obtain a third network intrusion feature subset, wherein the third network intrusion feature subset is generated as follows:

encoding the individual features in each initial population feature in binary; calculating the fitness value of each individual feature according to the fitness function; using the tournament algorithm as the selection operator, and selecting the individual feature with the highest fitness value to pass on to the next generation; performing crossover and mutation to generate the next-generation population features; and when the maximum number of iterations is reached or the fitness value of the current population features reaches a preset threshold, stopping the iteration and outputting the decoded third network intrusion feature subset.

2. The feature selection method for network intrusion detection according to claim 1, wherein the preliminary screening of the network intrusion data set by using the correlation degree further comprises preprocessing the network intrusion data set and dividing it into a plurality of features and a class label; discrete network intrusion data are processed with one-hot encoding, character-type features are converted into numerical features, and numerical standardization is applied to the numerical features, the numerical standardization comprising data standardization and numerical normalization.

3. The method of claim 1, wherein the preliminary screening of the network intrusion data set using the correlation degree comprises:

calculating the mutual information between each feature in the network intrusion data set and the category label set;

calculating the correlation degree between the feature set and the category label set by using the mutual information;

and sorting according to the degree of correlation based on the degree of correlation between the feature set and the category label set, and carrying out primary screening on the feature set.

4. The feature selection method for network intrusion detection according to claim 1, wherein the first network intrusion feature subset is generated by calculating the correlation degree between each feature and the category label set, and taking the features ranked highest by correlation degree as the first network intrusion feature subset.

5. The feature selection method for network intrusion detection according to claim 1, wherein using the tournament algorithm as the selection operator comprises randomly selecting a plurality of individual features from the population features, and selecting from them the individual feature with the highest fitness value to enter the next-generation population features; and repeating these steps until the new population is the same size as the initial population.