CN114897124A

CN114897124A - Intrusion detection feature selection method based on improved grey wolf optimization algorithm

Info

Publication number: CN114897124A
Application number: CN202210321742.9A
Authority: CN
Inventors: 贺敬; 刘泽超; 施然; 李思照
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2022-03-25
Filing date: 2022-03-25
Publication date: 2022-08-12

Abstract

The invention provides an intrusion detection feature selection method based on an improved grey wolf optimization algorithm. The wolf pack is perturbed using Cauchy mutation and Lévy flight so that it can escape local optima, which improves the global search capability, yields an optimal feature subset, and improves the accuracy of intrusion detection. When the grey wolf population is initialized, a Logistic chaotic map is used, which improves the quality of the initial population and the exploitation capability of the algorithm. Perturbation is applied when the grey wolf population is updated, which compensates to some extent for the tendency to fall into local optima and improves the global search capability. Finally, an optimal feature subset with a small number of features is obtained through training and applied to the test set, yielding a higher detection accuracy.

Description

Intrusion detection feature selection method based on improved grey wolf optimization algorithm

Technical Field

The invention relates to the grey wolf optimization algorithm and to related classification and recognition techniques such as kNN, and concerns a method for finding an optimal feature subset in an intrusion detection system.

Background

As network security risks continue to grow, more and more security protection measures have emerged. Intrusion detection technology can effectively detect both known and unknown attacks and provides identification and early-warning capabilities, which has made it one of the safest and most commonly used protective measures recognized in the industry. Although intrusion detection systems can be classified in various ways according to different criteria, many of them suffer from problems such as low detection efficiency. As the volume of network traffic grows exponentially, the frequency and complexity of intrusions by network attackers also keep increasing, which makes it difficult to accurately detect attacks or anomalies in high-dimensional network traffic, and the time cost grows exponentially with the data volume. Analysis of network traffic data shows that the data contain a large number of features, a considerable share of which are invalid or redundant. Removing the redundant features reduces the dimensionality of the high-dimensional network traffic, which effectively lowers the time cost and increases the accuracy.

Feature selection is an effective way to reduce the dimensionality of high-dimensional traffic data: the most representative and most valuable feature subset is selected from all features, thereby achieving dimensionality reduction. Feature selection algorithms fall into three categories: filter, wrapper, and embedded. Filter methods select the features first and then train the classifier, and no learning algorithm is involved in selecting the feature subset. Wrapper methods incorporate various intelligent algorithms when selecting feature subsets and search for an optimal feature subset in the solution space formed by all features; they are currently the most widely studied feature selection algorithms. Embedded methods combine the two approaches and treat feature selection as one stage of training the classifier.

The grey wolf is an apex predator in nature. Grey wolves prefer to live in packs and have a strict social hierarchy, so the strongest, highest-ranking leader wolf can be clearly identified within a pack. The grey wolf optimization algorithm simulates the pack hunting behavior of grey wolves, which gradually encircle the prey, that is, the optimal fitness; each wolf represents one feature subset solution. The algorithm converges strongly and is easy to implement because it involves few parameters, but it also tends to fall into local optima. Feature selection based on the grey wolf optimization algorithm is a wrapper-type feature selection method: when searching the solution space for the optimal feature subset, the algorithm uses several feature subsets as base points and searches in their vicinity until the optimal feature subset is obtained. This reduces the dimensionality of high-dimensional traffic data, lowers the cost, improves intrusion detection accuracy, and thereby rapidly improves the efficiency of network defense.

Disclosure of Invention

The invention aims to solve the problem that a large number of data features, redundant features, and similar factors in the intrusion detection process lead to low accuracy and low speed in the final detection result. Traditional intrusion detection relies on existing techniques to inspect traffic data and the like, and uses the grey wolf optimization algorithm to select a feature subset from the complex data features; however, the existing algorithm tends to fall into local optima, which affects the selection of the optimal feature subset and the final accuracy. The invention provides an intrusion detection feature selection technique based on a grey wolf optimization algorithm that combines Cauchy mutation and Lévy flight. Cauchy mutation and Lévy flight are used to perturb the wolf pack so that it escapes local optima, which improves the global search capability, yields an optimal feature subset, and improves the accuracy of intrusion detection.

The invention is realized by the following specific steps:

Step 1: Take 80% of the KDD Cup99 data set as the training set and 20% as the test set. kNN is chosen as the intrusion detection classifier, with k = 5, where k denotes the number of nearest neighbors: each sample is represented by its k nearest neighbors.

Step 2: Set the parameters of the improved grey wolf algorithm. The pack size N is 30 and the maximum number of iterations T is 50. Initialize the population: the position of each wolf is represented by a vector X whose dimension equals the number of features of the selected data set, 41. Initialization produces the first-generation wolf pack.

Step 3: Convert all positions of the grey wolf population to binary using 0.5 as the threshold. In each dimension, a value greater than 0.5 is set to 1, indicating that the feature represented by that dimension is selected; otherwise it is set to 0, indicating that the feature is not selected.
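The thresholding in step 3 can be illustrated with a minimal Python sketch. The function names `binarize` and `selected_indices` are hypothetical and chosen only for this illustration.

```python
def binarize(position, threshold=0.5):
    """Map a continuous wolf position to a 0/1 feature mask.

    A component strictly greater than the threshold selects the feature.
    """
    return [1 if x > threshold else 0 for x in position]


def selected_indices(mask):
    """Return the indices of the features that the mask selects."""
    return [i for i, bit in enumerate(mask) if bit == 1]


# Toy 4-dimensional position (the method itself uses 41 dimensions).
position = [0.12, 0.73, 0.50, 0.91]
mask = binarize(position)
print(mask)                    # [0, 1, 0, 1]
print(selected_indices(mask))  # [1, 3]
```

A value of exactly 0.5 is treated as "not selected" here, because the step only selects values that exceed the threshold.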

Step 4: Apply the feature selection scheme corresponding to each binary grey wolf position vector to the kNN training set to select the corresponding feature data.

Step 5: Train on the feature data obtained in step 4 to obtain the accuracy

$$\mathrm{Accuracy} = \frac{TP}{TP + FP}$$

where TP is the number of correct detections and FP is the number of erroneous detections; the error rate is $\mathrm{error} = 1 - \mathrm{Accuracy}$.

Step 6: Calculate the fitness of the feature subset

$$\mathrm{fitness} = a \cdot \mathrm{error} + b \cdot \frac{m}{n}$$

where a and b are weights (a is typically 0.99 and b is typically 0.01), m is the number of features in this feature subset, and n is the total number of features. This formula is used as the fitness value because feature selection must take two factors into account: accuracy and feature subset length.
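A minimal Python sketch of the accuracy and fitness calculation in steps 5 and 6 follows. It assumes that the accuracy is TP / (TP + FP) and that the fitness is the weighted sum a·error + b·m/n, with n taken as the total number of features (41 for KDD Cup99). The function names are illustrative only.

```python
def accuracy(tp, fp):
    """Fraction of correct detections (assumed form: TP / (TP + FP))."""
    return tp / (tp + fp)


def fitness(acc, m, n, a=0.99, b=0.01):
    """Weighted sum of error rate and relative subset size; lower is better.

    acc: classification accuracy of the feature subset
    m:   number of features in the subset
    n:   total number of features
    """
    error = 1.0 - acc
    return a * error + b * (m / n)


# Two subsets with the same accuracy: the smaller subset gets the lower fitness.
acc = accuracy(tp=950, fp=50)
print(fitness(acc, m=10, n=41) < fitness(acc, m=30, n=41))  # True
```

With a = 0.99 and b = 0.01 the error rate dominates, and the subset size mainly breaks ties between subsets of similar accuracy.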

Step 7: Sort the fitness values of the feature subset schemes calculated in step 6 and designate the wolves with the three smallest values as the three leader wolves α, β, and γ. The α wolf is the head wolf, the wolf with the best fitness, and corresponds to the current best feature subset. The β wolf has the second-best fitness, is subordinate to the head wolf, and corresponds to the current second-best feature subset. The γ wolf has the third-best fitness, is subordinate to the α and β wolves, and corresponds to the current third-best feature subset.
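The leader selection in step 7 amounts to sorting by fitness and taking the three smallest values. A short sketch, with a hypothetical helper name `select_leaders` and placeholder wolf labels in place of position vectors:

```python
def select_leaders(population, fitness_values):
    """Return the alpha, beta, and gamma wolves (three smallest fitness values)."""
    ranked = sorted(range(len(population)), key=lambda i: fitness_values[i])
    return tuple(population[i] for i in ranked[:3])


# Placeholder labels stand in for position vectors.
population = ["w0", "w1", "w2", "w3"]
fitness_values = [0.30, 0.05, 0.20, 0.10]
alpha, beta, gamma = select_leaders(population, fitness_values)
print(alpha, beta, gamma)  # w1 w3 w2
```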

Step 8: Generate a random number in [0, 1) to determine the perturbation strategy for the α, β, and γ wolves. With 0.5 as the boundary, perform step 8.1 if the random number lies in [0, 0.5]; otherwise perform step 8.2.

Step 8.1: Apply Cauchy mutation perturbation to α, β, and γ in turn to obtain their new positions.

Step 8.2: Apply Lévy flight perturbation to α, β, and γ in turn to obtain their new positions,

where $w$ denotes a random number following the standard normal distribution and $\Gamma(x) = (x-1)!$.

Step 8.3: Calculate the corresponding fitness values from the result of step 8.1 or 8.2, then store and update the positions and fitness values of the three wolves.
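The strategy selection of step 8 can be sketched in Python as follows. The sketch assumes commonly used forms of the two perturbations, which may differ from the formulas of the invention: a multiplicative Cauchy mutation x + x·Cauchy(0, 1), and a Lévy step generated with Mantegna's method. The exponent `beta = 1.5`, the step scale `0.01`, the keep-the-better rule in `keep_better`, and all function names are assumptions made for illustration.

```python
import math
import random


def cauchy_mutation(position, rng):
    """Assumed form: x + x * c, with c drawn from a standard Cauchy distribution."""
    # Inverse-CDF sampling of the standard Cauchy: tan(pi * (U - 0.5)).
    return [x + x * math.tan(math.pi * (rng.random() - 0.5)) for x in position]


def levy_sigma(beta=1.5):
    """Scale parameter of Mantegna's method, built from the gamma function."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    return (num / den) ** (1 / beta)


def levy_step(rng, beta=1.5):
    """One Levy-distributed step: u / |v|^(1/beta), u = w * sigma, v, w ~ N(0, 1)."""
    u = rng.gauss(0.0, 1.0) * levy_sigma(beta)
    v = rng.gauss(0.0, 1.0)
    while v == 0.0:  # avoid division by zero
        v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def levy_flight(position, rng, scale=0.01):
    """Assumed form: add a scaled Levy step to every dimension."""
    return [x + scale * levy_step(rng) for x in position]


def perturb_leader(position, rng):
    """Step 8: random number in [0, 0.5] -> Cauchy mutation, otherwise Levy flight."""
    if rng.random() <= 0.5:
        return cauchy_mutation(position, rng)
    return levy_flight(position, rng)


def keep_better(old, new, fitness_fn):
    """Assumed update rule: keep the position with the smaller fitness."""
    return new if fitness_fn(new) < fitness_fn(old) else old


rng = random.Random(42)
alpha = [0.2, 0.8, 0.5]
candidate = perturb_leader(alpha, rng)
# `sum` is only a stand-in for the real fitness function.
alpha = keep_better(alpha, candidate, fitness_fn=sum)
```

Both distributions are heavy-tailed, so occasional large jumps move a leader far from its current position, which is what allows the pack to leave a local optimum.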

Step 9: The model of the grey wolves encircling the prey is

$$X(t+1) = X_p(t) - A \circ D$$

where $X_p(t)$ is the position vector of the prey in the current generation, which refers to the positions of the top three wolves; $A$ is the convergence factor, $C$ is the swing factor, and $D$ is the distance between wolves:

$$A = 2a \circ r_1 - a, \qquad C = 2r_2, \qquad D = |C \circ X_p(t) - X(t)|$$

Here $\circ$ denotes the Hadamard product, $a$ decreases nonlinearly from 2 to 0 over the course of the iterations, and $r_1$ and $r_2$ are random vectors in $[0, 1]$.

Step 10: In each iteration, keep the best three wolves in the current population and calculate the positions of the candidate wolves. The positions of the next-generation grey wolf population are obtained using the following formulas:

$$D_\alpha = |C_1 \circ X_\alpha(t) - X(t)|, \quad D_\beta = |C_2 \circ X_\beta(t) - X(t)|, \quad D_\gamma = |C_3 \circ X_\gamma(t) - X(t)|$$

$$X_1 = X_\alpha - A_1 \circ D_\alpha, \quad X_2 = X_\beta - A_2 \circ D_\beta, \quad X_3 = X_\gamma - A_3 \circ D_\gamma$$

where $X_\alpha$, $X_\beta$, and $X_\gamma$ are the best three wolves in the current population, and $D_\alpha$, $D_\beta$, and $D_\gamma$ are the distances between the current candidate grey wolf and the three best wolves.
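The encircling model of step 9 and the leader-guided update of step 10 can be sketched together in Python. Two parts of the sketch are assumptions: the quadratic schedule in `convergence_factor` is only one hypothetical way to decrease a nonlinearly from 2 to 0, and combining the three candidates by their arithmetic mean is the rule of the standard grey wolf optimizer, used here as a stand-in. All function names are illustrative.

```python
import random


def convergence_factor(t, max_iter):
    """Hypothetical nonlinear decay of a from 2 (t = 0) to 0 (t = max_iter)."""
    return 2.0 * (1.0 - (t / max_iter) ** 2)


def candidate(x, leader, a, rng):
    """X_k = X_leader - A * D, with D = |C * X_leader - X|, per dimension."""
    result = []
    for xi, li in zip(x, leader):
        A = 2.0 * a * rng.random() - a   # A = 2a * r1 - a
        C = 2.0 * rng.random()           # C = 2 * r2
        D = abs(C * li - xi)
        result.append(li - A * D)
    return result


def update_position(x, alpha, beta, gamma, a, rng):
    """Move one wolf toward the three leaders (mean of X1, X2, X3 assumed)."""
    x1 = candidate(x, alpha, a, rng)
    x2 = candidate(x, beta, a, rng)
    x3 = candidate(x, gamma, a, rng)
    return [(p + q + r) / 3.0 for p, q, r in zip(x1, x2, x3)]


rng = random.Random(0)
a = convergence_factor(t=10, max_iter=50)
new_x = update_position([0.1, 0.9], [0.4, 0.6], [0.5, 0.5], [0.3, 0.7], a, rng)
```

As a approaches 0, A shrinks toward 0 and each candidate collapses onto its leader, so late iterations search only near the three best feature subsets.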

Step 11: Apply Cauchy mutation perturbation to all grey wolves.

Step 12: Repeat steps 3 to 11 until the maximum number of iterations is reached.

Step 13: During kNN testing, extract the test set data using the obtained optimal feature subset and perform detection and analysis.

Compared with the prior art, the invention has the following beneficial effects:

When the grey wolf population is initialized, a Logistic chaotic map is used, which improves the quality of the initial population and the exploitation capability of the algorithm. Perturbation is applied when the grey wolf population is updated, which compensates to some extent for the tendency to fall into local optima and improves the global search capability. Finally, an optimal feature subset with a small number of features is obtained through training and applied to the test set, yielding a higher detection accuracy.
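A minimal Python sketch of population initialization with a Logistic chaotic map follows. It assumes the usual map z ← μ·z·(1 − z) with μ = 4 and a random starting value per wolf; the parameter value, the seeding scheme, and the function name `logistic_init` are assumptions made for illustration.

```python
import random


def logistic_init(n_wolves, dim, mu=4.0, seed=None):
    """Generate n_wolves positions in [0, 1]^dim by iterating the logistic map."""
    rng = random.Random(seed)
    population = []
    for _ in range(n_wolves):
        z = rng.uniform(0.01, 0.99)    # random starting point of the chaotic orbit
        wolf = []
        for _ in range(dim):
            z = mu * z * (1.0 - z)     # logistic map iteration
            wolf.append(z)
        population.append(wolf)
    return population


# N = 30 wolves and 41 features, as in the parameter settings of the method.
population = logistic_init(n_wolves=30, dim=41, seed=0)
print(len(population), len(population[0]))  # 30 41
```

Every component stays in [0, 1], so the positions can be thresholded at 0.5 directly to obtain binary feature masks.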

Drawings

FIG. 1 is a schematic flow chart of the method of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and the detailed description.

FIG. 1 is a schematic flow chart of the intrusion detection feature selection method based on the Cauchy mutation grey wolf optimization algorithm according to the present invention. The specific implementation steps are as follows:

1. Data are randomly drawn from KDD Cup99, with 80% used as the training set and 20% as the test set.

2. Set all parameters used by the grey wolf optimization algorithm, and initialize the grey wolf population using the Logistic chaotic map.

3. Substitute the feature selection schemes corresponding to the initial grey wolf population into the kNN model, and perform feature extraction and training on the training set to obtain the accuracy and the error rate error = 1 − Accuracy. Taking into account the number of features contained in each feature subset, calculate the fitness value of each scheme, and sort the values to obtain the best three wolves as the α, β, and γ wolves.

4. Apply Lévy flight or Cauchy mutation perturbation to the three leader wolves, calculate the fitness values after the perturbation, compare them, and keep the best result.

5. Update the positions of all grey wolves using the perturbed α, β, and γ wolves.

6. Apply Cauchy mutation perturbation to the current grey wolf population to obtain the next-generation grey wolf population.

7. Calculate the fitness values of the population and update the positions of the top three wolves.

8. Check whether the maximum number of iterations has been reached. If not, return to step 4 and perform each step in sequence; once the maximum number of iterations is reached, exit the loop, completing the search for the optimal feature subset.

9. Using the obtained optimal feature subset, perform feature extraction on the data in the KDD Cup99 test set and run detection with kNN to obtain the test accuracy.
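The final evaluation can be illustrated with a small pure-Python kNN sketch that classifies test samples using only the features selected by a binary mask. The Euclidean distance, the majority vote, the toy data, and the function names are assumptions for illustration; the method itself specifies kNN with k = 5 on KDD Cup99.

```python
import math
from collections import Counter


def project(row, mask):
    """Keep only the feature values whose mask bit is 1."""
    return [v for v, bit in zip(row, mask) if bit]


def knn_predict(train_X, train_y, query, k=5):
    """Majority vote among the k training samples nearest to the query."""
    nearest = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_X, train_y, test_X, test_y, mask, k=5):
    """Test accuracy of kNN restricted to the selected feature subset."""
    reduced_train = [project(row, mask) for row in train_X]
    correct = sum(
        knn_predict(reduced_train, train_y, project(row, mask), k) == label
        for row, label in zip(test_X, test_y)
    )
    return correct / len(test_y)


# Toy data with two features; the mask [1, 0] keeps only the first feature.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["normal"] * 3 + ["attack"] * 3
test_X = [[0.2, 0.1], [5.5, 5.2]]
test_y = ["normal", "attack"]
print(knn_accuracy(train_X, train_y, test_X, test_y, mask=[1, 0], k=3))  # 1.0
```

In this toy case a single feature already separates the two classes, which is the situation feature selection aims to exploit: fewer features with no loss of accuracy.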

Claims

1. An intrusion detection feature selection method based on an improved grey wolf optimization algorithm, characterized by comprising the following steps:

step 1: taking 80% of the KDD Cup99 data set as a training set and 20% as a test set; choosing kNN as the intrusion detection classifier, with k = 5, where k denotes the number of nearest neighbors and each sample is represented by its k nearest neighbors;

step 2: setting the parameters of the improved grey wolf algorithm; the pack size N is 30 and the maximum number of iterations T is 50; initializing the population, wherein the position of each wolf is represented by a vector X whose dimension equals the number of features of the selected data set, 41, and the first-generation wolf pack is generated through initialization;

step 3: converting all positions of the grey wolf population to binary using 0.5 as the threshold, wherein in each dimension a value greater than 0.5 is set to 1, indicating that the feature represented by that dimension is selected, and otherwise is set to 0, indicating that the feature is not selected;

step 4: applying the feature selection scheme corresponding to each binary grey wolf position vector to the kNN training set to select the corresponding feature data;

step 5: training on the feature data obtained in step 4 to obtain the accuracy

$$\mathrm{Accuracy} = \frac{TP}{TP + FP}$$

wherein TP represents the number of correct detections, FP represents the number of erroneous detections, and the error rate is $\mathrm{error} = 1 - \mathrm{Accuracy}$;

step 6: calculating the fitness of the feature subset

$$\mathrm{fitness} = a \cdot \mathrm{error} + b \cdot \frac{m}{n}$$

wherein a and b represent weights, a is typically 0.99, b is typically 0.01, m represents the number of features in the feature subset, and n represents the total number of features; this formula is used as the fitness value because feature selection must take two factors into account, accuracy and feature subset length;

step 7: sorting the fitness values of the feature subset schemes calculated in step 6 and designating the wolves with the three smallest values as the three leader wolves α, β, and γ; the α wolf is the head wolf, the wolf with the best fitness, and corresponds to the current best feature subset; the β wolf has the second-best fitness, is subordinate to the head wolf, and corresponds to the current second-best feature subset; the γ wolf has the third-best fitness, is subordinate to the α and β wolves, and corresponds to the current third-best feature subset;

step 8: generating a random number in [0, 1) to determine the perturbation strategy for the α, β, and γ wolves; with 0.5 as the boundary, executing step 8.1 if the random number lies in [0, 0.5], and otherwise executing step 8.2;

wherein $u = w \cdot \sigma$, $v$ and $w$ are random numbers following the standard normal distribution, and $\Gamma(x) = (x-1)!$;

step 8.3: calculating the corresponding fitness values from the result of step 8.1 or 8.2, and storing and updating the positions and fitness values of the three wolves;

step 9: the model of the grey wolves encircling the prey is

$$X(t+1) = X_p(t) - A \circ D$$

wherein $X_p(t)$ represents the position vector of the prey in the current generation, which refers to the positions of the top three wolves, $A$ is the convergence factor, $C$ is the swing factor, and $D$ represents the distance between wolves,

$$A = 2a \circ r_1 - a, \qquad C = 2r_2, \qquad D = |C \circ X_p(t) - X(t)|$$

$\circ$ represents the Hadamard product, $a$ decreases nonlinearly from 2 to 0 over the course of the iterations, and $r_1$ and $r_2$ are random vectors in $[0, 1]$;

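step 10: in each iteration, keeping the best three wolves in the current population and calculating the positions of the candidate wolves; obtaining the positions of the next-generation grey wolf population using the following formulas:

$$D_\alpha = |C_1 \circ X_\alpha(t) - X(t)|, \quad D_\beta = |C_2 \circ X_\beta(t) - X(t)|, \quad D_\gamma = |C_3 \circ X_\gamma(t) - X(t)|$$

$$X_1 = X_\alpha - A_1 \circ D_\alpha, \quad X_2 = X_\beta - A_2 \circ D_\beta, \quad X_3 = X_\gamma - A_3 \circ D_\gamma$$

wherein $X_\alpha$, $X_\beta$, and $X_\gamma$ represent the best three wolves in the current population, and $D_\alpha$, $D_\beta$, and $D_\gamma$ represent the distances between the current candidate grey wolf and the three best wolves;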

step 11: applying Cauchy mutation perturbation to all grey wolves;

step 12: repeating steps 3 to 11 until the maximum number of iterations is reached;

step 13: during kNN testing, extracting the test set data using the obtained optimal feature subset and performing detection and analysis.