CN115296837A - SSA optimization-based sustainable integrated intrusion detection method

Info

Publication number: CN115296837A
Application number: CN202210721435.XA
Authority: CN
Inventors: 杨忠君; 刘志
Original assignee: Shenyang University of Chemical Technology
Current assignee: Shenyang University of Chemical Technology
Priority date: 2022-06-24
Filing date: 2022-06-24
Publication date: 2022-11-04
Anticipated expiration: 2042-06-24
Also published as: CN115296837B

Abstract

A sustainable integrated intrusion detection method based on SSA optimization, relating to network intrusion detection methods. The method comprises the following steps: a standard intrusion detection data set is selected and divided into a training set and a test set; the data are preprocessed, and SSA is used to search the preprocessed data for the feature subset that maximizes the classification performance of each model; the different models are then trained on the training set restricted to their corresponding feature subsets, and their predictions are combined through an adaptive integrated decision process; finally, the method is tested on the test set. The invention addresses two problems of current machine-learning-based network intrusion detection methods: difficulty in classifying complex multi-class traffic data, and difficulty in obtaining a feature subset that optimizes the model. The invention can effectively detect complex multi-class traffic data, achieves higher detection accuracy than traditional intrusion detection methods, and is sustainably integrable, in that new ML models can be continuously integrated to optimize the existing model.

Description

SSA optimization-based sustainable integrated intrusion detection method

Technical Field

The invention relates to the field of network intrusion detection, in particular to a sustainable integrated intrusion detection method based on SSA optimization.

Background

With the increasing intelligence and digitization of society, network intrusion incidents occur from time to time and seriously threaten the property security of enterprises and individuals. Intrusion detection is an active defense technology applied in intrusion detection systems; by continuously monitoring the traffic on key network links, it can effectively detect network intrusion behavior.

Traditional intrusion detection techniques fall mainly into two types: signature-based (misuse) intrusion detection and anomaly-based intrusion detection. Signature-based intrusion detection performs pattern matching on the distinctive features carried by attack behavior to judge whether traffic is abnormal. However, this approach can only detect currently known attack types and tends to produce a high false negative rate. Anomaly-based intrusion detection judges whether traffic behavior is abnormal by building a model of normal behavior; its advantage is that unknown attacks can be discovered, but it tends to produce a high false alarm rate.

To address these problems of traditional intrusion detection, data-driven machine learning (ML) has been introduced into the intrusion detection field. An ML model can directly mine the behavior patterns of normal and abnormal traffic, which alleviates the problems of traditional intrusion detection to a certain extent.

However, on a multi-class problem a single ML classification model often cannot detect all classes effectively, and ensemble learning (EL), which combines the classification strengths of multiple ML models, can effectively alleviate this problem. The idea of EL is to learn multiple models from the data, explicitly or implicitly, and combine them effectively to obtain more reliable and accurate predictions. Training a more reliable and accurate EL model requires two preconditions: base classifiers that are both accurate and diverse, and an effective integration strategy.

Feature selection removes redundant and irrelevant features, thereby improving the performance of the base classifiers. The Salp Swarm Algorithm (SSA) is a swarm optimization algorithm that is widely applied in feature selection and engineering optimization. Weighted hard voting is a simple and effective strategy for integrating heterogeneous classifiers, and carefully calibrated weights are often more competitive than other integration strategies.

Disclosure of Invention

The invention provides a sustainable integrated intrusion detection method based on SSA optimization. The method first searches, through SSA-based wrapper feature selection, for the optimal feature subset corresponding to each machine learning model. It then trains the corresponding machine learning models using the different optimal feature subsets, and finally integrates the prediction results of the multiple machine learning models by multi-class weighted hard voting, with the voting weights optimized by SSA (Salp Swarm Algorithm), so that the classification strengths of the different ML (machine learning) models are effectively combined and more accurate and reliable predictions are obtained. In addition, the method is sustainably integrable: new ML models can be continuously integrated to optimize the existing model.

The technical scheme of the invention is as follows:

a sustainable integrated intrusion detection method based on SSA optimization, the method comprising the following steps:

step (1): inputting a reference data set; taking the NSL-KDD data set as an example, the data set comprises normal traffic and four different types of attack traffic, namely DoS, Probe, U2R and R2L;

step (2): preprocessing the data set; this comprises three parts: data cleaning, feature encoding and data normalization; data cleaning removes duplicate samples and samples containing missing or abnormal values from the reference data set; feature encoding converts the character-type discrete features in the reference data set into numerical features so that they can be fed to the subsequent machine learning models; data normalization eliminates the scale differences between features;

step (3): feature selection; searching, through the SSA-based wrapper feature selection algorithm, for the optimal feature subset corresponding to each ML model, namely the feature subset with the best fitness value;

step (4): model classification; training a plurality of heterogeneous machine learning classification models using the reference data set after feature selection;

step (5): adaptive integrated decision making; the predictions of the multiple ML models are integrated by multi-class weighted hard voting, with the corresponding voting weights determined and optimized by the SSA-based weight optimization algorithm.

The sustainable integrated intrusion detection method based on SSA optimization, wherein the reference data set used in step (1) is as follows: the original NSL-KDD data set contains 148517 samples in total; 30% of the samples are drawn for testing by stratified sampling and the remaining 70% are used for training, so that the proportion of samples of each class in the training set is consistent with that in the test set.
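The 70/30 split that keeps class proportions equal in both sets can be illustrated with a small sketch. The helper name `stratified_split`, the fixed seed, and the toy label counts are assumptions made for illustration; they are not taken from the patent.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio=0.3, seed=0):
    """Return (train_idx, test_idx) so that each class keeps the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within one class only
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels standing in for traffic classes; the counts are made up.
labels = ["normal"] * 50 + ["DoS"] * 30 + ["Probe"] * 10 + ["R2L"] * 10
train_idx, test_idx = stratified_split(labels)
```

Splitting inside each class, rather than over the whole data set, is what keeps rare attack classes represented in the test set.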

The sustainable integrated intrusion detection method based on SSA optimization, wherein in the feature encoding part of step (2): the original NSL-KDD data set contains three discrete character-type features, namely 'protocol-type', 'service' and 'flag'; 'protocol-type' has 3 states, 'service' has 70 states, and 'flag' has 11 states; one-hot encoding is applied to the 'protocol-type' feature, expanding it into three dimensions; for the 'service' and 'flag' features, which have more states, each state is replaced by its frequency count; the encoded data set contains 43 features in total.
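The two encodings can be sketched as follows. The sketch assumes that "frequency count" means the raw number of occurrences of a state in the data (a relative frequency would be an equally plausible reading), and the sample values are illustrative only.

```python
from collections import Counter

def one_hot(values, categories):
    """Expand one categorical column into len(categories) binary columns."""
    return [[1 if value == category else 0 for category in categories] for value in values]

def frequency_encode(values):
    """Replace each state by the number of times it occurs in the column (assumed reading)."""
    counts = Counter(values)
    return [counts[value] for value in values]

protocol = ["tcp", "udp", "tcp", "icmp"]   # 3 states -> 3 binary columns
service = ["http", "http", "ftp", "http"]  # many states -> 1 numeric column

protocol_encoded = one_hot(protocol, ["tcp", "udp", "icmp"])
service_encoded = frequency_encode(service)
```

Frequency encoding keeps 'service' and 'flag' as one column each, which is why the encoded data set has only two features more than the raw one.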

The sustainable integration intrusion detection method based on SSA optimization, wherein in the data normalization part of the step (2): data was scaled to interval [0,1] using a minimum-maximum function, with the specific normalization:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

wherein $x$ represents the original value of a feature, $x_{\max}$ and $x_{\min}$ respectively represent the maximum and minimum values of that feature, and $x'$ represents the normalized feature value.
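A minimal sketch of min-max scaling for one feature column follows. Mapping a constant column to 0.0 is an assumption added to avoid division by zero; the text does not say how such a column is handled.

```python
def min_max_scale(column):
    """Scale one feature column to [0, 1] using its own minimum and maximum."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

scaled = min_max_scale([2.0, 4.0, 10.0])
```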

The sustainable integrated intrusion detection method based on SSA optimization, wherein the modeling process of the SSA-based wrapper feature selection algorithm in step (3) is as follows:

(1) Setting a fitness function:

wherein acc and F1 respectively denote the mean overall accuracy and the mean weighted F1 score of the model under 5-fold cross-validation on the training set;

(2) Setting parameters; the population size is set to 30, the maximum number of iterations to 200, the upper search bound to 1 and the lower search bound to 0;

(3) Initializing the population; the positions of the individual salps in the population are randomly initialized within the search bounds;

(4) Position encoding; the position of each individual in the salp population is binary-encoded to suit the feature selection problem, where 1 indicates that a feature is selected and 0 indicates that it is not selected; the specific encoding formula is as follows:

；

note that the encoding here is used only to calculate the fitness value; the positions of the individual salps in the population are not changed;

(5) Determining the food location; the fitness value of each individual salp is calculated, the salp with the maximum fitness value is identified, and its position is set as the food position;

(6) Searching the population; the positions of the leader and the followers are updated according to the population updating formulas; in the salp population, the first individual acts as the leader, and its position updating formula is as follows:

$$x_j^1 = \begin{cases} F_j + c_1\left((ub_j - lb_j)\,c_2 + lb_j\right), & c_3 \ge 0.5 \\ F_j - c_1\left((ub_j - lb_j)\,c_2 + lb_j\right), & c_3 < 0.5 \end{cases}$$

wherein $x_j^1$ represents the position of the leader in the $j$-th dimension, $F_j$ represents the position of the food in the $j$-th dimension, and $ub_j$ and $lb_j$ are respectively the upper and lower bounds of the $j$-th dimension decision variable; $c_2$ and $c_3$ are random numbers in $[0,1]$; $c_1$ is the convergence factor of the algorithm, which balances global exploration and local exploitation, and is expressed as

$$c_1 = 2e^{-\left(\frac{4l}{L}\right)^2}$$

where $l$ and $L$ respectively represent the current iteration number and the maximum iteration number;

the other individuals act as followers, and their position updating formula is as follows:

$$x_j^{i\prime} = \frac{1}{2}\left(x_j^i + x_j^{i-1}\right)$$

wherein $x_j^{i\prime}$ represents the updated position of the $i$-th individual in the $j$-th dimension, $x_j^i$ represents the current position of that individual, and $x_j^{i-1}$ represents the position of the previous individual;

(7) Repeating (4) - (6) until a maximum number of iterations is reached.
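Steps (1) to (7) can be sketched as a short search loop. The sketch assumes the commonly used SSA leader and follower updates, a binarization threshold of 0.5, and a caller-supplied fitness function; `toy_fitness` and the `useful` mask are hypothetical stand-ins for the cross-validated score of a real classifier.

```python
import math
import random

def binarize(position, threshold=0.5):
    """Map a continuous position in [0, 1] to a 0/1 feature mask (threshold is an assumption)."""
    return [1 if value > threshold else 0 for value in position]

def ssa_feature_selection(fitness, dim, pop_size=30, max_iter=200, lb=0.0, ub=1.0, seed=0):
    rng = random.Random(seed)
    # Step (3): random initial positions inside the search bounds.
    pop = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(pop_size)]
    food, food_fit = None, -math.inf
    for it in range(1, max_iter + 1):
        # Steps (4)-(5): score each binary mask; the best position so far is the food.
        for salp in pop:
            fit = fitness(binarize(salp))
            if fit > food_fit:
                food, food_fit = salp[:], fit
        # Convergence factor: close to 2 early on, decaying toward 0.
        c1 = 2.0 * math.exp(-((4.0 * it / max_iter) ** 2))
        # Step (6): the leader moves around the food; each follower averages with its predecessor.
        for i in range(pop_size):
            for j in range(dim):
                if i == 0:
                    step = c1 * ((ub - lb) * rng.random() + lb)
                    pop[i][j] = food[j] + step if rng.random() >= 0.5 else food[j] - step
                else:
                    pop[i][j] = (pop[i][j] + pop[i - 1][j]) / 2.0
                pop[i][j] = min(ub, max(lb, pop[i][j]))  # keep positions inside the bounds
    return binarize(food), food_fit

# Hypothetical fitness: share of positions that agree with a made-up "useful" mask.
useful = [1, 0, 1, 0, 0, 1]

def toy_fitness(mask):
    return sum(1 for m, u in zip(mask, useful) if m == u) / len(useful)

best_mask, best_fit = ssa_feature_selection(toy_fitness, dim=6, max_iter=50)
```

The continuous positions are never overwritten by their binary masks, matching the note that encoding is used only to evaluate fitness.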

The sustainable integrated intrusion detection method based on SSA optimization, wherein the model classification part in step (4) is coupled with feature selection: the SSA-based feature selection algorithm selects a corresponding optimal feature subset for each machine learning model.

The sustainable integrated intrusion detection method based on SSA optimization, wherein the model classification part in step (4) can integrate multiple different ML models at the same time, and a new ML model can be added to the existing ensemble to improve its classification performance; this requires only selecting the corresponding optimal feature subset for the new ML model and re-optimizing the voting weights, which gives the method a degree of generality and extensibility.
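One way to picture sustainable integration is as bookkeeping on the ensemble: a new model brings its own feature mask and one extra row of voting weights, while existing entries are left as they are until the weights are re-optimized. The dictionary layout, the function name `integrate_new_model`, and the starting weight of 0.5 are all assumptions of this sketch.

```python
def integrate_new_model(ensemble, name, feature_mask, n_classes, initial_weight=0.5):
    """Return a copy of the ensemble with one more model and one more weight row."""
    return {
        "models": ensemble["models"] + [name],
        "masks": ensemble["masks"] + [feature_mask],
        # Existing rows are copied unchanged; the new row is a placeholder for re-optimization.
        "weights": [row[:] for row in ensemble["weights"]] + [[initial_weight] * n_classes],
    }

ensemble = {
    "models": ["DT", "RF"],
    "masks": [[1, 0, 1], [1, 1, 0]],       # made-up feature masks
    "weights": [[0.9, 0.2], [0.3, 0.8]],   # made-up weights, 2 models x 2 classes
}
extended = integrate_new_model(ensemble, "XGBoost", [0, 1, 1], n_classes=2)
```

The extended weight matrix would then be passed back to the weight optimization step, so existing base classifiers do not need to be retrained.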

The sustainable integrated intrusion detection method based on SSA optimization, wherein in step (5): the adaptive integrated decision making process combines predictions of multiple ML models in a multi-class weighted hard voting manner; the specific decision making process is as follows:

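suppose there are $n$ different base classifiers $C_1, C_2, \dots, C_n$ and the reference data set has $m$ category labels $y_1, y_2, \dots, y_m$; then the weight matrix can be represented as

$$W = \left(w_{ij}\right)_{n \times m}$$

wherein $w_{ij} \in [0,1]$, $i = 1, 2, \dots, n$, $j = 1, 2, \dots, m$;

for a certain sample $x$, the weighted probability output for class $y_j$ is

$$P_j(x) = \sum_{i=1}^{n} w_{ij}\, p_{ij}(x)$$

wherein $P_j(x)$ represents the weighted-sum probability of the $j$-th class, and $p_{ij}(x)$ represents the prediction of the $i$-th base classifier for class $y_j$; the integrated probability prediction of all base classifiers can be represented as an $m$-dimensional vector $P(x) = \left(P_1(x), P_2(x), \dots, P_m(x)\right)$; the final decision can be expressed as

$$\hat{y}(x) = \arg\max_{y_j} P_j(x)$$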
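The decision rule can be illustrated with a small sketch. It assumes hard voting in the usual sense, where each base classifier contributes its weight only to the class it predicts, and that ties go to the lowest class index; the weight values are made up.

```python
def weighted_hard_vote(predictions, weights):
    """predictions[i] is the class index chosen by classifier i; weights[i][j] is its weight for class j."""
    n_classes = len(weights[0])
    scores = [0.0] * n_classes
    for i, predicted in enumerate(predictions):
        scores[predicted] += weights[i][predicted]   # vote counts only for the predicted class
    decision = max(range(n_classes), key=lambda j: scores[j])
    return decision, scores

# 3 classifiers x 3 classes, illustrative weights only.
weights = [
    [0.9, 0.2, 0.4],
    [0.3, 0.8, 0.6],
    [0.5, 0.5, 0.7],
]
label, scores = weighted_hard_vote([0, 1, 1], weights)
```

Here the first classifier votes for class 0 with weight 0.9, but the other two vote for class 1 with a combined weight of 1.3, so class 1 wins. Per-class weights let a model that is strong on one attack type dominate only for that type.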

The sustainable integrated intrusion detection method based on SSA optimization, wherein the modeling process of the weight optimization algorithm based on SSA in step (5) is as follows:

a. setting a fitness function:

$$fitness = acc$$

wherein acc represents the mean overall accuracy of the model under 5-fold cross-validation on the training set;

b. setting parameters; the population size is set to 30, the maximum number of iterations to 200, the upper search bound to 1 and the lower search bound to 0;

c. initializing the population; the positions of the individual salps in the population are randomly initialized within the search bounds, wherein the number of elements of the position vector represented by each salp equals the number of elements of the one-dimensional vector obtained by expanding the weight matrix column by column;

d. determining the food location; the fitness values of all salps are calculated, and the position of the salp with the maximum fitness value is taken as the food position;

e. searching the population; the positions of the leader and the followers are updated according to the population updating formulas; in the salp population, the first individual acts as the leader, and its position updating formula is as follows:

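$$x_j^1 = \begin{cases} F_j + c_1\left((ub_j - lb_j)\,c_2 + lb_j\right), & c_3 \ge 0.5 \\ F_j - c_1\left((ub_j - lb_j)\,c_2 + lb_j\right), & c_3 < 0.5 \end{cases}$$

wherein $x_j^1$ represents the position of the leader in the $j$-th dimension, $F_j$ represents the position of the food in the $j$-th dimension, and $ub_j$ and $lb_j$ are respectively the upper and lower bounds of the $j$-th dimension decision variable; $c_2$ and $c_3$ are random numbers in $[0,1]$; $c_1$ is the convergence factor of the algorithm, expressed as

$$c_1 = 2e^{-\left(\frac{4l}{L}\right)^2}$$

where $l$ and $L$ respectively represent the current iteration number and the maximum iteration number;

the other individuals act as followers, and their position updating formula is as follows:

$$x_j^{i\prime} = \frac{1}{2}\left(x_j^i + x_j^{i-1}\right)$$

wherein $x_j^{i\prime}$ represents the updated position of the $i$-th individual in the $j$-th dimension, $x_j^i$ represents the current position of that individual, and $x_j^{i-1}$ represents the position of the previous individual;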

f. repeating d-e until a maximum number of iterations is reached.
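Item c ties the length of each salp's position vector to the weight matrix expanded column by column. A minimal sketch of that mapping and its inverse follows; the function names are hypothetical, and "column extension" is read here as stacking the columns one after another.

```python
def matrix_to_vector(weights):
    """Stack the columns of an n x m weight matrix into one vector of length n * m."""
    n_models, n_classes = len(weights), len(weights[0])
    return [weights[i][j] for j in range(n_classes) for i in range(n_models)]

def vector_to_matrix(vector, n_models, n_classes):
    """Inverse mapping: rebuild the n x m weight matrix from a salp position vector."""
    return [[vector[j * n_models + i] for j in range(n_classes)] for i in range(n_models)]

weights = [[1, 2, 3],
           [4, 5, 6]]               # 2 models x 3 classes, placeholder values
position = matrix_to_vector(weights)
restored = vector_to_matrix(position, n_models=2, n_classes=3)
```

With 3 base classifiers and the 5 NSL-KDD classes, each salp would therefore carry a 15-element position vector.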

The invention has the following beneficial effects:

In the sustainable integrated intrusion detection method based on SSA optimization, SSA-based wrapper feature selection removes redundant and irrelevant features from the original data, which enhances the classification performance of each individual ML model; the decisions of the multiple ML models are then integrated by multi-class weighted hard voting, and the voting weights are continuously optimized by the SSA-based weight optimization algorithm, so that the classification strengths of the different models are fully combined and the overall classification performance of the intrusion detection model is effectively improved. The method also provides an effective way of integrating different ML models: new ML models can be continuously integrated into the intrusion detection model, so that it is continuously optimized and its overall detection performance improves.

Drawings

FIG. 1 is a block diagram of an overall modeling flow of an embodiment of the present invention;

FIG. 2 is a flow chart of the SSA-based wrapper feature selection process according to an embodiment of the present invention;

FIG. 3 is a pseudo-code diagram of the SSA-based wrapper feature selection algorithm according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of multi-class weighted voting modeling according to an embodiment of the present invention;

FIG. 5 is a pseudo-code diagram of the SSA-based weight optimization algorithm according to an embodiment of the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings so that those skilled in the art can refer thereto and implement the same.

The invention provides a sustainable integrated intrusion detection method based on SSA optimization, which comprises the following steps:

In step (1): the public intrusion detection data set NSL-KDD is selected as the evaluation data; the data set contains 148517 samples in total, of which 30% are drawn for testing by stratified sampling and the remaining 70% are used for training, so that the proportion of samples of each class in the training set is consistent with that in the test set.

In step (2): the original NSL-KDD data set contains three discrete character-type features, namely 'protocol-type', 'service' and 'flag'. One-hot encoding is applied to the 'protocol-type' feature, expanding it into three dimensions. For the 'service' and 'flag' features, which have more states, each state is replaced by its frequency count. The encoded data set contains 43 features in total. Next, to eliminate the scale differences between features, the data are normalized so that the values of all features are scaled to the interval [0,1]; the specific normalization formula is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

wherein $x$ represents the original value of a feature, $x_{\max}$ and $x_{\min}$ respectively represent the maximum and minimum values of that feature, and $x'$ represents the normalized feature value.

In step (3), the SSA-based wrapper feature selection algorithm is used to search for the optimal feature subset corresponding to each machine learning model; the specific steps of the algorithm are as follows:

(1) Setting a fitness function:

wherein acc and F1 respectively denote the mean overall accuracy and the mean F1 score of the model under 5-fold cross-validation on the training set;

(2) Setting parameters. The population size is set to 30, the maximum number of iterations to 200, the upper search bound to 1 and the lower search bound to 0.

(3) Initializing the population. The positions of the individual salps in the population are randomly initialized within the search bounds.

(4) Position encoding. The position of each individual in the salp population is binary-encoded to suit the feature selection problem, where 1 indicates that a feature is selected and 0 indicates that it is not selected. The specific encoding formula is as follows:

；

Note that the encoding here is used only to calculate the fitness value; the positions of the individual salps in the population are not changed.

(5) Determining the food location. The fitness value of each individual salp is calculated, the salp with the maximum fitness value is identified, and its position is set as the food position.

(6) Searching the population. The positions of the leader and the followers are updated according to the population updating formulas. In the salp population, the first individual acts as the leader, and its position updating formula is as follows:

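$$x_j^1 = \begin{cases} F_j + c_1\left((ub_j - lb_j)\,c_2 + lb_j\right), & c_3 \ge 0.5 \\ F_j - c_1\left((ub_j - lb_j)\,c_2 + lb_j\right), & c_3 < 0.5 \end{cases}$$

wherein $x_j^1$ represents the position of the leader in the $j$-th dimension, $F_j$ represents the position of the food in the $j$-th dimension, and $ub_j$ and $lb_j$ are respectively the upper and lower bounds of the $j$-th dimension decision variable. $c_2$ and $c_3$ are random numbers in $[0,1]$, and $c_1$ is the convergence factor of the algorithm, expressed as

$$c_1 = 2e^{-\left(\frac{4l}{L}\right)^2}$$

where $l$ and $L$ respectively represent the current iteration number and the maximum iteration number.

The other individuals act as followers, and their position updating formula is as follows:

$$x_j^{i\prime} = \frac{1}{2}\left(x_j^i + x_j^{i-1}\right)$$

wherein $x_j^{i\prime}$ represents the updated position of the $i$-th individual in the $j$-th dimension, $x_j^i$ represents the current position of that individual, and $x_j^{i-1}$ represents the position of the previous salp.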

(7) Repeating (4) - (6) until a maximum number of iterations is reached.
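The fitness in item (1) is built from accuracy and F1 score. The sketch below computes both on a single validation fold, assuming support-weighted averaging as the usual meaning of a weighted F1 score. How the two values are combined into one fitness value, and the averaging over the 5 folds, are deliberately left out because the text does not fix them here.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's true sample count."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count / len(y_true)
    return total

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
acc = accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
```

Weighting by class support matters on NSL-KDD-style data, where rare classes such as U2R would otherwise barely affect an unweighted accuracy figure.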

In step (4): the SSA-based wrapper feature selection algorithm is first used to search for the optimal feature subset corresponding to each ML model, and each ML model is then trained and evaluated using a training set that contains only its optimal feature subset.
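Restricting the training set to one model's optimal feature subset amounts to applying that model's binary mask to every sample. In this sketch the masks and feature values are hypothetical, and the model names are used only as dictionary keys.

```python
def apply_mask(rows, mask):
    """Keep only the feature columns whose mask entry is 1."""
    return [[value for value, keep in zip(row, mask) if keep] for row in rows]

X_train = [
    [0.1, 0.5, 0.9, 0.3],
    [0.2, 0.4, 0.8, 0.7],
]
# One mask per base classifier, as produced by the feature selection step (made-up values).
masks = {"DT": [1, 0, 1, 0], "RF": [1, 1, 0, 1]}

# Each model sees its own view of the same training samples.
views = {name: apply_mask(X_train, mask) for name, mask in masks.items()}
```

The same mask must be applied to the test samples before the corresponding model makes its prediction.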

In step (5): the adaptive integrated decision process combines the predictions of multiple ML models by multi-class weighted hard voting, with the corresponding voting weights determined and optimized by the SSA-based weight optimization algorithm. The specific decision process is as follows:

Suppose there are $n$ different base classifiers $C_1, C_2, \dots, C_n$ and the reference data set has $m$ category labels $y_1, y_2, \dots, y_m$; then the weight matrix can be represented as

$$W = \left(w_{ij}\right)_{n \times m}$$

wherein $w_{ij} \in [0,1]$.

For a certain sample $x$, the weighted probability output for class $y_j$ is

$$P_j(x) = \sum_{i=1}^{n} w_{ij}\, p_{ij}(x)$$

wherein $P_j(x)$ represents the weighted-sum probability of the $j$-th class, $p_{ij}(x)$ represents the prediction of the $i$-th base classifier for class $y_j$, and $w_{ij}$ is its voting weight. The integrated probability prediction of all base classifiers can be represented as an $m$-dimensional vector $P(x) = \left(P_1(x), P_2(x), \dots, P_m(x)\right)$. The final decision can be expressed as

$$\hat{y}(x) = \arg\max_{y_j} P_j(x)$$

The weight matrix in the weighted hard voting process is determined and optimized through a weight optimization algorithm based on SSA, and the specific modeling process is as follows:

a. setting a fitness function:

b. Setting parameters. The population size is set to 30, the maximum number of iterations to 200, the upper search bound to 1 and the lower search bound to 0.

c. Initializing the population. The positions of the individual salps in the population are randomly initialized within the search bounds, wherein the number of elements of the position vector represented by each salp equals the number of elements of the one-dimensional vector obtained by expanding the weight matrix column by column.

d. Determining the food location. The fitness values of all salps are calculated, and the position of the salp with the maximum fitness value is taken as the food position.

e. Searching the population. The positions of the leader and the followers are updated according to the population updating formulas. In the salp population, the first individual acts as the leader, and its position updating formula is as follows:

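$$x_j^1 = \begin{cases} F_j + c_1\left((ub_j - lb_j)\,c_2 + lb_j\right), & c_3 \ge 0.5 \\ F_j - c_1\left((ub_j - lb_j)\,c_2 + lb_j\right), & c_3 < 0.5 \end{cases}$$

wherein $x_j^1$ represents the position of the leader in the $j$-th dimension, $F_j$ represents the position of the food in the $j$-th dimension, and $ub_j$ and $lb_j$ are respectively the upper and lower bounds of the $j$-th dimension decision variable. $c_2$ and $c_3$ are random numbers in $[0,1]$, and $c_1$ is the convergence factor of the algorithm, expressed as

$$c_1 = 2e^{-\left(\frac{4l}{L}\right)^2}$$

where $l$ and $L$ respectively represent the current iteration number and the maximum iteration number.

The other individuals act as followers, and their position updating formula is as follows:

$$x_j^{i\prime} = \frac{1}{2}\left(x_j^i + x_j^{i-1}\right)$$

wherein $x_j^{i\prime}$ represents the updated position of the $i$-th individual in the $j$-th dimension, $x_j^i$ represents the current position of that individual, and $x_j^{i-1}$ represents the position of the previous salp.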

f. Repeating d-e until a maximum number of iterations is reached.
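What each salp is scored on in steps d and e can be sketched as the accuracy of the weighted vote that its position vector induces. The sketch makes three simplifying assumptions: accuracy is computed on one labelled set rather than averaged over 5 folds, the position vector is unpacked column by column, and ties go to the lowest class index. All predictions and weights are toy values.

```python
def vote(predictions, weights):
    """Weighted hard vote for one sample; predictions[i] is classifier i's class index."""
    scores = [0.0] * len(weights[0])
    for i, cls in enumerate(predictions):
        scores[cls] += weights[i][cls]
    return max(range(len(scores)), key=lambda j: scores[j])

def weight_fitness(position, model_preds, y_true, n_classes):
    """Accuracy of the ensemble whose weight matrix is encoded by one salp position."""
    n_models = len(model_preds)
    # Rebuild the n_models x n_classes matrix from the column-stacked position vector.
    weights = [[position[j * n_models + i] for j in range(n_classes)] for i in range(n_models)]
    correct = 0
    for s, truth in enumerate(y_true):
        sample_preds = [preds[s] for preds in model_preds]
        correct += vote(sample_preds, weights) == truth
    return correct / len(y_true)

model_preds = [[0, 1, 1, 0],   # classifier 1: wrong on the last two samples
               [0, 1, 0, 1]]   # classifier 2: right on all four
y_true = [0, 1, 0, 1]

uniform = weight_fitness([0.5, 0.5, 0.5, 0.5], model_preds, y_true, n_classes=2)
tuned = weight_fitness([0.1, 0.9, 0.1, 0.9], model_preds, y_true, n_classes=2)
```

The position that trusts the second classifier more scores higher, which is the signal the swarm follows when it moves toward the food position.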

To verify the beneficial effects of the method, three machine learning models with default parameters, namely Decision Tree (DT), Random Forest (RF) and eXtreme Gradient Boosting (XGBoost), were selected to implement the method; it was then evaluated using metrics such as accuracy, F1 score and detection time, and finally compared with the Particle Swarm Optimization (PSO) algorithm and the Grey Wolf Optimization (GWO) algorithm.

Table 1. Performance comparison of different optimization algorithms on the NSL-KDD test set

As shown in Table 1, applying a swarm optimization algorithm to the feature selection process effectively improves the accuracy and F1 score of the ML models, and the SSA used in the present invention obtains the highest accuracy and F1 score, outperforming PSO and GWO. After adaptive voting, the method effectively combines the classification strengths of the different ML models and obtains still higher accuracy and F1 score. In terms of detection time, the method also achieves a reduction of more than 30% compared with the other two methods.

In the sustainable integrated intrusion detection method based on SSA optimization, SSA is first used to select the optimal feature subset independently for each ML model, which enhances the classification performance of the base classifiers. The classification strengths of the different ML models are then combined through adaptive decision making, which effectively improves the classification performance of the intrusion detection model.

The foregoing description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and modifications and variations of the present invention are possible for those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the technical scheme and the conception of the invention shall be included in the protection scope of the invention.

Claims

1. A sustainable integrated intrusion detection method based on SSA optimization, characterized by comprising the following steps:

step (2): preprocessing the data set; this comprises three parts: data cleaning, feature encoding and data normalization; data cleaning removes duplicate samples and samples containing missing or abnormal values from the reference data set; feature encoding converts the character-type discrete features in the reference data set into numerical features so that they can be fed to the subsequent machine learning models; data normalization eliminates the scale differences between features;

2. The SSA optimization-based sustainable integrated intrusion detection method according to claim 1, wherein the reference data set used in step (1) is as follows: the original NSL-KDD data set contains 148517 samples in total; 30% of the samples are drawn for testing by stratified sampling and the remaining 70% are used for training, so that the proportion of samples of each class in the training set is consistent with that in the test set.

3. The SSA optimization-based sustainable integrated intrusion detection method according to claim 1, wherein in the feature encoding part of step (2): the original NSL-KDD data set contains three discrete character-type features, namely 'protocol-type', 'service' and 'flag'; 'protocol-type' has 3 states, 'service' has 70 states, and 'flag' has 11 states; one-hot encoding is applied to the 'protocol-type' feature, expanding it into three dimensions; for the 'service' and 'flag' features, which have more states, each state is replaced by its frequency count; the encoded data set contains 43 features in total.

4. The SSA optimization-based sustainable integrated intrusion detection method of claim 1, wherein in the data normalization part of step (2): the data is scaled to the interval [0,1] using a minimum-maximum function, with the specific normalization:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

wherein $x$ represents the original value of a feature, $x_{\max}$ and $x_{\min}$ respectively represent the maximum and minimum values of that feature, and $x'$ represents the normalized feature value.

5. The SSA optimization-based sustainable integrated intrusion detection method according to claim 1, wherein the modeling process of the SSA-based wrapper feature selection algorithm in step (3) is:

(1) Setting a fitness function:

(4) Position encoding; the position of each individual in the salp population is binary-encoded to suit the feature selection problem, wherein 1 indicates that a feature is selected and 0 indicates that it is not selected; the specific encoding formula is as follows:

；

note that the encoding here is used only to calculate the fitness value; the positions of the individual salps in the population are not changed;

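the position updating formula of the leader is:

$$x_j^1 = \begin{cases} F_j + c_1\left((ub_j - lb_j)\,c_2 + lb_j\right), & c_3 \ge 0.5 \\ F_j - c_1\left((ub_j - lb_j)\,c_2 + lb_j\right), & c_3 < 0.5 \end{cases}$$

wherein $x_j^1$ represents the position of the leader in the $j$-th dimension, $F_j$ represents the position of the food in the $j$-th dimension, and $ub_j$ and $lb_j$ are respectively the upper and lower bounds of the $j$-th dimension decision variable; $c_2$ and $c_3$ are random numbers in $[0,1]$; $c_1$ is the convergence factor of the algorithm, expressed as

$$c_1 = 2e^{-\left(\frac{4l}{L}\right)^2}$$

where $l$ and $L$ respectively represent the current iteration number and the maximum iteration number;

the position updating formula of the followers is:

$$x_j^{i\prime} = \frac{1}{2}\left(x_j^i + x_j^{i-1}\right)$$

wherein $x_j^{i\prime}$ represents the updated position of the $i$-th individual in the $j$-th dimension, $x_j^i$ represents the current position of that individual, and $x_j^{i-1}$ represents the position of the previous individual;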

(7) Repeating (4) - (6) until a maximum number of iterations is reached.

6. The SSA optimization-based sustainable integrated intrusion detection method according to claim 1, wherein the model classification part of step (4) is coupled with feature selection, and the SSA-based feature selection algorithm selects a corresponding optimal feature subset for each machine learning model.

7. The SSA optimization-based sustainable integrated intrusion detection method according to claim 1, wherein the model classification part in step (4) can integrate a plurality of different ML models simultaneously, and a new ML model can be added to the existing ensemble to improve its classification performance; this requires only selecting the corresponding optimal feature subset for the new ML model and re-optimizing the voting weights, which gives the method a degree of generality and extensibility.

8. The SSA-optimized sustainable integrated intrusion detection method according to claim 1, wherein in the step (5): the adaptive integrated decision making process combines predictions of multiple ML models in a multi-class weighted hard voting manner; the specific decision making process is as follows:

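suppose there are $n$ different base classifiers $C_1, C_2, \dots, C_n$ and the reference data set has $m$ category labels $y_1, y_2, \dots, y_m$; then the weight matrix can be represented as

$$W = \left(w_{ij}\right)_{n \times m}$$

wherein $w_{ij} \in [0,1]$, $i = 1, 2, \dots, n$, $j = 1, 2, \dots, m$;

for a certain sample $x$, the weighted probability output for class $y_j$ is

$$P_j(x) = \sum_{i=1}^{n} w_{ij}\, p_{ij}(x)$$

wherein $P_j(x)$ represents the weighted-sum probability of the $j$-th class, and $p_{ij}(x)$ represents the prediction of the $i$-th base classifier for class $y_j$; the integrated probability prediction of all base classifiers can be represented as an $m$-dimensional vector $P(x) = \left(P_1(x), P_2(x), \dots, P_m(x)\right)$; the final decision can be expressed as

$$\hat{y}(x) = \arg\max_{y_j} P_j(x)$$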

9. The sustainable integrated intrusion detection method based on SSA optimization according to claim 1, wherein the modeling process of the SSA-based weight optimization algorithm in the step (5) is:

a. setting a fitness function:

b. setting parameters; the population size is set to 30, the maximum number of iterations to 200, the upper search bound to 1 and the lower search bound to 0;

e. searching the population; the positions of the leader and the followers are updated according to the population updating formulas; in the salp population, the first individual acts as the leader, and its position updating formula is as follows: