CN112910866B - Feature selection method for network intrusion detection - Google Patents


Info

Publication number
CN112910866B
Authority
CN
China
Prior art keywords
feature
network intrusion
subset
features
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110076965.9A
Other languages
Chinese (zh)
Other versions
CN112910866A (en)
Inventor
李珊珊
韦世红
李兆玉
赖雪梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202110076965.9A
Publication of CN112910866A
Application granted
Publication of CN112910866B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to feature selection technology in the field of network security, and in particular to a feature selection method for network intrusion detection. The method comprises: acquiring a network intrusion data set and preliminarily screening it by correlation degree to obtain a first network intrusion feature subset; obtaining a second network intrusion feature subset from the preliminarily screened subset using an improved recursive feature elimination algorithm based on a random forest; and taking the second network intrusion feature subset as one part of the initial population of a genetic algorithm, randomly generating the other part, and iterating to obtain a third network intrusion feature subset. To address feature selection for high-dimensional network intrusion data, the data set is first screened with a correlation measure; an improved recursive feature elimination algorithm based on a random forest classifier then yields a good feature subset, which seeds one part of the initial population of a genetic algorithm while the other part is generated randomly; and the genetic algorithm finally searches for the feature subset with the best classification effect.

Description

Feature selection method for network intrusion detection
Technical Field
The invention relates to a feature selection technology in the field of network security, in particular to a feature selection method for network intrusion detection.
Background
With the rapid development of network technology, network environments carry massive amounts of data, and network security has become very important. An insecure network environment can lead to privacy leaks, theft of resources and other problems, causing considerable losses in people's work and daily life. Network intrusion detection has therefore become a research hotspot. It analyzes network information to determine whether behavior that violates the security policy, or signs of attack, exist in the network.
Feature selection on network intrusion data is a crucial step in network intrusion detection and directly affects subsequent detection performance. Feature selection can effectively reduce data dimensionality and computational complexity and improve classifier accuracy. It aims to remove from the feature set those features that have low correlation with the class labels or high redundancy, and to find a representative feature subset with the fewest features and the best result while preserving the classification effect as far as possible. Depending on how the feature set is combined with the learning algorithm, common feature selection methods fall into three categories: filter, wrapper and embedded.
Commonly used feature selection methods include the mutual information method, the chi-square test, and combinations of swarm intelligence algorithms with classification algorithms. These achieve good classification performance, but they do not address the removal of weakly relevant features and redundant features at the same time. Moreover, as the dimensionality and scale of network intrusion data keep growing, the overhead of intrusion detection algorithms increases, and redundant and irrelevant attributes degrade detection. How to select a feature subset with the highest accuracy and the fewest features therefore remains a research difficulty.
Disclosure of Invention
Based on the problems in the prior art, and in order to select a network intrusion feature subset with a better classification effect, the invention provides a feature selection method for network intrusion detection, comprising the following steps:
S1, preliminarily screening the network intrusion data set by correlation degree to obtain a first network intrusion feature subset;
S2, obtaining a second network intrusion feature subset from the preliminarily screened subset using an improved recursive feature elimination algorithm based on a random forest;
S3, taking the second network intrusion feature subset as one part of the initial population of the genetic algorithm, randomly generating the other part of the initial population, and iterating to obtain a third network intrusion feature subset.
The invention has the beneficial effects that:
To address feature selection for high-dimensional data, and given the high dimensionality and large scale of network intrusion data, the invention first uses mutual information as the correlation measure to preliminarily screen the network intrusion data set. It then applies an improved recursive feature elimination algorithm based on a random forest classifier to obtain a good feature subset, uses that subset as one part of the initial population of a genetic algorithm while the other part is generated randomly, and finally uses the genetic algorithm to obtain the feature subset with the best classification effect.
Drawings
Fig. 1 is a flowchart of a feature selection method for network intrusion detection according to an embodiment of the present invention;
FIG. 2 is a flow chart of the preliminary screening of feature sets in an embodiment of the present invention;
FIG. 3 is a flow chart of an improved algorithm for recursive feature elimination based on a random forest according to an embodiment of the present invention;
FIG. 4 is a flow chart of random forest generation in an embodiment of the present invention;
FIG. 5 is a flow chart of feature selection based on a genetic algorithm in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As computer network technology develops, network intrusion techniques change constantly, and traditional simple network defenses can no longer cope with current intrusions. Network intrusion detection is in essence a classification problem and mainly involves feature selection, classification model selection, parameter optimization and related techniques. The invention builds on feature selection to improve the classification accuracy of intrusion detection on network intrusion data.
Fig. 1 is a flowchart of a feature selection method for network intrusion detection according to an embodiment of the present invention, and as shown in fig. 1, the feature selection process includes:
S1, preliminarily screening the network intrusion data set by correlation degree to obtain a first network intrusion feature subset;
The invention uses the NSL-KDD data set for testing and analysis. In feature selection, choosing the data set is the first step in studying and evaluating a method, and the quality of the data set directly determines how the various methods are evaluated. The feature selection results on the data set used here reflect properties shared by most data sets; therefore, although the final feature selection accuracy differs slightly between data sets, the feature selection method of the invention can generally improve it.
The NSL-KDD data set used in the embodiment of the invention is a cleaned version of the KDD99 data set from which a large amount of redundant, duplicated data has been removed, making it better suited to intrusion detection experiments. The NSL-KDD data set still has some problems, but it remains an effective benchmark that helps researchers compare different intrusion detection methods. Its training and test sets are reasonably configured, so evaluation results from different studies are consistent and comparable. The data set comprises a training set and a test set; the training set contains 22 attack types and the test set contains 17 attack types, which fall into four main categories: DoS, Probe, U2R and R2L. The data set contains 41 features and 1 class label; three of the features, protocol_type, service and flag, are categorical.
The network intrusion data set is then preprocessed. The embodiment of the invention handles discrete data with one-hot encoding, converting character-type features into numerical features. The numerical values are then standardized, which comprises data standardization and numerical normalization; this brings every feature into the same order of magnitude so that each contributes equally to the intrusion detection classification result.
The preprocessing is applied to each feature in the data set.
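The preprocessing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the helper names `one_hot_encode` and `min_max_normalize` are hypothetical, the toy records borrow three NSL-KDD column names with made-up values, and only min-max scaling to [0, 1] is shown for the normalization stage.

```python
def one_hot_encode(rows, categorical_cols):
    """Replace each categorical column with one 0/1 column per distinct value."""
    # Collect the sorted distinct values of every categorical column.
    categories = {c: sorted({row[c] for row in rows}) for c in categorical_cols}
    encoded = []
    for row in rows:
        out = {}
        for col, value in row.items():
            if col in categories:
                for cat in categories[col]:
                    out[f"{col}={cat}"] = 1.0 if value == cat else 0.0
            else:
                out[col] = float(value)
        encoded.append(out)
    return encoded


def min_max_normalize(rows):
    """Scale every column to [0, 1]; constant columns map to 0."""
    cols = list(rows[0].keys())
    lo = {c: min(r[c] for r in rows) for c in cols}
    hi = {c: max(r[c] for r in rows) for c in cols}
    return [
        {c: (r[c] - lo[c]) / (hi[c] - lo[c]) if hi[c] > lo[c] else 0.0 for c in cols}
        for r in rows
    ]


# Toy records; the values are invented for illustration only.
records = [
    {"duration": 0, "protocol_type": "tcp", "src_bytes": 200},
    {"duration": 10, "protocol_type": "udp", "src_bytes": 0},
    {"duration": 5, "protocol_type": "tcp", "src_bytes": 100},
]
prepared = min_max_normalize(one_hot_encode(records, ["protocol_type"]))
```

After this step every column, including the one-hot indicator columns, lies in the same [0, 1] range.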
Fig. 2 is a flowchart of preliminary screening of a feature set in an embodiment of the present invention, and as shown in fig. 2, the preliminary screening of the network intrusion data set by using a correlation includes:
S11, calculating the mutual information between each feature and the category label set in the network intrusion data set;
Suppose $f_i$ is the $i$-th feature in the feature set $F$ of the network intrusion data set, with $f_i = \{f_{i1}, f_{i2}, \ldots, f_{il}\}$, where $l$ is the number of values taken by feature $f_i$. Let $c_k$ denote the $k$-th class in the category label set $C = \{c_1, c_2, \ldots, c_s\}$ of the data set. The mutual information between feature $f_i$ and the label set $C$ is then defined as:
$$I(f_i; C) = \sum_{j=1}^{l} \sum_{k=1}^{s} p(f_{ij}, c_k) \log \frac{p(f_{ij}, c_k)}{p(f_{ij})\, p(c_k)}$$
S12, calculating the correlation degree between each feature and the category label set from the mutual information, according to the following formula:
[Equation image BDA0002907880210000042; the formula is not reproduced in the source text]
where m denotes the feature dimension.
S13, based on the correlation degree between each feature and the category label set, sorting the features by correlation degree and preliminarily screening the feature set.
The highest-ranked features are taken as the first network intrusion feature subset $S_1$.
Extensive testing showed that selecting the top 80%-90% of the features works best; in the preferred embodiment of the invention, the features ranked in the top 85% by correlation degree are selected as the first network intrusion feature subset $S_1$.
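Steps S11 to S13 can be sketched as below. Because the correlation degree formula is not reproduced in this text, the sketch ranks features directly by their empirical mutual information with the labels; that substitution, the function names and the `keep_ratio` parameter are all assumptions. Features are treated as discrete, so continuous columns would need to be binned first.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Empirical mutual information I(f; C), in nats, of two discrete sequences."""
    n = len(labels)
    count_f = Counter(feature_values)
    count_c = Counter(labels)
    count_fc = Counter(zip(feature_values, labels))
    mi = 0.0
    for (f, c), count in count_fc.items():
        # p(f, c) * log(p(f, c) / (p(f) * p(c))), with counts turned into probabilities
        mi += (count / n) * math.log(count * n / (count_f[f] * count_c[c]))
    return mi


def screen_features(columns, labels, keep_ratio=0.85):
    """Rank features by mutual information with the labels and keep the top share."""
    scores = {name: mutual_information(values, labels) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, round(keep_ratio * len(ranked)))
    return ranked[:keep], scores
```

With the default `keep_ratio=0.85`, the call keeps the top 85% of features, matching the preferred embodiment.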
S2, obtaining a second network intrusion feature subset from the initially screened subset by adopting a random forest-based recursive feature elimination improved algorithm;
The method first uses mutual information as the correlation measure to preliminarily screen the original network intrusion data set. It then uses a random forest as the classifier and introduces feature redundancy as a measure of redundant features: while the variable with the lowest importance is deleted, the feature with the highest redundancy is deleted as well, yielding a feature subset with strong correlation and low redundancy.
Specifically, FIG. 3 shows a flowchart of the improved recursive feature elimination algorithm based on the random forest in the embodiment of the present invention. As shown in FIG. 3, the flow includes:
S21, calculating the importance of each feature in the feature set using a random forest classifier;
FIG. 4 is a flow chart of random forest generation according to an embodiment of the present invention. As shown in FIG. 4, the random forest is generated as follows:
randomly drawing a portion of the samples with replacement;
randomly drawing a portion of the features as candidate features;
determining the split feature among the candidate features using the Gini index;
generating a node of the decision tree;
if the node becomes a leaf node and the decision tree stops growing, storing the decision tree; otherwise, continuing to branch;
when the number of decision trees meets the requirement, the random forest is complete; otherwise, returning to the sample-drawing step and continuing the loop.
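The Gini-index step in the flow above can be illustrated with a short sketch. It assumes numeric features and binary threshold splits of the form `value <= threshold`, which the text does not specify; the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_of_split(rows, labels, feature, threshold):
    """Weighted Gini impurity after splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    return sum(len(part) / n * gini(part) for part in (left, right) if part)


def best_split_feature(rows, labels, candidate_features):
    """Return (feature, threshold, impurity) with the lowest weighted Gini impurity."""
    best = None
    for f in candidate_features:
        for t in sorted({x[f] for x in rows}):
            score = gini_of_split(rows, labels, f, t)
            if best is None or score < best[2]:
                best = (f, t, score)
    return best
```

Only the randomly drawn candidate features are passed in as `candidate_features`, so each node sees a different subset of the feature space.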
Feature importance is computed with the random forest as follows:
for each decision tree, compute the out-of-bag error $\mu_1$;
randomly add noise to feature $f_i$ in the out-of-bag samples and compute the out-of-bag error again, giving $\mu_2$;
let $N$ be the number of decision trees in the random forest; the importance of feature $f_i$ is then
$$\mathrm{Imp}(f_i) = \frac{1}{N} \sum_{t=1}^{N} \left( \mu_2^{(t)} - \mu_1^{(t)} \right)$$
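The out-of-bag importance measure can be sketched as follows. The sketch assumes that "adding noise" is realized by shuffling the feature's values among the out-of-bag samples, and that each tree is available as a hypothetical `(predict, oob_idx)` pair; neither detail is stated in the source.

```python
import random


def oob_error(predict, X, y, oob_idx):
    """Fraction of out-of-bag samples that one tree misclassifies."""
    wrong = sum(predict(X[i]) != y[i] for i in oob_idx)
    return wrong / len(oob_idx)


def feature_importance(trees, X, y, feature, rng):
    """Average increase in out-of-bag error after perturbing one feature.

    `trees` is a list of (predict, oob_idx) pairs: a per-tree prediction
    function and the indices of that tree's out-of-bag samples.
    """
    total = 0.0
    for predict, oob_idx in trees:
        mu1 = oob_error(predict, X, y, oob_idx)
        # Perturb the feature by shuffling its values among the OOB samples.
        shuffled = [X[i][feature] for i in oob_idx]
        rng.shuffle(shuffled)
        X_noisy = {
            i: X[i][:feature] + [v] + X[i][feature + 1:]
            for i, v in zip(oob_idx, shuffled)
        }
        mu2 = oob_error(predict, X_noisy, y, oob_idx)
        total += mu2 - mu1
    return total / len(trees)
```

A feature the trees never use leaves the error unchanged and scores zero, while a feature the trees rely on scores higher.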
S211, letting the training set contain N samples (here N denotes the training set size), drawing N training samples from the training set, randomly and with replacement, for each tree in the random forest;
S212, letting the feature dimension of each sample be M, randomly selecting m features (m < M) from the M features as input to the decision tree, and computing the optimal split from them each time the tree splits;
S213, growing each tree fully, without pruning;
S214, taking the majority vote of all trees as the final random forest classification result.
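The sampling and voting parts of S211 to S214 can be sketched as below; tree construction itself is left out. The helper names are hypothetical, and returning the out-of-bag indices alongside the bootstrap sample is an added convenience rather than something the steps require.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement; also return the out-of-bag indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def random_feature_subset(n_features, m, rng):
    """Pick m distinct candidate features out of n_features for one split."""
    return rng.sample(range(n_features), m)


def majority_vote(predictions):
    """Final forest label: the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]
```

For example, `majority_vote(["dos", "normal", "dos"])` returns `"dos"`.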
S22, sorting the features by importance;
S23, iteratively deleting the feature with the lowest importance;
S24, calculating the redundancy of each feature;
Let the feature set be $A = \{f_1, f_2, \ldots, f_k\}$. The redundancy of any feature $f_i$ in $A$ with respect to the other features in the set is defined as:
[Equation image BDA0002907880210000061; the formula is not reproduced in the source text]
S25, deleting the feature with the highest redundancy together with the feature with the lowest importance;
S26, taking the remaining features as the second network intrusion feature subset $S_2$, which is a feature subset with strong correlation and low redundancy.
Existing feature selection methods based on recursive feature elimination generally use a classifier to rank feature importance and then delete the least important features one by one. The method of the invention improves on this by considering both the importance of each feature and the redundancy between features; screening the feature subset from these two angles effectively improves its accuracy.
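The improved elimination loop of S21 to S26 can be sketched as follows. The stopping rule (`n_keep`) and the two caller-supplied scoring callbacks are assumptions: the text does not say when the recursion ends, and the redundancy formula is not reproduced here, so any pairwise measure such as mean mutual information between features would have to be supplied by the caller.

```python
def improved_rfe(features, importance_fn, redundancy_fn, n_keep):
    """Recursively drop the least important and then the most redundant feature.

    importance_fn(subset) -> {feature: importance}
    redundancy_fn(subset) -> {feature: redundancy}
    Both callbacks are re-evaluated on the shrinking subset in every round.
    """
    remaining = list(features)
    while len(remaining) > n_keep:
        importance = importance_fn(remaining)
        remaining.remove(min(remaining, key=importance.get))
        if len(remaining) <= n_keep:
            break
        redundancy = redundancy_fn(remaining)
        remaining.remove(max(remaining, key=redundancy.get))
    return remaining
```

In practice `importance_fn` would retrain the random forest on the current subset, which is what makes the elimination recursive rather than a single ranking pass.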
S3, taking the second network intrusion feature subset as one part of the initial population of the genetic algorithm, randomly generating the other part of the initial population, and iterating to obtain a third network intrusion feature subset.
FIG. 5 is a flow chart of feature selection based on the genetic algorithm in the embodiment of the present invention. The genetic algorithm proceeds as follows:
S31, using the second network intrusion feature subset $S_2$ as one part of the initial population and randomly generating the other part;
S32, encoding each individual in the initial population in binary form;
S33, calculating the fitness value of each individual according to the fitness function;
S34, applying a tournament selection operator, which selects the individual with the highest fitness value and passes it to the next generation;
S35, performing crossover and mutation to generate the next-generation population;
S36, repeating steps S33 to S35 until the maximum number of iterations is reached or the fitness value of the current population reaches a preset threshold (set to 0.9999 in the embodiment of the invention), whereupon the genetic algorithm terminates and outputs the decoded third network intrusion feature subset $S_3$.
In some preferred embodiments, the tournament selection operator comprises:
S341, determining the number N of individuals drawn in each round;
S342, randomly drawing N individuals from the population and passing the one with the best fitness value into the next-generation population;
S343, repeating until the new population reaches the size of the original population.
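Steps S341 to S343 can be sketched in a few lines. The sketch assumes that contestants are drawn without replacement within a round and that `fitness` is a caller-supplied function; the function name is hypothetical.

```python
import random


def tournament_selection(population, fitness, n, rng):
    """Build a new population of the same size by repeated n-way tournaments."""
    new_population = []
    while len(new_population) < len(population):
        contestants = rng.sample(population, n)
        new_population.append(max(contestants, key=fitness))
    return new_population
```

A larger `n` raises selection pressure: when `n` equals the population size, every slot is filled by the single fittest individual.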
In the original genetic algorithm, the initial population is generated at random from the original feature set. In the invention, the feature subset obtained by the recursive feature elimination algorithm forms one part of the initial population of the genetic algorithm, and the other part is generated randomly.
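The seeded genetic algorithm of S31 to S36 can be sketched as below. Several details are assumptions that the text leaves open: the share of the population copied from the seed subset (`seed_share`), single-point crossover, bit-flip mutation with rate `p_mut`, two-way tournaments, and keeping the best individual found so far. The `fitness` function is supplied by the caller, since the text does not define it; `pop_size` is assumed even and `n_features` at least 2.

```python
import random


def run_ga(seed_subset, n_features, fitness, pop_size=20, seed_share=0.5,
           max_iter=50, threshold=0.9999, p_mut=0.02, rng=None):
    """Search for a feature subset, starting partly from a known good subset."""
    rng = rng or random.Random()
    # Binary encoding: bit j is 1 when feature j is selected.
    seed = [1 if j in seed_subset else 0 for j in range(n_features)]
    n_seed = int(pop_size * seed_share)
    population = [seed[:] for _ in range(n_seed)]
    population += [[rng.randint(0, 1) for _ in range(n_features)]
                   for _ in range(pop_size - n_seed)]
    best = max(population, key=fitness)
    for _ in range(max_iter):
        if fitness(best) >= threshold:
            break
        # Tournament selection with two contestants per round.
        parents = [max(rng.sample(population, 2), key=fitness)
                   for _ in range(pop_size)]
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            cut = rng.randrange(1, n_features)  # single-point crossover
            children += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
        # Bit-flip mutation.
        population = [[bit ^ 1 if rng.random() < p_mut else bit for bit in child]
                      for child in children]
        best = max(population + [best], key=fitness)
    # Decode the best bit string back into feature indices.
    return [j for j, bit in enumerate(best) if bit]
```

Seeding means the search never starts below the quality of the subset found by recursive feature elimination, while the random individuals preserve diversity.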
In the description of the present invention, it is to be understood that the terms "coaxial", "bottom", "one end", "top", "middle", "other end", "upper", "one side", "top", "inner", "outer", "front", "center", "both ends", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplicity of description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention.
In the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "disposed," "connected," "fixed," "rotated," and the like are to be construed broadly, e.g., as being fixedly connected, detachably connected, or integrated; can be mechanically or electrically connected; the terms may be directly connected or indirectly connected through an intermediate, and may be communication between two elements or interaction relationship between two elements, unless otherwise specifically limited, and the specific meaning of the terms in the present invention will be understood by those skilled in the art according to specific situations.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (5)

1. A feature selection method for network intrusion detection, the feature selection method comprising the steps of:
S1, collecting a network intrusion data set, and preliminarily screening the network intrusion data set by correlation degree to obtain a first network intrusion feature subset;
S2, obtaining a second network intrusion feature subset from the preliminarily screened subset using an improved recursive feature elimination algorithm based on a random forest, wherein step S2 comprises:
S21, calculating the importance of each feature in the first network intrusion feature subset using a random forest classifier;
S22, sorting the features by importance;
S23, iteratively deleting the feature with the lowest importance;
S24, calculating the redundancy of each feature;
S25, deleting the feature with the highest redundancy together with the feature with the lowest importance;
S26, taking the remaining features as the second network intrusion feature subset, the second network intrusion feature subset being a feature subset with strong correlation and low redundancy;
S3, taking the second network intrusion feature subset as one part of the initial population of a genetic algorithm, randomly generating the other part of the initial population, and iterating to obtain a third network intrusion feature subset, wherein the third network intrusion feature subset is generated as follows:
encoding each individual in the initial population in binary form; calculating the fitness value of each individual according to the fitness function; applying a tournament selection operator, which selects the individual with the highest fitness value and passes it to the next generation; performing crossover and mutation to generate the next-generation population; and, when the maximum number of iterations is reached or the fitness value of the current population reaches a preset threshold, stopping the iteration and outputting the decoded third network intrusion feature subset.
2. The method of claim 1, wherein the preliminary screening of the network intrusion data set by correlation degree is preceded by preprocessing the network intrusion data set, in which the data set is divided into a plurality of features and a class label; discrete network intrusion data are processed with one-hot encoding, character-type features are converted into numerical features, and the numerical features are standardized, the standardization comprising data standardization and numerical normalization.
3. The method of claim 1, wherein the preliminary screening of the network intrusion data set using the correlation degree comprises:
calculating the mutual information between each feature and the category label set in the network intrusion data set;
calculating the correlation degree between each feature and the category label set from the mutual information;
sorting the features by their correlation degree with the category label set, and preliminarily screening the feature set.
4. The feature selection method for network intrusion detection according to claim 1, wherein the first network intrusion feature subset is generated by calculating the correlation degree between each feature and the category label set and taking the features ranked highest by correlation degree as the first network intrusion feature subset.
5. The feature selection method for network intrusion detection according to claim 1, wherein the tournament selection operator comprises randomly drawing a plurality of individuals from the population and passing the individual with the highest fitness value among them into the next-generation population, and repeating these steps until the new population has the same size as the initial population.
CN202110076965.9A 2021-01-20 2021-01-20 Feature selection method for network intrusion detection Active CN112910866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110076965.9A CN112910866B (en) 2021-01-20 2021-01-20 Feature selection method for network intrusion detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110076965.9A CN112910866B (en) 2021-01-20 2021-01-20 Feature selection method for network intrusion detection

Publications (2)

Publication Number Publication Date
CN112910866A CN112910866A (en) 2021-06-04
CN112910866B CN112910866B (en) 2022-07-29

Family

ID=76117277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110076965.9A Active CN112910866B (en) 2021-01-20 2021-01-20 Feature selection method for network intrusion detection

Country Status (1)

Country Link
CN (1) CN112910866B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420291B (en) * 2021-07-19 2022-06-14 宜宾电子科技大学研究院 Intrusion detection feature selection method based on weight integration
CN115242431A (en) * 2022-06-10 2022-10-25 国家计算机网络与信息安全管理中心 Industrial Internet of things data anomaly detection method based on random forest and long-short term memory network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111343171A (en) * 2020-02-19 2020-06-26 重庆邮电大学 Intrusion detection method based on mixed feature selection of support vector machine

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106778836A (en) * 2016-11-29 2017-05-31 天津大学 A kind of random forest proposed algorithm based on constraints
US10565528B2 (en) * 2018-02-09 2020-02-18 Sas Institute Inc. Analytic system for feature engineering improvement to machine learning models
CN110166454B (en) * 2019-05-21 2021-11-16 重庆邮电大学 Mixed feature selection intrusion detection method based on adaptive genetic algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
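Hybrid feature selection method based on adaptive genetic algorithm; 裴作飞 (Pei Zuofei) et al.; 《计算机应用与软件》 (Computer Applications and Software); 2020-08-12 (No. 08); full text *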

Also Published As

Publication number Publication date
CN112910866A (en) 2021-06-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant