CN111314353B - Network intrusion detection method and system based on hybrid sampling

Info

Publication number: CN111314353B
Application number: CN202010103246.7A
Authority: CN
Inventors: 熊炫睿; 陈高升; 熊炼; 张媛; 程占伟; 付明凯; 刘敏
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2020-02-19
Filing date: 2020-02-19
Publication date: 2022-09-02
Anticipated expiration: 2040-02-19
Also published as: CN111314353A

Abstract

The invention relates to the technical field of network intrusion detection, and in particular to a network intrusion detection method and system based on hybrid sampling. The method comprises: converting the symbolic attributes in a historical network intrusion data set into numeric attributes; normalizing the data set to the interval [0, 1]; sampling the data set with a hybrid sampling algorithm to obtain a training set in which every class is balanced; training a BP neural network classifier on the resulting training set; and inputting real-time network intrusion data into the trained BP neural network classifier, which outputs the class of that data. The invention discards fewer majority class samples, thereby reducing the loss of information that is valuable for constructing the classifier; compared with intrusion detection based on SMOTE oversampling, it reduces the noise introduced when new minority class samples are generated, so that the algorithm achieves better classification performance on unbalanced data.

Description

Network intrusion detection method and system based on hybrid sampling

Technical Field

The invention relates to the technical field of network intrusion detection, in particular to a network intrusion detection method and system based on hybrid sampling.

Background

In recent years, machine learning methods have increasingly been applied to network intrusion detection, which is treated as a classification problem. Some types of network attack occur frequently while others occur rarely, so intrusion detection is a typical application with unbalanced data. When trained on unbalanced data, machine learning classifies majority class intrusion samples well but minority class intrusion samples poorly, even though detecting minority class intrusions is just as important. Existing approaches to unbalanced data in network intrusion detection systems include techniques based on SMOTE oversampling and techniques based on clustering undersampling.

Yan 26170, Hao, Korea et al. use an improved SMOTE algorithm to generate new minority class samples, increasing the number of minority class samples, and train a deep recurrent neural network classifier on the resulting balanced data set for network intrusion detection. The intrusion detection method proposed by chenhong, xiaoyue, xiaojiulong et al., which fuses SMOTE with maximum dissimilarity coefficient density, combines a SMOTE algorithm based on maximum dissimilarity coefficient density, a deep belief network and a gradient boosting decision tree: the SMOTE variant oversamples the minority class samples, and a gradient boosting decision tree classifier is then trained on the preprocessed balanced data set. The anomaly detection method based on SMOTE and deep belief networks proposed by Shenshuli, Shuzewain et al. uses the SMOTE algorithm to add minority class samples and then trains a deep belief network classifier on the resulting balanced data set.

However, when classifying extremely unbalanced data, the plain SMOTE oversampling algorithm has to generate a large number of new minority class samples and therefore introduces too much noise, which degrades classification performance.

In "Improving Detection Accuracy for Improving Network Intrusion Detection", Miah M O, Khan S, Shatabda S, et al. use a clustering-based undersampling method to reduce the majority class samples and then use a random forest classifier to perform network intrusion detection. In "Multi-level hybrid supported vector machine and transformed K-means for intrusion detection system", Al-Yaseen W L, Othman Z A, Nazri M Z A, et al. use an improved K-means clustering algorithm to generate a smaller, abstracted data set, which reduces the degree of class imbalance to some extent, and then use SVM and ELM for network intrusion detection.

However, after clustering the majority classes, these network intrusion detection techniques based on clustering undersampling select samples at the level of whole clusters and do not take into account the information of all the samples within each cluster, so the selected majority class samples may not be representative enough.

Disclosure of Invention

Existing machine-learning-based network intrusion detection techniques face two problems when balancing extremely unbalanced intrusion data: simple undersampling has to discard a large number of majority class samples, which loses a large amount of potential information that is valuable for constructing the classifier, and the simple SMOTE algorithm has to generate a large number of new minority class samples, which introduces serious noise. To address these problems, the invention provides a network intrusion detection method and system based on hybrid sampling. The method is shown in Fig. 1 and specifically comprises the following steps:

S1, converting the symbolic attributes in the historical network intrusion data set into numeric attributes;

S2, normalizing the historical network intrusion data set to the interval [0, 1];

S3, sampling the historical network intrusion data set with a hybrid sampling algorithm to obtain a training set in which every class is balanced;

S4, training a BP neural network classifier on the obtained training set;

and S5, inputting real-time network intrusion data into the trained BP neural network classifier, which outputs the class of the real-time network intrusion data.
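Steps S1 and S2 can be illustrated with a minimal sketch. It assumes that symbolic values are mapped to integer codes in order of first appearance and that normalization is column-wise min-max scaling; the patent text does not fix either choice, and the function names and toy records are hypothetical.

```python
def encode_symbolic(rows, symbolic_cols):
    """S1: map each distinct symbolic value in the given columns to an integer code."""
    mappings = {c: {} for c in symbolic_cols}
    encoded = []
    for row in rows:
        new_row = list(row)
        for c in symbolic_cols:
            codes = mappings[c]
            # Assumption of this sketch: codes follow order of first appearance.
            new_row[c] = codes.setdefault(row[c], len(codes))
        encoded.append([float(v) for v in new_row])
    return encoded, mappings


def min_max_normalize(rows):
    """S2: scale every column to [0, 1]; constant columns are mapped to 0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Toy records (duration, protocol, bytes); the values are made up for illustration.
raw = [(0, "tcp", 200), (2, "udp", 100), (4, "tcp", 300)]
encoded, mappings = encode_symbolic(raw, symbolic_cols=[1])
normalized = min_max_normalize(encoded)
```

When real-time data is classified in S5, the same code mappings and the same per-column minima and maxima obtained from the historical data would have to be reused, otherwise the classifier would see differently scaled inputs.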

Further, the process of sampling the network intrusion history data set by using the hybrid sampling algorithm and training the BP neural network classifier comprises the following steps:

S101, among the historical data containing N classes of intrusion attacks, assigning each class whose number of samples is greater than the balanced sampling number m to the majority classes, and every other class to the non-majority classes, wherein the non-majority classes comprise the minority classes, whose number of samples is less than m, and the classes whose number of samples is equal to m;

S102, oversampling each minority class sample set with SMOTE so that the number of samples in each minority class approaches the balanced sampling number m;

S103, clustering each class sample set separately with K-means to generate z clusters per class, extracting the representative sample of each cluster without replacement, and taking the N × z extracted samples as the initial balanced sample set;

S104, training an initial BP neural network classifier on the initial balanced sample set, setting the number of sampling iterations T, and initializing the iteration counter t = 1;

S105, for each majority class sample set, extracting z samples without replacement by undersampling based on the average classification error rate of the samples in each cluster;

S106, randomly extracting z samples without replacement from each non-majority class sample set, and adding them to the balanced sample set;

S107, retraining the BP neural network classifier on the updated balanced sample set;

and S108, determining whether t is equal to T - 1; if so, ending the iteration and outputting the trained BP neural network classifier; otherwise, setting t = t + 1 and returning to S105.
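The class partition of step S101 can be sketched as follows. The function name, the class labels and the sample counts are hypothetical and serve only to show how the balanced sampling number m separates the classes.

```python
def split_classes(class_counts, m):
    """Partition classes by the balanced sampling number m (step S101).

    Returns (majority, minority, equal); minority and equal together
    form the non-majority classes.
    """
    majority = [c for c, n in class_counts.items() if n > m]
    minority = [c for c, n in class_counts.items() if n < m]
    equal = [c for c, n in class_counts.items() if n == m]
    return majority, minority, equal


# Hypothetical per-class sample counts, for illustration only.
counts = {"A": 5000, "B": 300, "C": 1000, "D": 40}
majority, minority, equal = split_classes(counts, m=1000)
non_majority = minority + equal
```

Only the classes in `minority` would be oversampled in S102; a class with exactly m samples needs neither oversampling nor error-driven undersampling.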

Further, the process of undersampling the majority class samples based on the average classification error rate of the samples in each cluster comprises:

clustering again with K-means the samples of each majority class that have not yet been sampled into the balanced sample set, generating m clusters per class;

calculating the average classification error rate of each cluster, extracting the representative sample of each of the z clusters with the highest average classification error rate, adding these samples to the balanced sample set, and deleting them from the set of majority class samples not yet sampled into the balanced sample set.
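The selection step above can be sketched under the assumption that each cluster's average error rate and representative sample index have already been computed. The data layout (a dict per cluster) and all values are hypothetical.

```python
def pick_hardest_representatives(clusters, z):
    """Return the representatives of the z clusters with the highest error rate.

    `clusters` maps a cluster id to a dict with
      "error": average classification error rate of the cluster,
      "representative": index of the cluster's representative sample.
    """
    ranked = sorted(clusters, key=lambda cid: clusters[cid]["error"], reverse=True)
    return [clusters[cid]["representative"] for cid in ranked[:z]]


# Hypothetical state for one majority class.
pool = {10, 11, 12, 13, 14, 15}   # indices not yet in the balanced sample set
balanced = set()                  # indices already in the balanced sample set
clusters = {
    0: {"error": 0.10, "representative": 10},
    1: {"error": 0.60, "representative": 12},
    2: {"error": 0.35, "representative": 15},
}

chosen = pick_hardest_representatives(clusters, z=2)
balanced.update(chosen)           # add the selected samples to the balanced set
pool.difference_update(chosen)    # sampling without replacement
```

Removing the chosen indices from `pool` is what makes the extraction "without replacement": the next iteration re-clusters only the samples that remain.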

The invention provides a network intrusion detection system based on hybrid sampling, which comprises a historical data storage module, an attribute conversion module, a normalization module, a sampling module, a BP neural network classifier training module and a real-time prediction module, wherein:

the historical data storage module is used for storing classified network intrusion data;

the attribute conversion module is used for converting the symbol attribute in the network intrusion data into a digital attribute;

the normalization module is used for normalizing the attribute-converted network intrusion data to the interval [0, 1];

the sampling module is used for sampling the network historical data to ensure the data volume balance of the training data;

the BP neural network classifier training module is used for training the BP neural network according to training data to obtain a BP neural network classifier;

and the real-time prediction module is used for inputting real-time network intrusion data into the BP neural network classifier to obtain the type of the network intrusion.

Further, the sampling module comprises a data classification unit, a minority class sampling unit, a sample primary selection unit and a majority class sampling unit, wherein:

the data classification unit is used for dividing the attack classes in the historical data into majority classes and non-majority classes according to the balanced sampling number m, wherein the non-majority classes comprise the network intrusion attack classes whose number of samples is less than m and those whose number of samples is equal to m;

the minority class sampling unit is used for oversampling with SMOTE so that the number of samples in each minority class approaches the balanced sampling number m;

the sample primary selection unit is used for clustering with K-means so that each network intrusion attack class generates z clusters, extracting the representative sample of each cluster without replacement, and taking the N × z extracted samples as the initial balanced sample set;

and the majority class sampling unit is used for clustering again with K-means the majority class samples not selected by the sample primary selection unit, generating m clusters per class, calculating the average classification error rate of each cluster, and extracting without replacement the representative points of the z clusters with the highest average classification error rate.

Having converted an extremely unbalanced data set into a balanced one, the technique, compared with intrusion detection based on clustering undersampling, discards fewer majority class samples and thereby loses less of the information that is valuable for constructing the classifier; it also uses the average classification error rate of the samples in each cluster to capture the overall classification information of all samples in that cluster, so that more representative majority class samples are selected.

Drawings

Fig. 1 is a schematic flow chart of a network intrusion detection method based on hybrid sampling according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

The invention provides a network intrusion detection method based on hybrid sampling, which specifically comprises the following steps:

S4, training a BP neural network classifier on the obtained training set;

In the invention, the process of sampling the historical network intrusion data set with the hybrid sampling algorithm and training the BP neural network classifier comprises the following steps:

S104, training an initial BP neural network classifier on the initial balanced sample set, setting the number of iterations T for the BP neural network classifier, and initializing the iteration counter t = 1;

S105, applying undersampling based on the average classification error rate of the samples in each cluster to the majority class samples;

S106, randomly extracting z samples without replacement from the remaining samples of each non-majority class, and adding them to the balanced sample set;

and S108, determining whether t is equal to T - 1; if so, ending the iteration and outputting the trained BP neural network classifier; otherwise, setting t = t + 1 and returning to S105.
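The control flow of S104 to S108 can be sketched as a loop. The sketch assumes that a lowercase counter t starts at 1 and that the loop stops once t equals T - 1, where T is the configured number of iterations. This is a reading of the stopping test, not a statement of the original implementation. The three callbacks are hypothetical stand-ins for the majority class undersampling, the random non-majority draw and the BP neural network training.

```python
def iterate_sampling(initial, T, draw_majority, draw_non_majority, retrain):
    """Grow the balanced sample set and retrain until t == T - 1."""
    if T < 2:
        # With t starting at 1, the test t == T - 1 could never succeed.
        raise ValueError("this sketch assumes T >= 2")
    balanced = list(initial)                   # initial balanced sample set (S103)
    t = 1                                      # S104: initialise the counter
    while True:
        balanced.extend(draw_majority(t))      # S105: error-driven undersampling
        balanced.extend(draw_non_majority(t))  # S106: random draw, no replacement
        retrain(balanced)                      # S107: retrain on the updated set
        if t == T - 1:                         # S108: stopping test
            return balanced, t
        t += 1


# Stand-in callbacks that only record what happens in each round.
log = []
balanced, last_t = iterate_sampling(
    initial=["init"],
    T=4,
    draw_majority=lambda t: [("maj", t)],
    draw_non_majority=lambda t: [("non", t)],
    retrain=lambda samples: log.append(len(samples)),
)
```

Under this reading, T = 4 yields three sampling rounds, so the classifier is retrained three times after the initial training of S104.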

In this embodiment, each minority class sample set is oversampled with SMOTE, and the oversampling rate of minority class i is set according to the following expression:

[formula not reproduced in the source text]

where the quantity defined by the expression is the SMOTE oversampling rate for minority class i; $S_i$ is the sample set of the i-th intrusion attack class, and $|S_i|$ is the number of samples in $S_i$.
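The interpolation at the heart of SMOTE can be sketched as follows. Because the oversampling rate formula is not available here, the number of synthetic samples `n_new` is simply passed in; that parameter, the neighbour count `k`, the seed and the toy points are all assumptions of this sketch.

```python
import random


def smote(samples, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest neighbours in the same class."""
    if len(samples) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(samples) - 1)

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Nearest neighbours of the base point, excluding the point itself.
        others = [s for s in samples if s is not base]
        neighbours = sorted(others, key=lambda s: sq_dist(base, s))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class samples, already normalized to [0, 1]; values are made up.
minority = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.15, 0.25]]
new_points = smote(minority, n_new=6, k=2)
```

Every synthetic point lies between two real minority samples, which is why generating very many of them from very few originals tends to add noise, the problem the hybrid scheme is meant to limit.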

K-means clustering is applied to each class sample set separately, generating z clusters per class, where z is expressed as:

[formula not reproduced in the source text]

Preferably, in this embodiment, the balanced sampling number m is a number between the sample count of the network intrusion attack class with the fewest samples in the historical data and the sample count of the class with the most samples.
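The per-class clustering and representative extraction can be sketched with a plain Lloyd's-algorithm K-means, taking the representative of a cluster to be the sample closest to its center, as this embodiment does. The random initialization, the fixed iteration count and the toy points are assumptions of the sketch; a library implementation could be substituted.

```python
import random


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_center(point, centers):
    return min(range(len(centers)), key=lambda c: sq_dist(point, centers[c]))


def kmeans(points, z, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (centers, cluster index per point)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, z)]
    for _ in range(iters):
        assign = [nearest_center(p, centers) for p in points]
        for c in range(z):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment so that labels match the returned centers.
    assign = [nearest_center(p, centers) for p in points]
    return centers, assign


def representatives(points, centers, assign):
    """Index of the sample closest to each non-empty cluster's center."""
    reps = []
    for c, center in enumerate(centers):
        idx = [i for i, a in enumerate(assign) if a == c]
        if idx:
            reps.append(min(idx, key=lambda i: sq_dist(points[i], center)))
    return reps


# Toy samples of a single class forming two well-separated groups.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
centers, assign = kmeans(points, z=2)
reps = representatives(points, centers, assign)
```

Running this once per class and collecting the representatives gives the N × z samples of the initial balanced sample set.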

In this embodiment, the process of undersampling the majority class samples based on the average classification error rate of the samples in each cluster includes:

calculating the average classification error rate of each cluster, extracting the representative point of each of the z clusters with the highest average classification error rate, adding these samples to the balanced sample set, and deleting them from the set of majority class samples not yet sampled into the balanced sample set.

The sample closest to the cluster center is taken as the representative of each cluster. Given a classifier f and a cluster C whose sample labels are known, the average classification error rate V(C) of the samples in cluster C is defined as:

$$V(C) = \frac{1}{|C|} \sum_{j=1}^{|C|} I\big(y_j \neq f(x_j)\big)$$

where V(C) is the average classification error rate of the samples within cluster C; $|C|$ is the number of samples in cluster C; $x_j$ is the j-th sample within cluster C; I is the indicator function, which returns 1 if its argument is true and 0 otherwise; $y_j$ is the true label of sample $x_j$; and $f(x_j)$ is the label predicted by classifier f for sample $x_j$.

V(C) summarizes the classification results of all samples in the cluster. The larger V(C) is, the higher the classifier's average error rate on the samples in cluster C, which indicates that the classifier lacks sufficient information about that cluster. The representative point closest to the cluster center carries a large amount of information about the cluster, so the classifier needs to learn from it to improve its performance. Conversely, the smaller V(C) is, the more accurately the classifier classifies the samples in cluster C, which indicates that it already has enough information about that cluster.
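As a small worked example of the average classification error rate, the following sketch computes V(C) as the fraction of samples in a cluster whose predicted label differs from the true label. The function name and the labels are hypothetical.

```python
def average_error_rate(true_labels, predicted_labels):
    """V(C): fraction of samples in the cluster that the classifier mislabels."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("need two non-empty label lists of equal length")
    wrong = sum(1 for y, p in zip(true_labels, predicted_labels) if y != p)
    return wrong / len(true_labels)


# A cluster of four samples from one class, two of which are misclassified.
y_true = ["DoS", "DoS", "DoS", "DoS"]
y_pred = ["DoS", "Normal", "DoS", "Probe"]
v = average_error_rate(y_true, y_pred)
```

Here two of the four predictions are wrong, so V(C) is 0.5; a cluster with this value would be ranked ahead of any cluster the classifier already handles well.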

Specifically, this embodiment uses KDD99, a public data set commonly used in network intrusion detection. It contains 5 classes: Normal and 4 attack classes, namely DoS, Probe, U2R and R2L. The number of samples per class and the maximum imbalance degree of the data set are shown in Table 1. The maximum imbalance degree is defined as the ratio of the number of samples in the largest class to the number of samples in the smallest class, and it measures how unbalanced the data set is. In KDD99 the largest class is DoS and the smallest is U2R; the maximum imbalance degree is very large, so KDD99 is an extremely unbalanced data set.
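The maximum imbalance degree defined above is a simple ratio, sketched below. The per-class counts are hypothetical placeholders chosen for easy arithmetic; they are not the KDD99 figures of Table 1.

```python
def max_imbalance_degree(class_counts):
    """Ratio of the largest class size to the smallest class size."""
    sizes = list(class_counts.values())
    if not sizes or min(sizes) <= 0:
        raise ValueError("every class needs at least one sample")
    return max(sizes) / min(sizes)


# Hypothetical counts for illustration only (not the values of Table 1).
counts = {"Normal": 900, "DoS": 4000, "Probe": 50, "R2L": 20, "U2R": 5}
degree = max_imbalance_degree(counts)
```

With these placeholder counts the ratio is 4000 / 5 = 800, meaning the largest class has 800 times as many samples as the smallest.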

TABLE 1

The parameter settings in the present invention are shown in Table 2.

TABLE 2

In a specific implementation, the symbolic attributes in the KDD99 training set are first converted into numeric attributes;

the KDD99 training set is normalized to the interval [0, 1];

the KDD99 training set is sampled with the hybrid sampling algorithm proposed herein to obtain a new training set in which every class is balanced;

the neural network is trained on the new training set;

and intrusion data is input online, and the neural network outputs the intrusion class.

The invention can also feed data that has been successfully classified back into the system as historical training data.

The invention provides a network intrusion detection system based on hybrid sampling, which comprises a historical data storage module, an attribute conversion module, a normalization module, a sampling module, a BP neural network classifier training module and a real-time prediction module, wherein:

the historical data storage module is used for storing the classified network intrusion data;

the BP neural network classifier training module is used for training a BP neural network according to training data to obtain a BP neural network classifier;

the sample primary selection unit is used for clustering with K-means so that each network intrusion attack class generates z clusters, extracting the representative sample of each cluster without replacement, and taking the N × z extracted samples as the initial balanced sample set;

the majority class sampling unit is used for clustering again with K-means the majority class samples not selected by the sample primary selection unit, generating m clusters per class, calculating the average classification error rate of each cluster, and extracting without replacement the representative points of the z clusters with the highest average classification error rate. Preferably, the representative point of a cluster in this embodiment is the sample closest to the center of that cluster.

Further, the BP neural network classifier training module trains an initial BP neural network classifier on the samples selected by the sample primary selection unit and sets the number of iterations once this training is complete. In each iteration it calls the majority class sampling unit to select new majority class samples, adds them to the sample set, and retrains the BP neural network classifier. When the set number of iterations is reached, it outputs the trained BP neural network classifier.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A network intrusion detection method based on hybrid sampling, characterized by comprising the following steps:

S3, sampling the historical network intrusion data set with a hybrid sampling algorithm to obtain a training set in which every class is balanced, wherein the process comprises the following steps:

S101, setting a balanced sampling number m, and, among the historical data containing N classes of intrusion attacks, assigning each class whose number of samples is greater than m to the majority classes and every other class to the non-majority classes, wherein the non-majority classes comprise the minority classes, whose number of samples is less than m, and the classes whose number of samples is equal to m;

S4, training the BP neural network classifier on the obtained training set, wherein the process comprises the following steps:

S108, determining whether t is equal to T - 1; if so, ending the iteration and outputting the trained BP neural network classifier; otherwise, setting t = t + 1 and returning to S105;

2. The method according to claim 1, wherein the sampling rate used when oversampling each minority class sample set with SMOTE is expressed as:

[formula not reproduced in the source text]

where the quantity defined by the expression is the SMOTE oversampling rate for minority class i; $S_i$ is the sample set of the i-th intrusion attack class, and $|S_i|$ is the number of samples in $S_i$.

3. The method of claim 1, wherein the under-sampling based on the mean classification error rate of the intra-cluster samples for the majority of sample classes comprises:

4. The method of claim 3, wherein the representative sample of each cluster is the sample closest to the center of that cluster.

5. The method of claim 3, wherein the average classification error rate of the samples in the cluster is expressed as:

6. A network intrusion detection system based on hybrid sampling, characterized by comprising a historical data storage module, an attribute conversion module, a normalization module, a sampling module, a BP neural network classifier training module and a real-time prediction module, wherein:

the sampling module is used for sampling the historical network data to ensure that the training data is balanced in volume across classes; the sampling module comprises a data classification unit, a minority class sampling unit, a sample primary selection unit and a majority class sampling unit, wherein:

the majority class sampling unit is used for clustering again with K-means the majority class samples not selected by the sample primary selection unit, generating m clusters per class, calculating the average classification error rate of each cluster, and extracting without replacement the representative points of the z clusters with the highest average classification error rate;

the BP neural network classifier training module is used for training the BP neural network on the training data to obtain a BP neural network classifier; the BP neural network classifier training module trains an initial BP neural network classifier on the samples selected by the sample primary selection unit, sets the number of iterations once this training is complete, calls the majority class sampling unit in each iteration to select new majority class samples and add them to the sample set, retrains the BP neural network classifier until the set number of iterations is reached, and outputs the trained BP neural network classifier;