CN112217822A - Detection method for intrusion data - Google Patents

Detection method for intrusion data

Info

Publication number
CN112217822A
CN112217822A (application CN202011088479.0A; granted as CN112217822B)
Authority
CN
China
Prior art keywords
sample
classifier
data
samples
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011088479.0A
Other languages
Chinese (zh)
Other versions
CN112217822B (en
Inventor
任午令
张晓冰
Current Assignee
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN202011088479.0A priority Critical patent/CN112217822B/en
Publication of CN112217822A publication Critical patent/CN112217822A/en
Application granted granted Critical
Publication of CN112217822B publication Critical patent/CN112217822B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/142Network analysis or design using statistical or mathematical methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/20Network architectures or network communication protocols for network security for managing network security; network security policies in general

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a detection method for intrusion data, which specifically comprises the following steps: 1) a balanced-data-set acquisition step; 2) a data classification step; and 3) a classifier evaluation step. The method improves data protection and the generalization performance of intrusion detection.

Description

Detection method for intrusion data
Technical Field
The invention relates to the technical field of network intrusion detection, and in particular to a detection method for intrusion data.
Background
With the development of the internet industry, preventing information systems from being intruded has become an important issue, and the unbalanced distribution of data is one obstacle to improving information protection effectively. A data set is unbalanced when the number of samples in one or several classes far exceeds the number of samples in the other classes; classes with a small share of the data are called minority classes, and classes with a large share are called majority classes. Network attack types are varied. Some attack types are common, such as DDoS, brute-force cracking, and ARP spoofing, while others occur rarely, such as unauthorized local superuser privileged access and unauthorized access from a remote host, so samples of these attack types are far fewer than samples of the common types. A DDoS attack may damage the entire network, degrade service performance, or interrupt services, and unauthorized remote-host access may allow the host to be controlled for illicit activities. Existing classification methods have a high recognition rate for majority-class sample points, so minority-class samples are misclassified, which causes serious problems and weakens protection against minority-class attacks. It is therefore important to process unbalanced data sets and improve the generalization performance of data intrusion detection models.
Disclosure of Invention
The invention overcomes the defects of the prior art by providing a detection method for intrusion data that improves data protection and the generalization performance of detection.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a detection method for intrusion data specifically comprises the following steps:
1) Balanced-data-set acquisition step: using a rough clustering method, compute the Euclidean distance from each sample point of the training data to the cluster center points and divide the training data into several cluster subsets; treat cluster subsets that contain few sample points and lie far away as noise points, and delete the noise data; then randomly sample the different types of intrusion data and reduce their dimensionality, reducing overfitting of the models for the different intrusion-data types, and perform intra-class balancing on the training set with an oversampling method, increasing the number of samples of some classes to obtain a balanced data set;
2) Data classification step: the classifier classifies the balanced data set and adjusts the weights of misclassified samples, improving the generalization performance of the classification model. The weak classifiers are trained over repeated iterations with the AdaBoost M1 method, and the weak classifier trained in each round participates in the next round of training. Specifically, according to the result of the previous iteration, the weights of the misclassified sample points in the training set are increased while the weights of the correctly classified sample points are decreased before entering the next iteration, which improves the classification performance of the classifier; the classifier generated in the next iteration thus focuses more on the samples misclassified by the previous one, increasing classification accuracy. Finally, the classification result is decided by voting over the classifiers generated in all iterations;
3) Classifier evaluation step: evaluate the classifier through a confusion matrix, the missing-report rate, the accuracy rate, and the ROC curve; the confusion matrix compares the classification results with the actual results and visually depicts the classifier's performance;
the missing-report rate and the accuracy rate use the following formulas:
Missing-report rate = FN/(TP + FN)
Accuracy rate = (TP + TN)/(TP + TN + FN + FP)
where FP is a sample determined to be positive but actually negative; FN is a sample determined to be negative but actually positive; TN is a sample determined to be negative that is actually negative; TP is a sample determined to be positive that is actually positive;
the ROC curve is commonly used to represent the performance of a classifier; in the best case the curve lies toward the upper-left corner, meaning a high true positive rate at a low false positive rate; the horizontal axis of the ROC curve is the false positive rate FPR and the vertical axis is the true positive rate TPR;
TPR = TP/(TP + FN)
the true positive rate is the proportion of actual positive (attack) samples that the model correctly predicts as positive;
FPR = FP/(FP + TN)
the false positive rate is the proportion of actual negative (normal) samples that the model incorrectly predicts as attacks.
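As a quick sanity check, the four measures above can be computed directly from the confusion-matrix counts. The sketch below is illustrative, not from the patent; it takes the missing-report rate to be FN/(TP + FN), i.e. the share of real attacks that go undetected, with "positive" meaning an attack sample.

```python
def detection_metrics(tp, fp, tn, fn):
    """Evaluation measures from confusion-matrix counts (positive = attack)."""
    return {
        "miss_rate": fn / (tp + fn),                  # attacks that slip through
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # overall correctness
        "tpr": tp / (tp + fn),                        # true positive rate
        "fpr": fp / (fp + tn),                        # false positive rate
    }

m = detection_metrics(tp=40, fp=10, tn=45, fn=5)
```

Note that miss_rate + tpr = 1 by construction, matching the intuition that the missing-report rate is the complement of the detection rate.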
Furthermore, each classifier generated per iteration is weighted in the final strong classifier according to its classification error rate; the lower the classification error rate, the higher the weight.
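The reweighting loop and the error-rate-based classifier weights described above can be sketched as follows. This is a minimal, binary-label (±1) illustration with a hand-rolled decision stump as the weak learner; the function names and the stump are my own choices, not the patent's setup, which uses a random forest base classifier on multi-class data.

```python
import numpy as np

def stump_fit(X, y, w):
    """Weak learner: best weighted single-feature threshold stump."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(sign * (X[:, j] - t) >= 0, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, (j, t, sign))
    return best[1], best[0]

def stump_predict(params, X):
    j, t, sign = params
    return np.where(sign * (X[:, j] - t) >= 0, 1, -1)

def adaboost_m1(X, y, n_rounds=10):
    """AdaBoost.M1 sketch: shrink weights of correctly classified samples each
    round, so misclassified samples gain weight after renormalization."""
    w = np.full(len(y), 1.0 / len(y))        # uniform initial sample weights
    ensemble = []
    for _ in range(n_rounds):
        params, eps = stump_fit(X, y, w)
        if eps >= 0.5:                       # weak learner must beat chance
            break
        beta = max(eps, 1e-10) / (1.0 - eps)
        pred = stump_predict(params, X)
        w[pred == y] *= beta                 # de-emphasize correct samples
        w /= w.sum()                         # renormalize the weights
        # lower error rate -> larger vote weight log(1/beta)
        ensemble.append((params, np.log(1.0 / beta)))
    return ensemble

def ensemble_predict(ensemble, X):
    """Weighted vote over the classifiers produced by all iterations."""
    score = sum(a * stump_predict(p, X) for p, a in ensemble)
    return np.where(score >= 0, 1, -1)
```

On a toy one-dimensional data set the boosted vote reproduces the labels after a few rounds; the `log(1/beta)` vote weight is exactly the "lower error rate, higher weight" rule stated above.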
Further, the rough clustering method uses the Euclidean distance to compute the distance from each sample point to the centroids, compares it with the set distance thresholds T1 and T2, screens out interference points in the data set according to the number of sample points in each category and the distances between the sample points and each centroid, and deletes the noise sample points; the method specifically comprises the following steps:
1.1.1) Randomly arrange the original sample set into a sample list L = {x1, x2, …, xn}, and set initial distance thresholds T1 and T2 (T1 > T2) according to cross-validation parameters;
1.1.2) Randomly select a sample point xi, i ∈ (1, n), from the list L as the centroid of the first Canopy cluster, and delete xi from the list;
1.1.3) Randomly select a sample point xp, p ∈ (1, n), p ≠ i, from the list L, compute the distances from xp to all centroids, and find the minimum distance Dmin:
if T2 ≤ Dmin ≤ T1, give xp a weak mark, indicating that xp belongs to the Canopy cluster achieving Dmin, and add xp to that cluster; if Dmin ≤ T2, give xp a strong mark, indicating that xp belongs to that Canopy cluster and is close to its centroid, and delete xp from the list; if Dmin > T1, let xp form a new cluster, and delete xp from the list;
1.1.4) Repeat step 1.1.3) until the list is empty; then delete the Canopy clusters that contain few sample points, and delete as noise those minority-class sample points near which the majority-class sample points outnumber the minority-class ones.
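A compact sketch of steps 1.1.1)–1.1.3) is given below, assuming Euclidean distance and the T1 > T2 thresholds defined above. The function and variable names are my own, and the pruning of small clusters in step 1.1.4) is left out for brevity.

```python
import numpy as np

def canopy_rough_cluster(points, t1, t2, seed=0):
    """Rough clustering: mark each point strong or weak relative to the
    nearest Canopy centroid, or promote it to a new centroid (Dmin > T1)."""
    assert t1 > t2, "step 1.1.1) requires T1 > T2"
    pts = np.asarray(points, dtype=float)
    order = list(np.random.default_rng(seed).permutation(len(pts)))
    centroids = [pts[order.pop(0)]]          # step 1.1.2): first centroid
    strong = [[]]                            # point indices with a strong mark
    weak = [[]]                              # point indices with a weak mark
    while order:                             # step 1.1.3), repeated (1.1.4)
        i = order.pop(0)
        dists = [float(np.linalg.norm(pts[i] - c)) for c in centroids]
        k = int(np.argmin(dists))
        dmin = dists[k]
        if dmin > t1:                        # far from every centroid
            centroids.append(pts[i])
            strong.append([]); weak.append([])
        elif dmin <= t2:                     # close to a centroid
            strong[k].append(i)
        else:                                # T2 < Dmin <= T1
            weak[k].append(i)
    return centroids, strong, weak
```

On two well-separated groups of points this yields two centroids, with every remaining point strongly marked and no weak marks.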
Further, in the oversampling method, minority-class sample points are selected at random, a point is picked among the k nearest neighbors of each selected sample point, and repeated interpolation forms several new minority-class samples that are added to the data set; the method specifically comprises the following steps:
1.2.1) Select a minority-class sample i from the data set, with feature vector xi, i ∈ {1, …, T};
1.2.2) Find the k nearest neighbors of sample xi among all T minority-class samples, denoted xi(near), near ∈ {1, …, k};
1.2.3) Randomly select one sample xi(nn) from the k neighbors, then generate a random number λ1 between (0, 1) to synthesize a new sample xi1:
xi1 = xi + λ1 · (xi(nn) − xi)    (1)
1.2.4) Repeat step 1.2.3) N times to synthesize N new samples: xinew, new ∈ {1, …, N}.
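Steps 1.2.1)–1.2.4) are a SMOTE-style interpolation following equation (1). Below is a minimal sketch, assuming Euclidean nearest neighbors and the x_new = x_i + λ(x_nn − x_i) form; the function name and parameters are mine, not the patent's.

```python
import numpy as np

def oversample_minority(minority, n_new, k=5, seed=0):
    """Synthesize n_new samples by interpolating each randomly chosen
    minority point toward one of its k nearest minority neighbors."""
    X = np.asarray(minority, dtype=float)
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))                 # step 1.2.1): pick x_i
        d = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]       # step 1.2.2): k neighbors
        nn = rng.choice(neighbors)               # step 1.2.3): pick x_i(nn)
        lam = rng.random()                       # lambda_1 in (0, 1)
        synthetic.append(X[i] + lam * (X[nn] - X[i]))  # equation (1)
    return np.array(synthetic)                   # step 1.2.4): N new samples
```

Each synthetic point lies on the segment between a minority sample and one of its neighbors, so the oversampling never leaves the local region of the minority class.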
Compared with the prior art, the invention has the advantages that:
the classifier is formed by fusing data preprocessing and reinforcement learning, a rough clustering method, an oversampling method and an intrusion detection classification method of Adaboost M1 are provided, a Canpoy cluster is used for rough clustering to obtain noise points, the noise points are removed, meanwhile, a down-sampling method is used for reducing the number of most categories, model overfitting is reduced, then a few categories of sample points are linearly synthesized by the rough clustering method, the number of the few categories is increased, imbalance among the categories is reduced, a balanced data set is formed, the balanced sample can well make up the defect that the number of the few categories of classified samples is insufficient, and the problem that important information is lost during random sampling is solved. The method is combined with an AdaBoost M1 classifier, a random forest is used as a base classifier, the performance of a feature subset is randomly selected to reduce the influence of data dimensionality on classification, a local optimal weak classifier is obtained in the process of each iteration, then the weight of a sample is updated, although training time is increased, compared with the result of an original unbalanced data set on an Adaboost M1 classifier, the accuracy of a few classes can be effectively improved, and the average missing report rate is reduced.
Drawings
FIG. 1 is a flow chart of a balanced data set construction of the present invention;
FIG. 2 is a flow diagram of the AdaBoostM1 framework;
FIG. 3 is a comparison of U2R on the ROC curve for an example of the present invention;
FIG. 4 is a comparison of R2L on the ROC curve for the examples of the present invention.
Detailed Description
The following specific embodiments are given to further illustrate the present invention.
As shown in fig. 1 to 4, a method for detecting intrusion data specifically includes the following steps:
1) Balanced-data-set acquisition step: using a rough clustering method, compute the Euclidean distance from each sample point of the training data to the cluster center points and divide the training data into several cluster subsets; treat cluster subsets that contain few sample points and lie far away as noise points, and delete the noise data; then randomly sample the different types of intrusion data and reduce their dimensionality, reducing overfitting of the models for the different intrusion-data types, and perform intra-class balancing on the training set with an oversampling method, increasing the number of samples of some classes to obtain a balanced data set.
The rough clustering method uses the Euclidean distance to compute the distance from each sample point to the centroids, compares it with the set distance thresholds T1 and T2, screens out interference points in the data set according to the number of sample points in each category and the distances between the sample points and each centroid, and deletes the noise sample points; the method specifically comprises the following steps:
1.1.1) Randomly arrange the original sample set into a sample list L = {x1, x2, …, xn}, and set initial distance thresholds T1 and T2 (T1 > T2) according to cross-validation parameters;
1.1.2) Randomly select a sample point xi, i ∈ (1, n), from the list L as the centroid of the first Canopy cluster, and delete xi from the list;
1.1.3) Randomly select a sample point xp, p ∈ (1, n), p ≠ i, from the list L, compute the distances from xp to all centroids, and find the minimum distance Dmin:
if T2 ≤ Dmin ≤ T1, give xp a weak mark, indicating that xp belongs to the Canopy cluster achieving Dmin, and add xp to that cluster; if Dmin ≤ T2, give xp a strong mark, indicating that xp belongs to that Canopy cluster and is close to its centroid, and delete xp from the list; if Dmin > T1, let xp form a new cluster, and delete xp from the list;
1.1.4) Repeat step 1.1.3) until the list is empty; then delete the Canopy clusters that contain few sample points, and delete as noise those minority-class sample points near which the majority-class sample points outnumber the minority-class ones.
In the oversampling method, minority-class sample points are selected at random, a point is picked among the k nearest neighbors of each selected sample point, and repeated interpolation forms several new minority-class samples that are added to the data set; the method specifically comprises the following steps:
1.2.1) Select a minority-class sample i from the data set, with feature vector xi, i ∈ {1, …, T};
1.2.2) Find the k nearest neighbors of sample xi among all T minority-class samples, denoted xi(near), near ∈ {1, …, k};
1.2.3) Randomly select one sample xi(nn) from the k neighbors, then generate a random number λ1 between (0, 1) to synthesize a new sample xi1:
xi1 = xi + λ1 · (xi(nn) − xi)    (1)
1.2.4) Repeat step 1.2.3) N times to synthesize N new samples: xinew, new ∈ {1, …, N}.
2) Data classification step: the classifier classifies the balanced data set and adjusts the weights of misclassified samples, improving the generalization performance of the classification model. The weak classifiers are trained over repeated iterations with the AdaBoost M1 method, and the weak classifier trained in each round participates in the next round of training. Specifically, according to the result of the previous iteration, the weights of the misclassified sample points in the training set are increased while the weights of the correctly classified sample points are decreased before entering the next iteration, which improves the classification performance of the classifier; the classifier generated in the next iteration thus focuses more on the samples misclassified by the previous one, increasing classification accuracy. Finally, the classification result is decided by voting over the classifiers generated in all iterations.
3) Classifier evaluation step: evaluate the classifier through a confusion matrix, the missing-report rate, the accuracy rate, and the ROC curve; the confusion matrix compares the classification results with the actual results and visually depicts the classifier's performance;
the missing-report rate and the accuracy rate use the following formulas:
Missing-report rate = FN/(TP + FN)
Accuracy rate = (TP + TN)/(TP + TN + FN + FP)
where FP is a sample determined to be positive but actually negative; FN is a sample determined to be negative but actually positive; TN is a sample determined to be negative that is actually negative; TP is a sample determined to be positive that is actually positive;
the ROC curve is commonly used to represent the performance of a classifier; in the best case the curve lies toward the upper-left corner, meaning a high true positive rate at a low false positive rate; the horizontal axis of the ROC curve is the false positive rate FPR and the vertical axis is the true positive rate TPR;
TPR = TP/(TP + FN)
the true positive rate is the proportion of actual positive (attack) samples that the model correctly predicts as positive;
FPR = FP/(FP + TN)
the false positive rate is the proportion of actual negative (normal) samples that the model incorrectly predicts as attacks.
Each classifier generated per iteration is weighted in the final strong classifier according to its classification error rate; the lower the classification error rate, the higher the weight.
Specifically, the experiment uses the KDD CUP 99 data set, which contains a large amount of network traffic data: more than 5,000,000 network connection records, along with about 2,000,000 test records. To keep the data volume manageable, the data set is randomly sampled at a ratio of 10%, the sampled result is used as the training set, and 10% of the test data is used as the test set; this effectively reduces the time needed to build the model with little effect on precision. The training data set used in this experiment contains 49,399 records. The data set has 41 features and 4 attack types: denial-of-service attacks (DOS), remote-to-local privilege-acquisition attacks (R2L), port-monitoring scanning attacks (PROBE), and user-to-root privilege-escalation attacks (U2R). The distribution of attack types in the data set is shown in Table 1 below:
Data type    Training data    Test set
Normal 97278 6059
DoS 391458 2298
Probe 4107 416
R2L 1126 161
U2R 52 228
TABLE 1
Because the samples in the original data set are unbalanced (the number of U2R records is far smaller than those of DOS and Normal), both the results and the generalization performance of the model are affected. Canopy is therefore used to remove noise points; the oversampling method increases the number of samples of the minority classes U2R and R2L, while the DOS and Normal types, which have many records, are down-sampled; the synthetic U2R and R2L data, the down-sampled records, and the Probe data are then mixed to form a new balanced data set. The data distribution of the balanced data set processed by this scheme is shown in Table 2 below:
Data type    Training data
Normal 9727
DoS 7829
Probe 4107
R2L 1126
U2R 1093
TABLE 2
Specifically, 10-fold cross-validation is first performed on the unbalanced training set obtained by random sampling: the data is divided into 10 equal parts, one part is used as the validation set and the other nine for training, iterating 10 times in turn; the average result of the 10 models is taken as the result of the whole model, which is then tested on the test set. The model trained on this training set is named AdaboostM1. The balanced data set is likewise subjected to 10-fold cross-validation with a random forest as the base classifier; a random forest can process high-dimensional data without feature selection and can balance errors on unbalanced data sets, so it combines well with Adaboost M1. The model trained on the balanced data set is named SAdaboostM1, and the model trained on the original data set is named AdaboostM1.
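The training setup described above (10-fold cross-validation with a random forest as the base classifier inside AdaBoost) can be approximated with scikit-learn. This is an illustrative sketch on synthetic stand-in data, not the KDD CUP 99 experiment itself; note that sklearn's AdaBoostClassifier implements the SAMME family rather than AdaBoost.M1 exactly, and the parameter sizes here are deliberately tiny.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for the preprocessed, balanced feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

base = RandomForestClassifier(n_estimators=10, random_state=0)
try:
    # scikit-learn >= 1.2 names the base learner 'estimator'
    model = AdaBoostClassifier(estimator=base, n_estimators=5, random_state=0)
except TypeError:
    # older releases used 'base_estimator'
    model = AdaBoostClassifier(base_estimator=base, n_estimators=5,
                               random_state=0)

scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validation
model.fit(X, y)
```

The mean of `scores` plays the role of the averaged 10-fold result described above; on the real experiment the folds would come from the balanced KDD subset instead of this synthetic matrix.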
The two models were tested separately on the test set: the SAdaboostM1 model yields the confusion matrix shown in Table 3 below, and the AdaboostM1 model the confusion matrix shown in Table 4 below. Table 5 shows the missing-report rate per category for the AdaboostM1 and SAdaboostM1 models on the test set, and Table 6 the accuracy per category.
TABLE 3 (confusion matrix of the SAdaboostM1 model; rendered as an image in the original document)
TABLE 4 (confusion matrix of the AdaboostM1 model; rendered as an image in the original document)
Classifier R2L DOS PROBE U2R
AdaboostM1 84.4 2.9 36.1 88.7
SAdaboostM1 58.6 2.6 15.5 67.2
TABLE 5
Classifier R2L DOS PROBE U2R
AdaboostM1 15.6 97.1 73.9 13.3
SAdaboostM1 30.3 97.4 81.6 42.8
TABLE 6
Tables 5 and 6 show that, after the samples are processed with this scheme, noise is reduced and the misclassification of minority-class samples caused by sample imbalance is mitigated. Without changing the original overall accuracy, the accuracy on U2R and R2L is greatly improved while the missing-report rate is reduced.
The ROC curves for U2R and R2L on the two models are compared below. The ROC curve of the minority class U2R on the SAdaboostM1 model is shown on the left of FIG. 3 (AUC = 0.9779), and on the AdaboostM1 model on the right of FIG. 3. The ROC curve of the minority class R2L on the SAdaboostM1 model is shown on the left of FIG. 4 (AUC = 0.7091), and on the AdaboostM1 model on the right of FIG. 4 (AUC = 0.6486).
In summary, attack behaviors in a network environment are varied and the numbers of collected samples of the different attack types are unbalanced, making minority-class attack behaviors difficult to identify. Canopy is therefore used to remove noise points, reducing errors when synthesizing minority-class sample points; the oversampling method artificially synthesizes data for the categories with few attack records (R2L and U2R), increasing their proportion, while the categories with many records (DOS and Normal) are reduced in number. The Adaboost M1 classifier is then trained on the balanced data set and compared with the model trained on the original data set. Without reducing the accuracy on the whole data set, this scheme improves the accuracy on the minority-class U2R attacks by 29% and on the R2L attacks by 15%, while reducing the average missing-report rate by 28%.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and improvements without departing from the spirit of the present invention, and these modifications and improvements should also be regarded as falling within the scope of the present invention.

Claims (4)

1. A detection method for intrusion data is characterized by comprising the following steps:
1) a balanced-data-set acquisition step: using a rough clustering method, computing the Euclidean distance from each sample point of the training data to the cluster center points and dividing the training data into several cluster subsets; treating cluster subsets that contain few sample points and lie far away as noise points, and deleting the noise data; then randomly sampling the different types of intrusion data and reducing their dimensionality, reducing overfitting of the models for the different intrusion-data types, and performing intra-class balancing on the training set with an oversampling method, increasing the number of samples of some classes to obtain a balanced data set;
2) a data classification step: the classifier classifies the balanced data set and adjusts the weights of misclassified samples, improving the generalization performance of the classification model; the weak classifiers are trained over repeated iterations with the AdaBoost M1 method, and the weak classifier trained in each round participates in the next round of training; specifically, according to the result of the previous iteration, the weights of the misclassified sample points in the training set are increased while the weights of the correctly classified sample points are decreased before entering the next iteration, which improves the classification performance of the classifier; the classifier generated in the next iteration thus focuses more on the samples misclassified by the previous one, increasing classification accuracy; finally, the classification result is decided by voting over the classifiers generated in all iterations;
3) a classifier evaluation step: evaluating the classifier through a confusion matrix, the missing-report rate, the accuracy rate, and the ROC curve; the confusion matrix compares the classification results with the actual results and visually depicts the classifier's performance;
the missing-report rate and the accuracy rate use the following formulas:
Missing-report rate = FN/(TP + FN)
Accuracy rate = (TP + TN)/(TP + TN + FN + FP)
where FP is a sample determined to be positive but actually negative; FN is a sample determined to be negative but actually positive; TN is a sample determined to be negative that is actually negative; TP is a sample determined to be positive that is actually positive;
the ROC curve is commonly used to represent the performance of a classifier; in the best case the curve lies toward the upper-left corner, meaning a high true positive rate at a low false positive rate; the horizontal axis of the ROC curve is the false positive rate FPR and the vertical axis is the true positive rate TPR;
TPR = TP/(TP + FN)
the true positive rate is the proportion of actual positive (attack) samples that the model correctly predicts as positive;
FPR = FP/(FP + TN)
the false positive rate is the proportion of actual negative (normal) samples that the model incorrectly predicts as attacks.
2. The detection method for intrusion data according to claim 1, characterized in that: each classifier generated per iteration is weighted in the final strong classifier according to its classification error rate; the lower the classification error rate, the higher the weight.
3. The detection method for intrusion data according to claim 1, characterized in that: the rough clustering method uses the Euclidean distance to compute the distance from each sample point to the centroids, compares it with the set distance thresholds T1 and T2, screens out interference points in the data set according to the number of sample points in each category and the distances between the sample points and each centroid, and deletes the noise sample points; the method specifically comprises the following steps:
1.1.1) Randomly arrange the original sample set into a sample list L = {x1, x2, …, xn}, and set initial distance thresholds T1 and T2 (T1 > T2) according to cross-validation parameters;
1.1.2) Randomly select a sample point xi, i ∈ (1, n), from the list L as the centroid of the first Canopy cluster, and delete xi from the list;
1.1.3) Randomly select a sample point xp, p ∈ (1, n), p ≠ i, from the list L, compute the distances from xp to all centroids, and find the minimum distance Dmin:
if T2 ≤ Dmin ≤ T1, give xp a weak mark, indicating that xp belongs to the Canopy cluster achieving Dmin, and add xp to that cluster; if Dmin ≤ T2, give xp a strong mark, indicating that xp belongs to that Canopy cluster and is close to its centroid, and delete xp from the list; if Dmin > T1, let xp form a new cluster, and delete xp from the list;
1.1.4) Repeat step 1.1.3) until the list is empty; then delete the Canopy clusters that contain few sample points, and delete as noise those minority-class sample points near which the majority-class sample points outnumber the minority-class ones.
4. A method for intrusion data detection according to claim 1, characterized in that: the oversampling method randomly selects minority-class sample points, picks a point among the k nearest minority-class neighbors of each selected sample point, and repeatedly interpolates to form a number of new minority-class samples, which are added to the data set; the method specifically comprises the following steps:
1.2.1) select a minority-class sample i from the data set, with feature vector xi, i ∈ {1, …, T};
1.2.2) find the k nearest neighbors of xi among all T minority-class samples, denoted xi(near), near ∈ {1, …, k};
1.2.3) randomly select a sample xi(nn) from the k nearest neighbors, then generate a random number λ1 between (0, 1) to synthesize a new sample xi1:
xi1 = xi + λ1·(xi(nn) − xi) (1)
1.2.4) repeat step 1.2.3) N times to synthesize N new samples: xinew, new ∈ {1, …, N}.
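The interpolation of equation (1) can be sketched as a SMOTE-style oversampler. Function name and parameters are illustrative; neighbor search is done by brute force for clarity:

```python
import numpy as np

def smote_oversample(minority, n_new, k=5, seed=0):
    """Sketch of the oversampling in claim 4 (steps 1.2.1-1.2.4).

    minority: (T, d) array of minority-class feature vectors.
    For each new sample: pick a random minority point xi, pick one of its
    k nearest minority neighbours x_nn, and interpolate per equation (1):
    x_new = xi + lambda * (x_nn - xi), lambda drawn uniformly at random.
    """
    rng = np.random.default_rng(seed)
    T = len(minority)
    k = min(k, T - 1)                       # cannot have more neighbours than samples
    out = []
    for _ in range(n_new):
        i = rng.integers(T)                 # step 1.2.1: random minority sample
        xi = minority[i]
        # step 1.2.2: k nearest minority neighbours of xi (excluding xi itself)
        d = np.linalg.norm(minority - xi, axis=1)
        neigh = np.argsort(d)[1:k + 1]
        nn = minority[rng.choice(neigh)]    # step 1.2.3: random neighbour
        lam = rng.random()                  # lambda in [0, 1), approximating (0, 1)
        out.append(xi + lam * (nn - xi))    # equation (1)
    return np.array(out)                    # step 1.2.4: N new samples
```

Each synthetic sample lies on the segment between a minority point and one of its minority-class neighbors, so the new points stay inside the minority region rather than being placed arbitrarily.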
CN202011088479.0A 2020-10-13 2020-10-13 Detection method for intrusion data Active CN112217822B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011088479.0A CN112217822B (en) 2020-10-13 2020-10-13 Detection method for intrusion data


Publications (2)

Publication Number Publication Date
CN112217822A true CN112217822A (en) 2021-01-12
CN112217822B CN112217822B (en) 2022-05-27

Family

ID=74053817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011088479.0A Active CN112217822B (en) 2020-10-13 2020-10-13 Detection method for intrusion data

Country Status (1)

Country Link
CN (1) CN112217822B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080114888A1 (en) * 2006-11-14 2008-05-15 Fmr Corp. Subscribing to Data Feeds on a Network
CN103716204A (en) * 2013-12-20 2014-04-09 中国科学院信息工程研究所 Abnormal intrusion detection ensemble learning method and apparatus based on Wiener process
WO2016172262A1 (en) * 2015-04-21 2016-10-27 Placemeter, Inc. Systems and methods for processing video data for activity monitoring
CN107220732A (en) * 2017-05-31 2017-09-29 福州大学 A kind of power failure complaint risk Forecasting Methodology based on gradient boosted tree
CN110674846A (en) * 2019-08-29 2020-01-10 南京理工大学 Genetic algorithm and k-means clustering-based unbalanced data set oversampling method
CN111626336A (en) * 2020-04-29 2020-09-04 南京理工大学 Subway fault data classification method based on unbalanced data set


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANH LE ET AL.: "On Optimizing Load Balancing of Intrusion Detection and Prevention Systems", 《IEEE》 *
WULING REN ET AL.: "Application of Network Intrusion Detection Based on Fuzzy C-Means Clustering Algorithm", 《ISIITA》 *
REN WULING ET AL.: "Network Defense Strategy Based on Attack Behavior Prediction", 《Journal of Zhejiang University (Engineering Science)》 *
ZHANG XINJIE ET AL.: "Research on an Intrusion Detection Method Based on Fisher-PCA and Deep Learning", 《JDAP》 *
GUO CHAOYOU: "An Improved SMOTE Algorithm Fusing Canopy and K-means for Imbalanced Data Sets", 《Science Technology and Engineering》 *

Also Published As

Publication number Publication date
CN112217822B (en) 2022-05-27

Similar Documents

Publication Publication Date Title
CN108874927B (en) Intrusion detection method based on hypergraph and random forest
CN106973038B (en) Network intrusion detection method based on genetic algorithm oversampling support vector machine
CN110135167B (en) Edge computing terminal security level evaluation method for random forest
CN109522926A (en) Method for detecting abnormality based on comentropy cluster
CN109729091A (en) A kind of LDoS attack detection method based on multiple features fusion and CNN algorithm
CN107579846B (en) Cloud computing fault data detection method and system
CN112039903B (en) Network security situation assessment method based on deep self-coding neural network model
CN112491796A (en) Intrusion detection and semantic decision tree quantitative interpretation method based on convolutional neural network
CN113516228B (en) Network anomaly detection method based on deep neural network
CN112950445B (en) Compensation-based detection feature selection method in image steganalysis
CN111507385B (en) Extensible network attack behavior classification method
CN110011976B (en) Network attack destruction capability quantitative evaluation method and system
CN112633337A (en) Unbalanced data processing method based on clustering and boundary points
CN111753299A (en) Unbalanced malicious software detection method based on packet integration
CN114399029A (en) Malicious traffic detection method based on GAN sample enhancement
CN111695597A (en) Credit fraud group recognition method and system based on improved isolated forest algorithm
CN115577357A (en) Android malicious software detection method based on stacking integration technology
CN109981672B (en) Multilayer intrusion detection method based on semi-supervised clustering
CN113420291B (en) Intrusion detection feature selection method based on weight integration
CN113098862A (en) Intrusion detection method based on combination of hybrid sampling and expansion convolution
CN113269200A (en) Unbalanced data oversampling method based on minority sample spatial distribution
CN117478390A (en) Network intrusion detection method based on improved density peak clustering algorithm
CN112217822B (en) Detection method for intrusion data
CN111914930A (en) Density peak value clustering method based on self-adaptive micro-cluster fusion
CN113792141A (en) Feature selection method based on covariance measurement factor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant