CN109842614B

CN109842614B - Network intrusion detection method based on data mining

Info

Publication number: CN109842614B
Application number: CN201811637319.XA
Authority: CN
Inventors: 王秋华; 欧阳潇琴; 詹佳程; 吕秋云
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2021-03-16
Anticipated expiration: 2038-12-29
Also published as: CN109842614A

Abstract

The invention relates to a network intrusion detection method based on data mining. In the prior art, the problems of low classification accuracy caused by the defect of sample weight updating, low classification speed caused by redundant weak classifiers, high calculation cost and the like exist. In the weak classifier training stage, the method adopts the Adaboost algorithm of the improved weight updating method to carry out weak classifier training, and updates the sample weight according to the weighted average correct rate of each sample in the previous t times of training, thereby inhibiting the infinite expansion of the noise sample weight and leading the weight updating of all samples to be more balanced. In the weak classifier combination stage, a new weak classifier similarity measurement mode is provided, selective integration is carried out based on the similarity measurement mode and a hierarchical clustering algorithm, the weak classifiers with similarity exceeding a threshold value are classified into one class, and the weak classifiers with highest classification accuracy in each class are combined into a strong classifier, so that redundant weak classifiers are removed, the classification speed is improved, and the calculation cost is reduced.

Description

Network intrusion detection method based on data mining

Technical Field

The invention belongs to the technical field of computers, particularly relates to the technical field of network security, and relates to a network intrusion detection method based on data mining.

Background

As an important component of an information security system, intrusion detection can collect information from a plurality of key points in a network system and analyze whether there are intrusion behaviors and signs in the network. Intrusion detection can be viewed as a data classification process that identifies normal operations and intrusion behavior from the collected information. Currently, the intrusion detection classification algorithm mainly includes a decision tree, a neural network or a support vector machine. However, the classifiers are single classifiers, which have insufficient generalization ability and low classification accuracy, so that an ensemble learning method is introduced. Ensemble learning is a learning method by constructing a plurality of weak classifiers (i.e., a single classifier) and combining them into one strong classifier. The integrated learning method fully utilizes the complementarity between single weak classifiers and effectively improves the generalization capability of the classifiers.

The Adaptive Boosting algorithm is the currently most practical integrated learning algorithm, and the essence of the Adaptive Boosting algorithm is to realize training of a weak classifier by changing sample distribution. Specifically, it updates the weight of each sample according to whether the classification of each sample in each training set is correct and the accuracy of the last overall classification. And (4) sending the training set with the modified weight value to a lower-layer classifier for training, and finally combining the classifiers obtained by each training into a strong classifier. Although the Adaboost algorithm improves the generalization capability of the classifier to some extent, it still has the following disadvantages.

Firstly, the weight updating mechanism of the algorithm is easy to cause unfair weight allocation and is easy to cause infinite increase of the noise sample weight. Many scholars improve the algorithm aiming at the defect, and the related documents are as follows:

1. zhang Zixiang et al propose an Adaboost improved algorithm based on noise detection in AdaBoost algorithm improvement based on sample noise detection, which determines a noise sample according to the difference between the noise sample in a misclassified sample and a common misclassified sample and reclassifies the noise sample, thereby improving the classification accuracy.

2. Li Wen Hui et al in Adaboost's training algorithm, limit the expansion of the weights of the target samples by adjusting the weighted error distribution, and output probability values instead of the traditional discrete values as the output results of the strong classifier.

3. And the dongfeng et al in 'application of Adaboost algorithm based on improvement in network intrusion detection' improves the weight according to the sample classification accuracy, and simultaneously inhibits the infinite increase of the noise sample weight.

Secondly, the training process of the weak classifiers has certain randomness, which is easy to cause the generation of redundant weak classifiers, and the weak classifiers can not improve the classification accuracy, but also increase the calculation cost and reduce the classification speed. The theory of "Man covered Be Better Than All" proposed in Ensembling Neural Networks, Man Coult Be beer All by Zhou Shihua et al proves that the strong classifiers formed by using less weak classifiers can achieve the same or even Better effect. Based on the theory, a selective integration method is proposed, which is to add a selection stage of a classifier on the basis of ensemble learning. In the stage, weak classifiers which have negative influence on the classification capability of the integrated classifier are removed through a certain strategy, and the remaining weak classifiers are combined into a strong classifier, so that the classification performance is further improved.

1. The Xiyuan Cheng et al selectively integrate by deleting the classifiers with poor performance in the weak classifiers in the 'deleting worst basis learner to level pruning Bagging integration'.

2. In a selective integrated human behavior recognition model based on difference clustering, Wangzhini et al calculates double-error difference increment values of weak classifiers, combines a neighbor propagation clustering algorithm to divide a plurality of weak classifiers into K clusters, and selects a central classifier of each cluster to combine into a strong classifier.

Disclosure of Invention

The invention aims to provide a network intrusion detection method based on data mining, which is a self-adaptive lifting method based on improved weight updating and selective integration, aiming at the problems of low classification accuracy caused by the defect of sample weight updating in the traditional Adaboost algorithm and low classification speed, high calculation cost and the like caused by a redundant weak classifier.

The invention firstly provides an Adaboost algorithm for improving a weight updating mode in a weak classifier training stage, wherein the Adaboost algorithm updates sample weights according to the average correct rate of each sample in the previous t times of training, so that the weights of all samples are updated more uniformly, and the infinite expansion of the weights of noise samples is inhibited to a certain extent; secondly, in the weak classifier combination stage, a new weak classifier similarity measurement mode is provided, selective integration is carried out based on the similarity measurement mode and a hierarchical clustering algorithm, redundant weak classifiers are removed, the classification speed is improved, and the calculation overhead is reduced.

In order to achieve the purpose, the technical scheme provided by the invention comprises the following steps:

step (1) using Adaboost algorithm of improved weight updating method to train weak classifier:

step (1.1) setting an initial training set as D { (x)₁,y₁),(x₂,y₂),...,(x_N,y_N) N is the total number of samples in the training set; initializing the weight of the training sample: the initial weight of each training sample is 1/N, and the initial weight vector is {1/N,1/N, …,1/N };

training T weak classifiers in the step (1.2), wherein the training mode of the T weak classifier is as follows, T is more than or equal to 1 and less than or equal to T:

step (1.2.1) randomly extracting N training samples from the initial training set D in a returning mode according to the sample weight value to serve as the t-th weak classifier h_tTraining set D of_t；

Step (1.2.2) according to training set D_tTraining to obtain weak classifier h_t；

Step (1.2.3) calculating h_tIs classified into a plurality of classes_tAnd a weight α_t；

Wherein I [ h ]_t(x_n)＝y_n]Whether the predicted value and the actual value of the nth sample of the tth classifier are equal or not is represented; if equal, the number is 1; if not, 0 is obtained; (x)_n,y_n) Represents the nth sample;

step (1.2.4) if ε_tIf < 0.5, retraining h_t(ii) a If epsilon_tNot less than 0.5, entering the next step;

and (1.2.5) updating the sample weight in the following way:

first, the probability E that the nth sample can be correctly classified under the combination of the first t weak classifiers is counted_t(n)：

Then, the weight W of the t +1 th time of the nth sample is calculated_t+1(n), the lower the classification accuracy of the previous t times is, the larger the weight promotion is:

wherein Z is_tIs a normalization factor that is a function of,

and (1.3) returning to the training stage to obtain T weak classifier sets H ═ H₁,h₂,…,h_T}。

Defining a new similarity measurement mode among classifiers:

step (2.1) is to provide

Subscript matrix representing training samples in T training sets, D_t＝[d_t1,d_t2,…,d_tN]For the t-th row of the matrix, the training sample set of the t-th weak classifier is represented, d_tnIs the subscript representation of the n training sample drawn by the t weak classifier, d_tn∈[1,N](ii) a E.g. d₂₄The 4 th training sample extracted by the 2 nd weak classifier is represented as (x) by 5₅,y₅)。

Step (2.2) defines two weak classifiers h_iAnd h_jThe similarity of the training sets between the training sets is Sim (i, j), which represents the training set D_iAnd D_jThe size of the intersection of (a) and (b) accounts for the proportion of the total number of samples N;

step (2.3) setting matrix

Represents the classification result of the T weak classifiers to the N samples, m_tnRepresenting the classification condition of the nth training sample by the t-th weak classifier,

1 indicates correct classification, 0 indicates wrong classification;

step (2.4) defines two weak classifiers h_iAnd h_jThe similarity of the classification results is Rim (i, j), i.e. the proportion of the number of samples classified into the same class by the two weak classifiers to the total number of samples N:

step (2.5) defines two classifiers h_iAnd h_jThe similarity between the weak classifiers is Tim (i, j) ═ Sim (i, j) + Rim (i, j), so as to obtain a similarity matrix between the T weak classifiers

And (3) combining weak classifiers based on the selective integration method of the measurement mode and the hierarchical clustering algorithm in the step (2), wherein the specific mode is as follows:

step (3.1) firstly, a similarity threshold value delta is set, if two weak classifiers h_iAnd h_jSimilarity between Tim [ i ]][j]If > delta, then h can be substituted_iAnd h_jDividing the data into the same class;

step (3.2) of dividing the T weak classifiers into a class respectively to obtain T initial classes { C }₁,C₂,…,C_T}，C₁To C_TRespectively representing first to Tth classes;

step (3.3) finding out two classes C with the maximum similarity_uAnd C_vIf it is similar to Cim (C)_u,C_v) If delta is larger than delta, then C is added_uAnd C_vMerge into one class, so the total number of classes is reduced by one; define any two classes C_aAnd C_bThe similarity between them is Cim (C)_a,C_b) And define C_aAny weak classifier and C_bThe minimum value of the similarity of any weak classifier is the similarity between classes: cim (C)_a，C_b)＝min{Tim[i][j]|h_i∈C_a，h_j∈C_b}；

Step (3.4) recalculating the similarity between the old class and the merged class according to the formula in step (3.3);

step (3.5) repeating step (3.3) and step (3.4) until the similarity between any two classes is less than or equal to delta;

step (3.6) finally obtaining K classes { C₁,C₂,...,C_KK is less than T, weak classifiers with highest classification accuracy are selected from each class to be combined into a strong classifier, and the decision function of the strong classifier after selective integration is used for realizing the classification

The invention has the following beneficial effects:

in the weak classifier training stage, the method of the invention provides an Adaboost algorithm for improving a weight updating mode, aiming at the defects that the Adaboost algorithm only depends on the previous classification condition to determine that the weight change of a sample is too large, and the infinite amplification of the noise sample weight is easily caused. The updating method is improved to update the sample weight according to the weighted average correct rate of each sample in the previous t times of training, all samples are improved in weight on the basis of the previous t times of training, and the higher the classification correct rate of the previous t times is, the smaller the weight improvement is; and finally, normalizing the lifted weight value, so that the infinite expansion of the noise sample weight value is inhibited to a certain extent, and the weight values of all samples are updated more uniformly. In the weak classifier combination stage, aiming at the problems of low classification speed, high calculation cost and the like caused by the redundancy of weak classifiers, a new similarity measurement mode among the weak classifiers is provided, selective integration is carried out based on the similarity measurement mode and a hierarchical clustering algorithm, the weak classifiers with similarity exceeding a threshold value are classified into one class, the weak classifiers with the highest classification accuracy in each class are taken to be combined into a strong classifier, so that the redundant weak classifiers are removed, the classification speed is improved, and the calculation cost is reduced.

Drawings

Fig. 1 is a schematic diagram of the framework of the present invention.

Detailed Description

The invention will be further explained with reference to the drawings.

Referring to fig. 1, a network intrusion detection method based on data mining includes the following specific steps:

step (1.1) setting an initial training set as D { (x)₁,y₁),(x₂,y₂),...,(x_N,y_N) N is the total number of samples in the training set; initializing the weight of the training sample: the initial weight of each training sample is 1/N, and the initial weight vector is {1/N,1/N, …,1/N }.

and (1.2.5) updating the sample weight in the following way:

wherein Z is_tIs a normalization factor that is a function of,

Defining a new similarity measurement mode among classifiers:

step (2.1) is to provide

step (2.3) is to provideMatrix array

1 indicates correct classification, 0 indicates wrong classification;

step (3.3) finding out two classes C with the maximum similarity_uAnd C_vIf it is similar to Cim (C)_u,C_v) If delta is larger than delta, then C is added_uAnd C_vAre combined into oneA number of classes, the total number of classes is then reduced by one; define any two classes C_aAnd C_bThe similarity between them is Cim (C)_a,C_b) And define C_aAny weak classifier and C_bThe minimum value of the similarity of any weak classifier is the similarity between classes: cim (C)_a，C_b)＝min{Tim[i][j]|h_i∈C_a，h_j∈C_b}；

In summary, in order to improve the classification accuracy and efficiency of the Adaboost algorithm, the method firstly provides the Adaboost algorithm for improving the sample weight updating mode, the improved weight updating mode updates the sample weights according to the weighted average correct rate of each sample in the previous t times of training, all samples promote the weights on the basis of the previous t times of training, and the higher the classification correct rate of the previous t times is, the smaller the weight promotion is; and finally, the weight values after being promoted are normalized, so that the infinite expansion of the noise sample weight values is restrained to a certain extent, the weight values of all samples are updated more uniformly, and the classification accuracy is improved to a certain extent. Secondly, screening weak classifiers by using a selective integration method based on hierarchical clustering and similarity, eliminating redundant weak classifiers, finally obtaining a weak classifier subset, and combining the weak classifiers into a strong classifier, thereby solving the problems of low classification speed, high calculation cost and the like caused by the redundancy of the weak classifiers to a certain extent. Compared with other ensemble learning methods, the method provided by the invention not only improves the classification speed, but also ensures the same or even higher classification accuracy.

The foregoing illustrates and describes the basic principles, implementations, and features of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments, and that various changes and modifications may be made to the present invention within the scope of the present invention.

Claims

1. The network intrusion detection method based on data mining is characterized by comprising the following specific steps:

step (1.1) setting an initial training set as D { (x)₁,y₁),(x₂,y₂),...,(x_N,y_N) N is the total number of training samples in the training set; initializing the weight of the training sample: the initial weight of each training sample is 1/N, and the initial weight vector is {1/N,1/N, …,1/N };

step (1.2.1) randomly extracting N training samples from the initial training set D in a back-to-back mode according to the weights of the training samples to serve as the t-th weak classifier h_tTraining set D of_t；

Wherein I [ h ]_t(x_n)＝y_n]Whether the predicted value and the actual value of the nth training sample of the tth weak classifier are equal is represented; if equal, the number is 1; if not, 0 is obtained; (x)_n,y_n) Representing the nth training sample;

and (1.2.5) updating the weight of the training sample in the following way:

firstly, the probability E that the nth training sample can be correctly classified under the combination of the first t weak classifiers is counted_t(n)：

Then, the weight W of the (t + 1) th training sample is calculated_t+1(n), the lower the classification accuracy of the previous t times is, the larger the weight promotion is:

wherein Z is_tIs a normalization factor that is a function of,

and (1.3) returning to the training stage to obtain T weak classifier sets H ═ H₁,h₂,…,h_T}；

Defining a new similarity measurement mode among classifiers:

step (2.1) is to provide

Subscript matrix representing training samples in T training sets, D_t＝[d_t1,d_t2,…,d_tN]For the t-th row of the matrix, the training sample set of the t-th weak classifier is represented, d_tnIs the subscript representation of the n training sample drawn by the t weak classifier, d_tn∈[1,N]；

Step (2.2) defines two weak classifiers h_iAnd h_jThe similarity of the training set is Sim (i, j), tableTraining set D_iAnd D_jThe size of the intersection of (a) and (b) accounts for the proportion of the total training sample number N;

step (2.3) setting matrix

Represents the classification result of the T weak classifiers to the N training samples, m_tnRepresenting the classification condition of the nth training sample by the t-th weak classifier,

1 indicates correct classification, 0 indicates wrong classification;

step (2.4) defines two weak classifiers h_iAnd h_jThe similarity of the classification results is Rim (i, j), i.e. the proportion of the number of training samples classified into the same class by two weak classifiers to the total number of training samples N:

step (2.5) defines two weak classifiers h_iAnd h_jThe similarity between the weak classifiers is Tim (i, j) ═ Sim (i, j) + Rim (i, j), so as to obtain a similarity matrix between the T weak classifiers

step (3.1) firstly, a similarity threshold value delta is set, if two weak classifiers h_iAnd h_jThe similarity between Tim (i, j) > delta, h can be calculated_iAnd h_jDividing the data into the same class;

step (3.3) finding out two classes C with the maximum similarity_uAnd C_vIf it is similar to Cim (C)_u,C_v) If delta is larger than delta, then C is added_uAnd C_vMerge into one class, so the total number of classes is reduced by one; define any two classes C_aAnd C_bThe similarity between them is Cim (C)_a,C_b) And define C_aAny weak classifier and C_bThe minimum value of the similarity of any weak classifier is the similarity between classes: cim (C)_a,C_b)＝min{Tim(i,j)|h_i∈C_a,h_j∈C_b}；