CN111683048B

CN111683048B - Intrusion detection system based on multicycle model stacking

Info

Publication number: CN111683048B
Application number: CN202010372115.9A
Authority: CN
Inventors: 徐金铭; 池灏; 金韬
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2020-05-06
Filing date: 2020-05-06
Publication date: 2021-05-07
Anticipated expiration: 2040-05-06
Also published as: CN111683048A

Abstract

The invention discloses an intrusion detection system based on multi-cycle model stacking. As the system runs, it gradually accumulates more training data. Through an improved stacking method, the system integrates the models trained in historical cycles with the model trained on the current data, so that historical data from multiple cycles is used more effectively. This addresses the information loss that arises when insufficient device storage prevents all training data from being cached: even after historical data is discarded, the information it contained is still reflected in the latest training through the historically trained models, thereby improving the detection performance of the intrusion detection system.

Description

Intrusion detection system based on multicycle model stacking

Technical Field

The invention belongs to the technical field of network security and relates to intrusion detection systems (IDS), in particular to an intrusion detection system based on multi-cycle model stacking.

Background

Networks have become an indispensable part of daily life, and network security is receiving more and more attention. Deploying intrusion detection systems in a network environment has become an important means of guarding against network security risks. The first step of intrusion detection is to identify abnormal traffic by extracting features from the traffic under inspection and examining those features; classifying the type of abnormal traffic then helps with its subsequent handling.

Existing methods rely on two kinds of technology. The first is matching based on attack signatures: the traffic under inspection is matched against known characteristics of network attacks to identify abnormal traffic. This method, however, can only detect known types of intrusion and cannot detect unknown attack types. The second is machine learning: features are extracted from network traffic sessions, the features to be used are selected, a machine learning model is trained on the extracted data to obtain a prediction model, and the model judges whether each network traffic session is normal or abnormal. Sessions judged abnormal can be examined further and assigned to an anomaly category. This method can detect unknown attack types, but conventional applications of machine learning to intrusion detection consider only the information in the training data currently held. As the intrusion detection system runs, the training data it holds gradually grows, and because storage space is limited, newly obtained data cannot be cached indefinitely. Part of the older data must therefore be discarded, the model is retrained on the new data and redeployed, and the information in the old data set can no longer be used.

Disclosure of Invention

The invention aims to provide an intrusion detection system based on multi-cycle model stacking for detecting and identifying network intrusions. The system can effectively use historical data from multiple cycles and avoids the loss of old information that conventional methods suffer once old data is discarded.

The technical scheme adopted by the invention is as follows:

An intrusion detection system based on multi-cycle model stacking, comprising the following:

Feature extraction from the raw traffic data: first, a packet capture tool such as Wireshark is used to capture the raw network traffic as pcap-format packets, and a packet-splitting tool then segments the packets into sessions. The data to be detected is collected on this basis. Feature extraction mainly gathers each session's header information together with statistical features of its timing, traffic volume, and packet count. The relevant statistical features are extracted from the sessions produced by the segmentation, and a feature vector is finally generated for each session for detection.
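By way of a non-limiting illustration, the segmentation of captured packets into sessions might be sketched as follows. The `Packet` record, its field names, and the direction-independent 5-tuple key are assumptions made for this sketch; the description does not name the splitting tool or its session rule, and real tools may additionally split on TCP flags or idle timeouts.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical parsed packet record (not a format defined by the source)."""
    ts: float     # capture timestamp in seconds
    src: str      # source IP address
    sport: int    # source port
    dst: str      # destination IP address
    dport: int    # destination port
    proto: str    # transport protocol, e.g. "TCP"
    length: int   # packet length in bytes


def session_key(pkt):
    """Direction-independent 5-tuple: both directions map to the same session."""
    a, b = (pkt.src, pkt.sport), (pkt.dst, pkt.dport)
    return (pkt.proto,) + (a + b if a <= b else b + a)


def split_sessions(packets):
    """Group packets into sessions, keeping each session in time order."""
    sessions = defaultdict(list)
    for pkt in sorted(packets, key=lambda p: p.ts):
        sessions[session_key(pkt)].append(pkt)
    return dict(sessions)


packets = [
    Packet(0.00, "10.0.0.1", 5000, "10.0.0.2", 80, "TCP", 60),
    Packet(0.05, "10.0.0.2", 80, "10.0.0.1", 5000, "TCP", 1500),
    Packet(0.10, "10.0.0.3", 6000, "10.0.0.2", 53, "UDP", 80),
]
sessions = split_sessions(packets)
```

In this toy capture the first two packets travel in opposite directions between the same endpoints and therefore fall into one session, while the UDP packet forms a second session.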

Improved stacking ensemble of the historically trained models using an improved multi-cycle stacking model, as follows:

A base learner and a meta learner are selected. The total number N of base learners to be combined is set first, of which N-1 are historical models. When the system is first deployed, fewer than N-1 historical models are available; in that case the improved stacking ensemble is simply built from the models currently on hand, which does not affect the overall flow of the system. To prevent overfitting, the number cv of cross-validation folds into which stacking splits the training data set must also be set. When the intrusion detection system moves from the current cycle to the next, it is updated with the improved stacking ensemble, as follows:

Step 1: first check the number T of base classifiers currently in the system. If T = 0, the system is in its initial deployment cycle: the base learner only needs to be trained on all current training data and then deployed, and new training data keeps accumulating as the cycle runs. When the current cycle ends, part of the older data is deleted, the next cycle begins, and the procedure returns to step 1. If T > 0, go to step 3 when T = N and to step 2 otherwise.

Step 2: when 0 < T < N, the improved stacking ensemble algorithm is used. Assume the currently owned data set D is an n × m matrix, where n is the number of samples in the data set and m is the number of features in the extracted feature vector. Then:

1) Divide the currently owned data set D into cv subsets, denoted D_1, D_2, ..., D_cv.

2) For each subset, e.g. the i-th subset D_i, train a base learner on the remaining data D - D_i, then use that model to predict on D_i, which took no part in the training, obtaining the prediction result F_i. The result is either a predicted probability or a predicted class.

3) Concatenate F_i with D_i along the feature dimension to obtain a new data matrix [D_i, F_i]. The entire data set is thereby transformed into [D, F], where F collects F_1, ..., F_cv in the row order of D.

4) The historical-cycle models are handled similarly. Because these models have already been trained and retain the information of the historical data, no cross-validation split and merge is needed: each historical model predicts directly on the data set D to give its prediction result, and these results are appended to the transformed data set, finally giving the new transformed data set, which consists of D, F, and the predictions of every historical model.

5) Train the meta learner on the resulting new data set, then train a base learner on the data set D to obtain the base classifier of the current cycle, which serves as a historical model in the next cycle.

6) Combine all the models according to the stacking ensemble framework to obtain the stacked model.

Return to step 1 when the cycle ends.

Step 3: here the number T of existing base classifiers equals N. The overall update procedure is the same as in step 2, except that in item 4) one historical model must be discarded so that the total number of base classifiers remains N. Return to step 1 when the cycle ends.
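As a non-limiting sketch of the data transformation in step 2, the following shows how each D_i could receive out-of-fold predictions F_i while the historical models predict directly on D. The toy nearest-centroid learner is only a stand-in for the base learner so that the sketch runs with the standard library; the interleaved fold assignment and all function names are assumptions of this sketch, not part of the disclosure.

```python
def train_centroid(X, y):
    """Toy base learner (stand-in only): predicts the class of the nearest centroid."""
    cents = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(rows):
        def dist(x, c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return [min(cents, key=lambda k: dist(x, cents[k])) for x in rows]

    return predict


def augment(X, y, cv, history, train=train_centroid):
    """Extend every row of D with its out-of-fold prediction and with the
    prediction of each historical model."""
    n = len(X)
    oof = [None] * n
    for i in range(cv):
        held = list(range(i, n, cv))                  # indices of fold D_i (assumed split)
        held_set = set(held)
        rest = [j for j in range(n) if j not in held_set]   # D - D_i
        model = train([X[j] for j in rest], [y[j] for j in rest])
        for j, f in zip(held, model([X[j] for j in held])):
            oof[j] = f                                # F_i, stored in sample order
    hist = [h(X) for h in history]                    # historical models predict D directly
    return [list(x) + [oof[j]] + [p[j] for p in hist] for j, x in enumerate(X)]


# Toy data: one feature, two classes; one historical model from an earlier cycle.
old_model = train_centroid([[0.0], [1.0]], [0, 1])
X = [[0.0], [0.1], [0.2], [0.3], [1.0], [1.1], [1.2], [1.3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
D_new = augment(X, y, cv=2, history=[old_model])
current_model = train_centroid(X, y)   # would serve as a historical model next cycle
```

Each row of `D_new` has m + 1 + T columns: the original m features, one out-of-fold prediction, and one prediction per historical model. The meta learner would then be trained on `D_new` with the labels `y`.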

In the above technical solution, the LightGBM model may be chosen as the base learner because of its good prediction performance, and the meta learner may be a simpler Logistic Regression model or another, more complex model.
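Whichever base and meta learners are chosen, prediction with the combined model could plausibly proceed as in the sketch below: every base model scores the feature vector, the scores are appended to the features, and the meta learner classifies the extended vector. The threshold base models and the fixed logistic weights are hypothetical placeholders chosen only to make the sketch runnable; they are not values from the disclosure.

```python
import math


def stacked_predict(x, base_models, meta_model):
    """Classify one feature vector with a stacked ensemble.

    base_models: callables mapping a feature vector to a score
    meta_model:  callable mapping the extended vector to a label
    """
    scores = [model(x) for model in base_models]   # one extra column per base model
    return meta_model(list(x) + scores)


def make_logistic(weights, bias):
    """Hypothetical logistic meta learner with fixed, illustrative weights."""
    def predict(v):
        z = bias + sum(w * a for w, a in zip(weights, v))
        return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0
    return predict


# Placeholder base models: the current-cycle model and one historical model.
base_models = [
    lambda x: 1.0 if x[0] > 0.5 else 0.0,
    lambda x: 1.0 if x[0] > 0.7 else 0.0,
]
meta = make_logistic(weights=[0.0, 2.0, 2.0], bias=-2.0)

abnormal = stacked_predict([0.9], base_models, meta)
normal = stacked_predict([0.1], base_models, meta)
```

The order of the appended score columns at prediction time must match the column order used when the meta learner was trained.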

The invention has the beneficial effects that:

In the intrusion detection system based on multi-cycle model stacking, more training data gradually accumulates as the system runs. Through an improved stacking method, the system integrates the models trained in historical cycles with the model trained on the current data, so that historical data from multiple cycles is used more effectively. This addresses the information loss that arises when insufficient device storage prevents all training data from being cached: even after historical data is discarded, the information it contained is still reflected in the latest training through the historically trained models, so the detection performance of the intrusion detection system can be improved.

Drawings

FIG. 1 is a block schematic diagram of a detection system of the present invention.

Fig. 2 is a flow diagram of multi-cycle stacking integration.

Detailed Description

The present invention is further illustrated by the following examples, which are not intended to be limiting.

Data and models are the most important elements in the actual deployment of a machine-learning-based intrusion detection system. As the system operates, a large amount of data usable for training the machine learning model gradually accumulates through a feedback mechanism. Because device storage is limited, however, the device cannot cache all of the acquired data as its volume grows, and part of the older data must be deleted from the device. The multi-cycle model stacking intrusion detection system provided by the invention lets the intrusion detection system use historically trained models: the model is initially trained on the original data and, as the data is continually updated, retrained on the updated data in each update cycle. In this way the system can still use the information contained in the discarded data, which alleviates the information loss caused by device storage limits and strengthens the generalization ability of the model.

The invention specifically comprises the following steps:

1. Data feature extraction

Detection as a whole requires the features of the raw traffic data to be extracted first and then preprocessed.

Therefore, a packet capture tool such as Wireshark is first used to capture the raw network traffic as pcap-format packets, which can then be segmented into sessions using a packet-splitting tool. The data to be detected is collected on this basis. Feature extraction mainly gathers each session's header information together with statistical features of its timing, traffic volume, and packet count. The relevant statistical features are extracted from the sessions produced by the segmentation, and a feature vector is finally generated for each session for detection.
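As an illustration only, the statistical part of a session's feature vector might be computed as in the sketch below. The particular five statistics are assumptions of this sketch, since the description names only the categories (time, traffic volume, packet count) and does not enumerate the features; header fields such as ports or protocol would be appended in the same way.

```python
def session_features(packets):
    """Build a feature vector from a time-ordered list of (timestamp, length) pairs."""
    times = [t for t, _ in packets]
    sizes = [n for _, n in packets]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return [
        times[-1] - times[0],                     # session duration in seconds
        float(len(packets)),                      # packet count
        float(sum(sizes)),                        # total bytes
        sum(sizes) / len(sizes),                  # mean packet size
        sum(gaps) / len(gaps) if gaps else 0.0,   # mean inter-arrival time
    ]


vec = session_features([(0.0, 60), (0.5, 1500), (1.0, 40)])
```

One such vector per session forms one row of the n × m data set used for training and detection.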

2. Training and updating of multi-cycle stacking model

Once the feature vectors to be detected have been obtained, a suitable machine learning model is selected and trained in order to detect abnormal traffic, and the trained model is then used to make predictions on the extracted feature vectors. As the system keeps running, the feedback mechanism continually generates data; since the data cannot all be stored indefinitely, part of it must be deleted periodically, which causes information loss.

To make full use of the data, an improved multi-cycle stacking model method is proposed, which addresses the problem caused by data deletion by building an improved stacking ensemble over the historically trained models. The specific scheme is as follows:

First, a base learner and a meta learner are selected. Because the LightGBM model has good prediction performance, it can be chosen as the base learner, and the meta learner can be a simpler Logistic Regression model or another, more complex model. The total number N of base learners to be combined is set first, of which N-1 are historical models. When the system is first deployed, fewer than N-1 historical models are available; in that case the improved stacking ensemble is simply built from the models currently on hand, which does not affect the flow of the overall system. To prevent overfitting, the number cv of cross-validation folds into which stacking splits the training data set must also be set. The following mainly describes how the intrusion detection system is updated when it moves from the current cycle to the next. The steps are as follows:

Step 1: first check the number T of base learners currently in the system. If T = 0, the system is in its initial deployment cycle: the LightGBM model only needs to be trained on all current training data and then deployed, and new training data keeps accumulating as the current cycle runs. When the current cycle ends, part of the older data is deleted, the next cycle begins, and the procedure returns to step 1. If T > 0, go to step 3 when T = N and to step 2 otherwise.

Step 2: when 0 < T < N, the improved stacking ensemble algorithm is used. Assume the currently owned data set D is an n × m matrix, where n is the number of samples in the data set and m is the number of features in the extracted feature vector.

1) First, the currently owned data set D is divided into cv subsets, denoted D_1, D_2, ..., D_cv.

2) For each subset, e.g. the i-th subset D_i, a LightGBM base classifier is trained on the remaining data D - D_i and then used to predict on D_i, which took no part in the training, obtaining the prediction result F_i. The result is either a predicted probability or a predicted class.

3) F_i is then concatenated with D_i along the feature dimension to obtain a new data matrix [D_i, F_i]. The entire data set is thereby transformed into [D, F], where F collects F_1, ..., F_cv in the row order of D.

4) The historical-cycle models are handled similarly. Because these models have already been trained and retain the information of the historical data, no cross-validation split and merge is needed: each historical model predicts directly on the data set D, and its prediction results are appended to the transformed data set to give the new data set.

5) The meta learner is trained on the new data set, and the LightGBM model is then trained on the data set D to obtain the base classifier of the current cycle, which will serve as a historical model in the next cycle.

6) All the models are combined according to the structure shown in the stacking ensemble framework diagram to obtain the stacked model for deployment.

Return to step 1 when the cycle ends.

Step 3: here the number T of existing base classifiers equals N. The overall update procedure is the same as in step 2, except that in item 4) one historical model must be discarded so that the total number of base classifiers remains N. Return to step 1 when the cycle ends.
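The branching among steps 1 to 3 can be sketched, purely as an illustration, as a single end-of-cycle update routine. The function and parameter names are hypothetical, and discarding the oldest historical model is an assumption of this sketch: the description requires only that one historical model be discarded and does not say which.

```python
from collections import deque


def update_cycle(history, n_total, data, train_base, build_stack):
    """Run one end-of-cycle update and return the model to deploy.

    history:     deque of base classifiers from earlier cycles (T = len(history))
    n_total:     N, the total number of base learners in the ensemble
    train_base:  callable(data) -> base classifier
    build_stack: callable(data, historical_models, current) -> stacked model
    """
    current = train_base(data)
    if len(history) == 0:
        deployed = current                 # step 1: first cycle, nothing to stack
    else:
        if len(history) == n_total:
            history.popleft()              # step 3: drop one model (oldest, by assumption)
        deployed = build_stack(data, list(history), current)   # step 2
    history.append(current)                # current classifier becomes a historical model
    return deployed


# Toy run with N = 3; the callables only record what would be built.
history = deque()
sizes = []
for cycle in range(5):
    model = update_cycle(
        history, 3, f"data-{cycle}",
        train_base=lambda d: ("base", d),
        build_stack=lambda d, hist, cur: ("stack", len(hist) + 1),
    )
    sizes.append(1 if model[0] == "base" else model[1])
```

In this toy run the deployed ensemble grows from one base classifier to N = 3 and then stays at N, which matches the intent of step 3.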

Claims

1. An intrusion detection system based on multi-cycle model stacking, characterized by comprising the following:

extracting features from the raw traffic data, namely first capturing the raw network traffic as pcap-format packets using a packet capture tool, then segmenting the packets into a plurality of sessions using a packet-splitting tool, collecting the data to be detected on the basis of the sessions, and finally generating, for each session, a feature vector for detection;

selecting a base learner and a meta-learner, first setting the total number N of base learners to be combined, of which N-1 are historical models, and, in order to prevent overfitting, also setting the number cv of cross-validation folds into which stacking splits the training data set; when the intrusion detection system moves from the current cycle to the next cycle, the system is updated with the improved stacking ensemble, specifically as follows:

1) divide the currently owned data set D into cv subsets, denoted D_1, D_2, ..., D_cv;

The entire data set is then transformed into

5) Using the resulting new data set

returning to the step 1 after the period is ended;

2. The system of claim 1, wherein the base learner uses a lightGBM model.

3. The system of claim 1, wherein the meta-learner employs a Logistic Regression model.