CN111683048B - Intrusion detection system based on multicycle model stacking - Google Patents

Publication number
CN111683048B
CN111683048B CN202010372115.9A CN202010372115A CN111683048B CN 111683048 B CN111683048 B CN 111683048B CN 202010372115 A CN202010372115 A CN 202010372115A CN 111683048 B CN111683048 B CN 111683048B
Authority
CN
China
Prior art keywords
data
training
model
period
historical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010372115.9A
Other languages
Chinese (zh)
Other versions
CN111683048A (en)
Inventor
徐金铭
池灏
金韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202010372115.9A
Publication of CN111683048A
Application granted
Publication of CN111683048B
Legal status: Active (current)
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416: Event detection, e.g. attack signature detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425: Traffic logging, e.g. anomaly detection

Abstract

The invention discloses an intrusion detection system based on multi-period model stacking. As the system runs, it gradually accumulates more training data. Using an improved stacking method, the system integrates the models trained in historical periods with the model trained on the current data, so that historical data from multiple periods is used more effectively. This addresses the information loss that occurs when a device lacks the storage space to cache all training data: even after historical data is discarded, the information it contained is still reflected in the latest training through the historically trained models, which improves the detection performance of the intrusion detection system.

Description

Intrusion detection system based on multicycle model stacking
Technical Field
The invention belongs to the technical field of network security and relates to intrusion detection systems (IDS), in particular to an intrusion detection system based on multi-cycle model stacking.
Background
Networks have become an indispensable part of daily life, and network security is attracting growing attention. Deploying an intrusion detection system in a network environment has become an important means of guarding against network security risks. The first step of intrusion detection is to identify abnormal traffic by extracting features from the traffic under inspection and examining them; classifying the type of abnormal traffic then helps with its subsequent handling.
Existing methods rely on two kinds of technology. The first matches on attack signatures: the traffic under inspection is compared against the known characteristics of network attacks to identify abnormal traffic. This approach can detect only known attack types and cannot detect unknown ones. The second is machine learning: features are extracted from network traffic sessions, the features to use are selected, a machine learning model is trained on the extracted data to obtain a prediction model, and the model judges whether each session is normal or abnormal. Sessions judged abnormal can be examined further and assigned to an anomaly category. This approach can detect unknown attack types, but conventional applications of machine learning to intrusion detection consider only the information in the training data currently held. As the intrusion detection system runs, the training data it holds grows steadily; because storage space is limited, newly obtained data cannot be cached indefinitely, so part of the older data must be discarded. The model is then retrained on the new data and redeployed, and the information in the old data set can no longer be used.
Disclosure of Invention
The invention aims to provide an intrusion detection system based on multi-cycle model stacking for detecting and identifying network intrusions. The system makes effective use of historical data from multiple periods and avoids the problem in conventional methods that old information is lost and can no longer be used.
The technical scheme adopted by the invention is as follows:
An intrusion detection system based on multi-cycle model stacking, comprising the following:
Extracting features from raw traffic data: first, a packet capture tool such as Wireshark is used to capture the raw network traffic as pcap-format packets, and a packet splitting tool then segments the packets into multiple sessions. The data to be detected is collected on this basis. Feature extraction mainly gathers the header information of each session and statistical features of its time, traffic volume, and message count. The relevant statistical features are extracted from the sessions produced by segmentation, and a feature vector is finally generated for each session for detection.
Performing improved stacking integration of the historically trained models using an improved multi-period stacking model, as follows:
A base learner and a meta learner are selected. The total number N of base learners to be integrated is set first, of which N-1 are historical models. When the system is first deployed, fewer than N-1 historical models are available; in that case the improved stacking integration is simply performed with the models currently held, which does not affect the overall flow of the system. To prevent overfitting, the number cv of folds into which stacking splits the training data set for cross-validation must also be set. When the intrusion detection system moves from the current period to the next, it is updated by improved stacking integration, specifically as follows:
Step 1: first check the number T of base classifiers currently in the system. If T = 0, the system is in its initial deployment period: the base learner is trained on all current training data and then deployed, and new training data accumulates as the period continues; when the period ends, part of the older data is deleted, the next period begins, and the procedure returns to Step 1. If T > 0, go to Step 3 if T = N, otherwise go to Step 2;
Step 2: when 0 < T < N, the improved stacking integration algorithm is used. Let the currently held data set D be an n × m matrix, where n is the number of samples in the data set and m is the number of features in the extracted feature vector. Then:
1) Divide the currently held data set D into cv parts, denoted
$D = \{D_1, D_2, \ldots, D_{cv}\}$
2) For each part, e.g. the i-th part $D_i$, train a base learner on the remaining data $D - D_i$, then use that model to predict on $D_i$, which took no part in training, to obtain the prediction result $F_i$; the result is either a predicted probability or a predicted class;
3) Concatenate the result with the data $D_i$ along the feature dimension to obtain a new data matrix
$D_i' = [D_i, F_i]$
The entire data set is then transformed into
$D' = \{D_1', D_2', \ldots, D_{cv}'\}$
4) The historical-period models are handled similarly. Because they are already trained and retain the information of the historical data, no cross-validation split and merge is needed: each historical model directly predicts on the data set D to obtain its prediction result $P_j$ (j = 1, ..., T), and these results are concatenated with the transformed data set to give the final transformed data set
$D'' = [D', P_1, \ldots, P_T]$
5) Use the resulting new data set
$D''$
to train the meta learner, then train a base learner on the data set D to obtain the base classifier of the current period, which serves as a historical model in the next period;
6) Combine all the models according to the stacking ensemble framework to obtain the stacked model;
Return to Step 1 after the period ends;
Step 3: here the number T of existing base classifiers equals N. The overall update is the same as in Step 2, except that in sub-step 4) one historical model must be discarded so that the total number of base classifiers remains N. Return to Step 1 after the period ends.
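The way Steps 1 to 3 branch on T, and the way the pool of base classifiers stays capped at N, can be sketched as follows. This is only an illustration under stated assumptions: the function names `plan_update` and `next_model_pool` are hypothetical, models are represented by plain strings, and discarding the oldest historical model is an assumed policy (the text says only that one historical model is discarded when T = N).

```python
from typing import List


def plan_update(num_models: int, n_total: int) -> str:
    """Map the current number of base classifiers T to the step to run."""
    if num_models == 0:
        return "train_base_only"      # Step 1: initial deployment period
    if num_models < n_total:
        return "stack"                # Step 2: 0 < T < N
    return "stack_and_discard"        # Step 3: T = N


def next_model_pool(models: List[str], new_model: str, n_total: int) -> List[str]:
    """Add the current period's base classifier while keeping at most N models.

    Assumption: when the pool is full, the oldest historical model is dropped.
    """
    kept = models[1:] if len(models) >= n_total else list(models)
    return kept + [new_model]


# Simulate five periods with N = 3 (model training itself is not shown).
pool: List[str] = []
steps: List[str] = []
for period in range(5):
    steps.append(plan_update(len(pool), n_total=3))
    pool = next_model_pool(pool, f"model_p{period}", n_total=3)

print(steps)
print(pool)
```

With N = 3, the first period trains only a base learner, the next two periods stack without discarding, and every later period discards one historical model before adding the new one.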
In the above technical solution, the LightGBM model may be selected as the base learner because of its good prediction performance, and the meta learner may be a simpler Logistic Regression model or another, more complex model.
The invention has the beneficial effects that:
In the intrusion detection system based on multi-period model stacking, more training data gradually accumulates as the system runs. Using an improved stacking method, the system integrates the models trained in historical periods with the model trained on the current data, so that historical data from multiple periods is used more effectively. This addresses the information loss that occurs when a device lacks the storage space to cache all training data: even after historical data is discarded, the information it contained is still reflected in the latest training through the historically trained models, which improves the detection performance of the intrusion detection system.
Drawings
FIG. 1 is a block schematic diagram of a detection system of the present invention.
Fig. 2 is a flow diagram of multi-cycle stacking integration.
Detailed Description
The present invention is further illustrated by the following examples, which are not intended to be limiting.
Data and models are the most important elements in the practical deployment of a machine-learning-based intrusion detection system. As the system operates, a large amount of data usable for training the machine learning model gradually accumulates through a feedback mechanism. However, device storage is limited: as the amount of acquired data grows, a device cannot cache all of it, and part of the older data must be deleted. The multi-period model stacking intrusion detection system provided by the invention lets the intrusion detection system use historically trained models. A model is first trained on the original data, and as the data is updated a model is retrained on the updated data in each period; because the models from earlier periods are retained, the system can still use the information contained in discarded data. This alleviates the information loss caused by device storage limits and strengthens the generalization ability of the model.
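The storage constraint described above can be pictured with a fixed-capacity cache that drops the oldest samples. This is a minimal sketch, not part of the patent: the class name `TrainingCache` is hypothetical, and evicting continuously once capacity is reached is an assumption (the description deletes older data at the end of each period).

```python
from collections import deque
from typing import Deque, List, Tuple

Sample = Tuple[List[float], int]   # (feature vector, label)


class TrainingCache:
    """Fixed-capacity store; appending beyond capacity drops the oldest sample."""

    def __init__(self, capacity: int) -> None:
        self._samples: Deque[Sample] = deque(maxlen=capacity)

    def add(self, features: List[float], label: int) -> None:
        self._samples.append((features, label))

    def snapshot(self) -> List[Sample]:
        """Return the samples currently available for training, oldest first."""
        return list(self._samples)


cache = TrainingCache(capacity=3)
for i in range(5):
    cache.add([float(i)], i % 2)

kept = cache.snapshot()
print(kept)
```

Only the three most recent samples remain; the information in the two dropped samples is available afterwards only through models that were trained while those samples were still cached.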
The invention specifically comprises the following steps:
1. Data feature extraction
The raw traffic data must undergo feature extraction and preprocessing before detection can be carried out.
First, a packet capture tool such as Wireshark is used to capture the raw network traffic as pcap-format packets, and a packet splitting tool then segments the packets into multiple sessions. The data to be detected is collected on this basis. Feature extraction mainly gathers the header information of each session and statistical features of its time, traffic volume, and message count. The relevant statistical features are extracted from the sessions produced by segmentation, and a feature vector is finally generated for each session for detection.
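The per-session statistics can be illustrated with a small sketch. It assumes the packets have already been parsed out of the pcap file into simple records; the `Packet` fields, the bidirectional 5-tuple session key, and the four chosen statistics (duration, packet count, total bytes, mean packet size) are illustrative assumptions, not the feature set defined by the patent.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    ts: float      # capture timestamp in seconds
    src: str
    sport: int
    dst: str
    dport: int
    proto: str
    length: int    # packet size in bytes


def session_key(p: Packet) -> Tuple:
    """Direction-independent key so both directions land in one session."""
    a, b = (p.src, p.sport), (p.dst, p.dport)
    return (p.proto,) + (a + b if a <= b else b + a)


def session_features(packets: List[Packet]) -> Dict[Tuple, List[float]]:
    """Group packets into sessions and build one feature vector per session."""
    sessions: Dict[Tuple, List[Packet]] = defaultdict(list)
    for p in packets:
        sessions[session_key(p)].append(p)

    features: Dict[Tuple, List[float]] = {}
    for key, pkts in sessions.items():
        times = [p.ts for p in pkts]
        sizes = [p.length for p in pkts]
        features[key] = [
            max(times) - min(times),   # session duration
            float(len(pkts)),          # number of packets
            float(sum(sizes)),         # total bytes
            sum(sizes) / len(sizes),   # mean packet size
        ]
    return features


packets = [
    Packet(0.0, "10.0.0.1", 1234, "10.0.0.2", 80, "TCP", 60),
    Packet(0.5, "10.0.0.2", 80, "10.0.0.1", 1234, "TCP", 1500),
    Packet(2.0, "10.0.0.1", 1234, "10.0.0.2", 80, "TCP", 40),
    Packet(1.0, "10.0.0.3", 5353, "10.0.0.9", 53, "UDP", 80),
]
features = session_features(packets)
for key, vector in features.items():
    print(key, vector)
```

Each resulting vector corresponds to one row of the data set D used in the stacking steps.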
2. Training and updating of multi-cycle stacking model
Once the feature vectors to be detected are obtained, a suitable machine learning model is selected and trained in order to achieve the goal of intrusion detection, namely detecting abnormal traffic, and the trained model is then used to predict on the extracted feature vectors. As the system continues to operate, the feedback mechanism continuously generates data; because the data cannot all be stored indefinitely, part of it must be deleted periodically, which causes information loss.
To make full use of the data, an improved multi-cycle stacking method is proposed that addresses the problem caused by data deletion by performing improved stacking integration of historically trained models. The specific scheme is as follows:
First, a base learner and a meta learner are selected. Because of its good prediction performance, the LightGBM model can be chosen as the base learner, and the meta learner can be a simpler Logistic Regression model or another, more complex model. The total number N of base learners to be integrated is set first, of which N-1 are historical models. When the system is first deployed, fewer than N-1 historical models are available; in that case the improved stacking integration is simply performed with the models currently held, which does not affect the overall flow of the system. To prevent overfitting, the number cv of folds into which stacking splits the training data set for cross-validation must also be set. The following describes how the intrusion detection system is updated when it moves from the current period to the next. The steps are:
Step 1: first check the number T of base classifiers currently in the system. If T = 0, the system is in its initial deployment period: the LightGBM model is trained on all current training data and then deployed, and new training data accumulates as the period continues. When the period ends, part of the older data is deleted, the next period begins, and the procedure returns to Step 1. If T > 0, go to Step 3 if T = N, otherwise go to Step 2.
Step 2: when 0 < T < N, the improved stacking integration algorithm is used. Let the currently held data set D be an n × m matrix, where n is the number of samples in the data set and m is the number of features in the extracted feature vector.
1) First, divide the currently held data set D into cv parts, denoted
$D = \{D_1, D_2, \ldots, D_{cv}\}$
2) For each part, e.g. the i-th part $D_i$, train a LightGBM base classifier on the remaining data $D - D_i$, then use that model to predict on $D_i$, which took no part in training, to obtain the prediction result $F_i$. The result is either a predicted probability or a predicted class.
3) Concatenate the result with the data $D_i$ along the feature dimension to obtain a new data matrix
$D_i' = [D_i, F_i]$
The entire data set is then transformed into
$D' = \{D_1', D_2', \ldots, D_{cv}'\}$
4) The historical-period models are handled similarly. Because they are already trained and retain the information of the historical data, no cross-validation split and merge is needed: each historical model directly predicts on the data set D to obtain its prediction result $P_j$ (j = 1, ..., T), and these results are concatenated with the transformed data set to give the final transformed data set
$D'' = [D', P_1, \ldots, P_T]$
5) Use the new data set
$D''$
to train the meta learner, then train the LightGBM model on the data set D to obtain the base classifier of the current period, which serves as a historical model in the next period.
6) Combine all the models according to the structure of the stacking ensemble framework to obtain the stacked model for deployment.
Return to Step 1 after the period ends.
Step 3: here the number T of existing base classifiers equals N. The overall update is the same as in Step 2, except that in sub-step 4) one historical model must be discarded so that the total number of base classifiers remains N. Return to Step 1 after the period ends.
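Sub-steps 1) to 4) of Step 2 can be sketched in plain Python as follows. This is a toy illustration, not the patented implementation: `train_centroid` is a hypothetical stand-in for the base learner (the description suggests LightGBM), the interleaved fold assignment is an assumed way of splitting D into cv parts, and the historical model is a hand-written function standing in for a classifier trained in an earlier period.

```python
from typing import Callable, List, Sequence

Row = List[float]
Model = Callable[[Row], float]   # maps one sample to a predicted class (0.0 or 1.0)


def train_centroid(X: Sequence[Row], y: Sequence[int]) -> Model:
    """Toy base learner: predict the class whose centroid is nearer."""
    def mean(rows: List[Row]) -> Row:
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos = mean([x for x, t in zip(X, y) if t == 1])
    neg = mean([x for x, t in zip(X, y) if t == 0])

    def predict(x: Row) -> float:
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
        return 1.0 if d_pos < d_neg else 0.0

    return predict


def stacked_features(X: Sequence[Row], y: Sequence[int], cv: int,
                     train: Callable[[Sequence[Row], Sequence[int]], Model],
                     history: Sequence[Model]) -> List[Row]:
    """Build the meta learner's input: original features, the out-of-fold
    prediction of the current base learner, and each historical model's
    direct prediction on the same sample."""
    n = len(X)
    oof = [0.0] * n
    for i in range(cv):
        held = set(range(i, n, cv))                    # part D_i (assumed split)
        rest = [j for j in range(n) if j not in held]  # D - D_i
        model = train([X[j] for j in rest], [y[j] for j in rest])
        for j in held:
            oof[j] = model(X[j])                       # F_i, never seen in training
    return [list(X[j]) + [oof[j]] + [h(X[j]) for h in history] for j in range(n)]


X = [[0.0], [0.2], [0.1], [0.9], [1.0], [0.8]]
y = [0, 0, 0, 1, 1, 1]
old_model: Model = lambda x: 1.0 if x[0] > 0.5 else 0.0   # pretend historical model

Z = stacked_features(X, y, cv=3, train=train_centroid, history=[old_model])
for row in Z:
    print(row)
```

Each row of `Z` has m + 1 + T columns (here 1 + 1 + 1). A meta learner such as Logistic Regression would then be trained on `Z` and `y`, which is sub-step 5) and is not shown here.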

Claims (3)

1. An intrusion detection system based on multi-cycle model stacking, characterized by comprising the following:
extracting features from raw traffic data: first capturing the raw network traffic as pcap-format packets with a packet capture tool, then segmenting the packets into a plurality of sessions with a packet splitting tool, collecting the data to be detected on the basis of the sessions, and finally generating a feature vector for each session for detection;
selecting a base learner and a meta learner: first setting the total number N of base learners to be integrated, of which N-1 are historical models, and, to prevent overfitting, also setting the number cv of folds into which stacking splits the training data set for cross-validation; when the intrusion detection system moves from the current period to the next period, updating the system by improved stacking integration, specifically as follows:
step 1: first checking the number T of base classifiers currently in the system; if T = 0, the system is in its initial deployment period, the base learner is trained on all current training data and then deployed, and new training data accumulates as the period continues; when the period ends, part of the older data is deleted, the next period begins, and the procedure returns to step 1; if T > 0, going to step 3 if T = N, otherwise going to step 2;
step 2: when 0 < T < N, using the improved stacking integration algorithm: let the currently held data set D be an n × m matrix, where n is the number of samples in the data set and m is the number of features in the extracted feature vector; then:
1) dividing the currently held data set D into cv parts, denoted
$D = \{D_1, D_2, \ldots, D_{cv}\}$
2) for each part, e.g. the i-th part $D_i$, training a base learner on the remaining data $D - D_i$, then using that model to predict on $D_i$, which took no part in training, to obtain the prediction result $F_i$, the result being either a predicted probability or a predicted class;
3) concatenating the result with the data $D_i$ along the feature dimension to obtain a new data matrix
$D_i' = [D_i, F_i]$
the entire data set then being transformed into
$D' = \{D_1', D_2', \ldots, D_{cv}'\}$
4) handling the historical-period models similarly: because they are already trained and retain the information of the historical data, no cross-validation split and merge is needed; each historical model directly predicts on the data set D to obtain its prediction result $P_j$ (j = 1, ..., T), and these results are concatenated with the transformed data set to give the final transformed data set
$D'' = [D', P_1, \ldots, P_T]$
5) using the resulting new data set
$D''$
to train the meta learner, then training a base learner on the data set D to obtain the base classifier of the current period, which serves as a historical model in the next period;
6) combining all the models according to the stacking ensemble framework to obtain the stacked model;
returning to step 1 after the period ends;
step 3: here the number T of existing base classifiers equals N; the overall update is the same as in step 2, except that in sub-step 4) one historical model must be discarded so that the total number of base classifiers remains N; returning to step 1 after the period ends.
2. The system of claim 1, wherein the base learner uses a LightGBM model.
3. The system of claim 1, wherein the meta-learner employs a Logistic Regression model.
CN202010372115.9A 2020-05-06 2020-05-06 Intrusion detection system based on multicycle model stacking Active CN111683048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010372115.9A CN111683048B (en) 2020-05-06 2020-05-06 Intrusion detection system based on multicycle model stacking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010372115.9A CN111683048B (en) 2020-05-06 2020-05-06 Intrusion detection system based on multicycle model stacking

Publications (2)

Publication Number Publication Date
CN111683048A CN111683048A (en) 2020-09-18
CN111683048B true CN111683048B (en) 2021-05-07

Family

ID=72433411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010372115.9A Active CN111683048B (en) 2020-05-06 2020-05-06 Intrusion detection system based on multicycle model stacking

Country Status (1)

Country Link
CN (1) CN111683048B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108023876A (en) * 2017-11-20 2018-05-11 西安电子科技大学 Intrusion detection method and intruding detection system based on sustainability integrated study
CN110351307A (en) * 2019-08-14 2019-10-18 杭州安恒信息技术股份有限公司 Abnormal user detection method and system based on integrated study
CN110874373A (en) * 2019-12-10 2020-03-10 杭州岑石能源科技有限公司 Linear variation relation judgment method based on machine learning stacking model

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102291392B (en) * 2011-07-22 2015-03-25 中国电力科学研究院 Hybrid intrusion detection method based on Bagging algorithm
US9178812B2 (en) * 2013-06-05 2015-11-03 Cisco Technology, Inc. Stacking metadata contexts for service chains
CN106973057B (en) * 2017-03-31 2018-12-14 浙江大学 A kind of classification method suitable for intrusion detection
CN108962397B (en) * 2018-06-06 2022-07-15 中国科学院软件研究所 Pen and voice-based cooperative task nervous system disease auxiliary diagnosis system
CN110247910B (en) * 2019-06-13 2022-08-09 深信服科技股份有限公司 Abnormal flow detection method, system and related components
CN110414554B (en) * 2019-06-18 2022-03-22 浙江大学 Stacking ensemble learning fish identification method based on multi-model improvement
CN110674947B (en) * 2019-09-02 2021-02-19 三峡大学 Spectral feature variable selection and optimization method based on Stacking integrated framework
CN110763660B (en) * 2019-10-22 2021-07-30 中国科学院广州地球化学研究所 LIBS quantitative analysis method based on ensemble learning

Also Published As

Publication number Publication date
CN111683048A (en) 2020-09-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant