CN111683048B - Intrusion detection system based on multicycle model stacking - Google Patents
- Publication number
- CN111683048B (application CN202010372115.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- training
- model
- period
- historical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
Abstract
The invention discloses an intrusion detection system based on multi-period model stacking. As the system runs, it gradually accumulates more training data. Through an improved stacking method, the system integrates the models trained in historical periods with the model trained on current data, so that the historical data of multiple periods can be used more effectively. This addresses the information loss that arises when insufficient device storage prevents all training data from being cached: even after historical data are discarded, the information they contain is still reflected in the latest training through the historically trained models, thereby improving the detection performance of the intrusion detection system.
Description
Technical Field
The invention belongs to the technical field of network security and relates to intrusion detection systems (IDS), in particular to an intrusion detection system based on multi-cycle model stacking.
Background
Networks have become an indispensable part of daily life, and network security is attracting more and more attention. Deploying intrusion detection systems in a network environment has become an important means of guarding against network security risks. The first step of intrusion detection is to identify abnormal traffic by extracting features from the traffic to be detected and examining them; classifying the type of abnormal traffic then helps with its subsequent handling.
Existing methods rely on two kinds of technology. The first is matching on attack signatures: the traffic to be detected is matched against the known characteristics of network attacks to identify abnormal traffic. This method can only detect known types of intrusion and cannot detect unknown attack types. The second is machine learning: features are extracted from network traffic sessions, the features to use are selected, a machine learning model is trained on the extracted data to obtain a prediction model, and the model judges whether each network traffic session is normal or abnormal. Sessions judged abnormal can be examined further and classified by type of anomaly. This method can detect unknown attack types, but conventional applications of machine learning to intrusion detection consider only the information in the training data currently held. As the intrusion detection system runs, the training data it holds gradually grow; because storage space is limited, newly obtained data cannot be cached indefinitely, so part of the older data must be discarded. The model is then retrained on the new data and redeployed, and the information in the old data set can no longer be used.
Disclosure of Invention
The invention aims to provide an intrusion detection system based on multi-cycle model stacking for detecting and identifying network intrusions. The system makes effective use of the historical data of multiple cycles and avoids the problem of conventional methods, in which old information is easily lost and can no longer be used.
The technical scheme adopted by the invention is as follows:
an intrusion detection system based on multi-cycle model stacking, comprising the following:
Features are extracted from the original traffic data. First, a packet capture tool such as Wireshark captures the original traffic on the network as pcap-format packets, and a packet splitting tool then splits the packets into multiple sessions. The data to be detected are collected on this basis. Feature extraction mainly collects the header information of each session and its statistical features of time, traffic volume, and packet count. The relevant statistical features are extracted from the sessions produced by the split, and a feature vector is finally generated for each session for detection.
Improved stacking integration is performed on the historically trained models using an improved multi-period stacking model, as follows:
A base learner and a meta learner are selected. First, the total number N of base learners to be integrated is set, of which N-1 are historical models. When the system is first deployed, not that many historical models are available; in that case improved stacking integration is simply performed with the models currently held, which does not affect the overall flow of the system. To prevent overfitting, the number cv of folds into which stacking's cross-validation splits the training data set must also be set. When the intrusion detection system moves from the current period to the next, the system is updated by improved stacking integration, specifically as follows:
Step 1: first check the number T of base classifiers in the current system. If T = 0, the system is in its initial deployment period: the base learner is simply trained on all current training data and deployed, and new training data accumulate as the current period runs. When the current period ends, part of the older data is deleted, the next period begins, and the procedure returns to step 1. If T > 0, go to step 3 if T = N; otherwise go to step 2.
Step 2: when 0 < T < N, the improved stacking integration algorithm is used. Assume the currently held data set $D$ is an $n \times m$ matrix, where $n$ is the number of samples in the data set and $m$ is the number of features in the extracted feature vector. Then:
1) Split the data set $D$ into cv subsets $D_1, \dots, D_{cv}$ for cross-validation;
2) For each subset, e.g. the $i$-th subset $D_i$, train a base learner on the remaining data $D - D_i$, then use that model to predict on $D_i$, which did not take part in training, to obtain the prediction result $F_i$, which is either a predicted probability or a predicted category;
3) Concatenate the result $F_i$ with the data $D_i$ feature-wise to obtain a new data matrix $D_i' = [D_i, F_i]$. The entire data set is thereby transformed into $D' = [D, F]$, where $F$ collects $F_1, \dots, F_{cv}$ in the row order of $D$;
4) The historical-period models are handled similarly. Because the historical models are already trained and retain the information of the historical data, the data set does not need to be split for cross-validation and recombined; each of the $K$ historical models simply predicts directly on the data set $D$ to obtain its prediction result $H_k$, and these results are merged with the transformed data set to give the final transformed data set $D'' = [D, F, H_1, \dots, H_K]$;
5) Train the meta learner on the resulting new data set $D''$, then train a base learner on the data set $D$ to obtain the base classifier of the current period, which will serve as a historical model in the next period;
6) Combine all the models according to the structure of the stacking ensemble framework to obtain the stacked model;
Return to step 1 after the period ends.
Step 3: here the number T of existing base classifiers equals N. The overall update is the same as in step 2, the only difference being that in sub-step 4) one historical model must be discarded so that the total number of base classifiers remains N. Return to step 1 after the period ends.
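The branching between steps 1, 2 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `update_plan` and `roll_pool` are hypothetical, models are represented by placeholder strings, and the oldest model is assumed to be the one discarded when the pool is full (the text only says that one historical model is dropped).

```python
def update_plan(T, N):
    """Return which branch of the update procedure applies for T existing base classifiers."""
    if not 0 <= T <= N:
        raise ValueError("T must lie between 0 and N")
    if T == 0:
        return "train_base_only"    # step 1: initial deployment period
    if T < N:
        return "stack"              # step 2: stack with every existing model
    return "stack_and_discard"      # step 3: pool is full, one history model is dropped


def roll_pool(pool, new_base, N):
    """Add the current-period base classifier, keeping at most N classifiers in total."""
    pool = list(pool)
    if len(pool) == N:
        pool.pop(0)                 # assumption: the oldest model is the one discarded
    pool.append(new_base)
    return pool


# Simulate five periods with N = 3 (models are placeholder strings).
N = 3
pool, log = [], []
for period in range(5):
    log.append(update_plan(len(pool), N))
    pool = roll_pool(pool, f"base_{period}", N)
```

After the pool fills up, every later period takes the step 3 branch and the pool size stays at N.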
In the above technical scheme, the lightGBM model may be chosen as the base learner because of its good prediction performance, and the meta learner may be a simpler Logistic Regression model or another, more complex model.
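As a rough illustration of how a stacked model with a logistic-regression meta learner could score one session at detection time, the sketch below feeds the original feature vector together with every base model's output into a sigmoid. The function `stacked_predict`, the two toy base models, and the weights and bias are all hypothetical stand-ins chosen by hand; in practice the base models would be trained classifiers and the weights would come from fitting the meta learner.

```python
import math


def stacked_predict(x, base_models, weights, bias):
    """Score one session feature vector x with a stacked model."""
    # Meta-input: the original features followed by every base model's output.
    z = list(x) + [model(x) for model in base_models]
    if len(weights) != len(z):
        raise ValueError("one weight is needed per meta-input column")
    logit = sum(w * v for w, v in zip(weights, z)) + bias
    return 1.0 / (1.0 + math.exp(-logit))   # logistic-regression probability


# Hypothetical base models: each maps a feature vector to an attack probability.
current = lambda x: 0.9 if x[0] > 100 else 0.1    # stands in for the current-period model
history = lambda x: 0.8 if x[1] > 5 else 0.2      # stands in for one historical model
weights = [0.0, 0.0, 2.0, 2.0]                    # hand-set, not learned
bias = -2.0

p_attack = stacked_predict([150.0, 9.0], [current, history], weights, bias)
p_normal = stacked_predict([20.0, 1.0], [current, history], weights, bias)
```

The column order of the meta-input must match the order used when the meta learner was trained.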
The invention has the following beneficial effects:
In the intrusion detection system based on multi-period model stacking, more training data gradually accumulate as the system runs. Through an improved stacking method, the system integrates the models trained in historical periods with the model trained on current data, so that the historical data of multiple periods can be used more effectively. This addresses the information loss that arises when insufficient device storage prevents all training data from being cached: even after historical data are discarded, the information they contain is still reflected in the latest training through the historically trained models, so the detection performance of the intrusion detection system can be improved.
Drawings
FIG. 1 is a block schematic diagram of a detection system of the present invention.
Fig. 2 is a flow diagram of multi-cycle stacking integration.
Detailed Description
The present invention is further illustrated by the following examples, which are not intended to be limiting.
Data and models are the most important elements in the actual deployment of a machine-learning-based intrusion detection system. As the system operates, a large amount of data usable for training the machine learning model gradually accumulates through a feedback mechanism. However, because device storage is limited, the device cannot cache all acquired data as the volume grows, and part of the older data must be deleted from the device. The multi-period model stacking intrusion detection system provided by the invention lets the intrusion detection system use historically trained models: a model is first trained on the original data and, as the data are continually updated, a model is retrained on the updated data in each new period. The system can therefore still use the information contained in the discarded data, which alleviates the information loss caused by device storage limits and strengthens the generalization ability of the model.
The invention specifically comprises the following steps:
1. Data feature extraction
Before detection can be performed, features must be extracted from the original traffic data and preprocessed.
First, a packet capture tool such as Wireshark captures the original traffic on the network as pcap-format packets, and a packet splitting tool then splits the packets into multiple sessions. The data to be detected are collected on this basis. Feature extraction mainly collects the header information of each session and its statistical features of time, traffic volume, and packet count. The relevant statistical features are extracted from the sessions produced by the split, and a feature vector is finally generated for each session for detection.
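A minimal sketch of the per-session statistics might look as follows. It assumes the pcap file has already been parsed into simple packet records; the dictionary fields (`ts`, `src`, `sport`, `dst`, `dport`, `proto`, `length`), the direction-independent 5-tuple session key, and the four chosen statistics are illustrative assumptions, not the feature set used by the invention.

```python
from collections import defaultdict
from statistics import mean


def session_key(pkt):
    """Direction-independent 5-tuple, so both directions fall into one session."""
    a = (pkt["src"], pkt["sport"])
    b = (pkt["dst"], pkt["dport"])
    return (pkt["proto"],) + tuple(sorted([a, b]))


def extract_features(packets):
    """Group packets into sessions and build one feature vector per session."""
    sessions = defaultdict(list)
    for pkt in packets:
        sessions[session_key(pkt)].append(pkt)

    features = {}
    for key, pkts in sessions.items():
        times = [p["ts"] for p in pkts]
        sizes = [p["length"] for p in pkts]
        features[key] = [
            max(times) - min(times),   # session duration (time statistic)
            sum(sizes),                # total bytes (traffic statistic)
            len(pkts),                 # packet count
            mean(sizes),               # mean packet length
        ]
    return features


# Hypothetical packet records standing in for parsed pcap data.
packets = [
    {"ts": 0.0, "src": "10.0.0.1", "sport": 5000, "dst": "10.0.0.2", "dport": 80, "proto": "TCP", "length": 60},
    {"ts": 0.5, "src": "10.0.0.2", "sport": 80, "dst": "10.0.0.1", "dport": 5000, "proto": "TCP", "length": 1500},
    {"ts": 1.0, "src": "10.0.0.3", "sport": 53000, "dst": "10.0.0.2", "dport": 53, "proto": "UDP", "length": 80},
]
session_features = extract_features(packets)
```

Each value in `session_features` is the feature vector of one session and would be one row of the training or detection matrix.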
2. Training and updating of multi-cycle stacking model
Once the feature vectors to be detected have been obtained, a suitable machine learning model is selected and trained in order to detect abnormal traffic, and the trained model is then used to predict on the extracted feature vectors. As the system continues to operate, the feedback mechanism continuously generates data. Since the data cannot all be stored indefinitely, part of the data must be deleted periodically, which causes information loss.
To make full use of the data, an improved multi-cycle stacking method is proposed, which addresses the problem caused by data deletion by performing improved stacking integration on the historically trained models. The specific scheme is as follows:
First, a base learner and a meta learner are selected. Because the lightGBM model has good prediction performance, it can be chosen as the base learner, and the meta learner can be a simpler Logistic Regression model or another, more complex model. The total number N of base learners to be integrated is set first, of which N-1 are historical models. When the system is first deployed, not that many historical models are available; in that case improved stacking integration is simply performed with the models currently held, which does not affect the overall flow of the system. To prevent overfitting, the number cv of folds into which stacking's cross-validation splits the training data set must also be set. The following mainly describes how the intrusion detection system is updated when it moves from the current cycle to the next. The steps are as follows:
Step 1: first check the number T of base learners in the current system. If T = 0, the system is in its initial deployment period: the lightGBM model is simply trained on all current training data and deployed, and new training data accumulate as the current period runs. When the current period ends, part of the older data is deleted, the next period begins, and the procedure returns to step 1. If T > 0, go to step 3 if T = N; otherwise go to step 2.
Step 2: when 0 < T < N, the improved stacking integration algorithm is used. Assume the currently held data set $D$ is an $n \times m$ matrix, where $n$ is the number of samples in the data set and $m$ is the number of features in the extracted feature vector.
1) Split the data set $D$ into cv subsets $D_1, \dots, D_{cv}$ for cross-validation.
2) For each subset, e.g. the $i$-th subset $D_i$, train a LightGBM base classifier on the remaining data $D - D_i$, then use it to predict on $D_i$, which did not take part in training, to obtain the prediction result $F_i$, which is either a predicted probability or a predicted category.
3) Concatenate the result $F_i$ with the data $D_i$ feature-wise to obtain a new data matrix $D_i' = [D_i, F_i]$. The entire data set is thereby transformed into $D' = [D, F]$, where $F$ collects $F_1, \dots, F_{cv}$ in the row order of $D$.
4) The historical-period models are handled similarly. Because the historical models are already trained and retain the information of the historical data, the data set does not need to be split for cross-validation and recombined; each of the $K$ historical models simply predicts directly on the data set $D$ to obtain its prediction result $H_k$, and these results are merged with the transformed data set to give the final transformed data set $D'' = [D, F, H_1, \dots, H_K]$.
5) Train the meta classifier on the new data set $D''$, then train a lightGBM model on the data set $D$ to obtain the base classifier of the current cycle, which will serve as a historical model in the next cycle.
6) Combine all the models according to the structure shown in the stacking ensemble framework diagram to obtain the stacked model for deployment.
Return to step 1 after the period ends.
Step 3: here the number T of existing base classifiers equals N. The overall update is the same as in step 2, the only difference being that in sub-step 4) one historical model must be discarded so that the total number of base classifiers remains N. Return to step 1 after the period ends.
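The data transformation at the heart of step 2 can be sketched in plain Python: out-of-fold predictions of the current base learner and direct predictions of each historical model are appended to the original features as extra columns. This is a sketch under stated assumptions, not the patented implementation: `fit_centroid` is a deliberately tiny nearest-centroid stand-in for the base learner (it is not LightGBM), the interleaved fold split is one arbitrary way of dividing the data into cv parts, labels `y` are assumed to be available, and training the meta learner on the result is left out.

```python
def fit_centroid(X, y):
    """Stand-in base learner (nearest class centroid), NOT LightGBM."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(centroids, key=lambda label: sq_dist(centroids[label]))

    return predict


def stacking_transform(X, y, history_models, cv, fit=fit_centroid):
    """Append out-of-fold and historical predictions to every row of X."""
    n = len(X)
    oof = [None] * n
    for i in range(cv):
        held_out = set(range(i, n, cv))   # fold i (interleaved split, an assumption)
        train = [j for j in range(n) if j not in held_out]
        model = fit([X[j] for j in train], [y[j] for j in train])  # trained without fold i
        for j in held_out:
            oof[j] = model(X[j])          # prediction for a sample the model never saw
    # Historical models are already trained, so they predict on the data directly.
    return [list(x) + [oof[j]] + [h(x) for h in history_models]
            for j, x in enumerate(X)]


# Toy data: two well-separated classes (0 = normal, 1 = abnormal).
X = [[0.0, 1.0], [0.2, 0.8], [0.1, 1.1], [5.0, 6.0], [5.2, 5.9], [4.9, 6.1]]
y = [0, 0, 0, 1, 1, 1]
old_model = lambda x: 1 if x[0] > 2.5 else 0   # hypothetical model kept from an earlier period

D_new = stacking_transform(X, y, [old_model], cv=3)   # rows: features + current + history
new_base = fit_centroid(X, y)   # current-period base classifier, next period's history model
```

The meta learner would then be fitted on `D_new` with the labels `y`, and `new_base` would join the pool of historical models for the next period.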
Claims (3)
1. An intrusion detection system based on multi-cycle model stacking, characterized by comprising the following:
extracting features from original traffic data: first capturing pcap-format packets of the original traffic on a network with a packet capture tool, then splitting the packets into a plurality of sessions with a packet splitting tool, collecting the data to be detected on the basis of the sessions, and finally generating a feature vector for each session for detection;
selecting a base learner and a meta learner: first setting the total number N of base learners to be integrated, of which N-1 are historical models, and, to prevent overfitting, also setting the number cv of folds into which stacking's cross-validation splits the training data set; when the intrusion detection system moves from the current period to the next, updating the system by improved stacking integration, specifically as follows:
step 1: first checking the number T of base classifiers in the current system; if T = 0, the system is in its initial deployment period, the base learner is simply trained on all current training data and deployed, and new training data accumulate as the current period runs; when the current period ends, part of the older data is deleted, the next period begins, and the procedure returns to step 1; if T > 0, going to step 3 if T = N, and otherwise going to step 2;
step 2: when 0 < T < N, using the improved stacking integration algorithm: assuming the currently held data set $D$ is an $n \times m$ matrix, where $n$ is the number of samples in the data set and $m$ is the number of features in the extracted feature vector, then:
1) splitting the data set $D$ into cv subsets $D_1, \dots, D_{cv}$ for cross-validation;
2) for each subset, e.g. the $i$-th subset $D_i$, training a base learner on the remaining data $D - D_i$, then using that model to predict on $D_i$, which did not take part in training, to obtain the prediction result $F_i$, which is either a predicted probability or a predicted category;
3) concatenating the result $F_i$ with the data $D_i$ feature-wise to obtain a new data matrix $D_i' = [D_i, F_i]$, the entire data set thereby being transformed into $D' = [D, F]$, where $F$ collects $F_1, \dots, F_{cv}$ in the row order of $D$;
4) handling the historical-period models similarly: because the historical models are already trained and retain the information of the historical data, the data set does not need to be split for cross-validation and recombined; each of the $K$ historical models simply predicts directly on the data set $D$ to obtain its prediction result $H_k$, and these results are merged with the transformed data set to give the final transformed data set $D'' = [D, F, H_1, \dots, H_K]$;
5) training the meta learner on the resulting new data set $D''$, then training a base learner on the data set $D$ to obtain the base classifier of the current period, which serves as a historical model in the next period;
6) combining all the models according to the structure of the stacking ensemble framework to obtain the stacked model;
returning to step 1 after the period ends;
step 3: here the number T of existing base classifiers equals N; the overall update is the same as in step 2, the only difference being that in sub-step 4) one historical model must be discarded so that the total number of base classifiers remains N; returning to step 1 after the period ends.
2. The system of claim 1, wherein the base learner uses a lightGBM model.
3. The system of claim 1, wherein the meta-learner employs a Logistic Regression model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010372115.9A CN111683048B (en) | 2020-05-06 | 2020-05-06 | Intrusion detection system based on multicycle model stacking |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111683048A | 2020-09-18 |
CN111683048B | 2021-05-07 |
Family
ID=72433411
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010372115.9A Active CN111683048B (en) | 2020-05-06 | 2020-05-06 | Intrusion detection system based on multicycle model stacking |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111683048B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||