CN115600194A - Intrusion detection method, storage medium and device based on XGboost and LGBM - Google Patents
- Publication number
- CN115600194A (application number CN202211391189.2A)
- Authority
- CN
- China
- Prior art keywords
- classifier
- lgbm
- xgboost
- data
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/552—Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Computer Security & Cryptography (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computer Hardware Design (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an intrusion detection method, a storage medium and a device based on XGBoost and LGBM. The method comprises the following steps: acquiring a data set and performing data preprocessing on it; performing feature selection on the preprocessed data set according to the information gain and FCBF algorithms; classifying the data with an XGBoost classifier and an LGBM classifier; and optimizing both classifiers, then selecting the classifier with the better performance to output the classification result according to a comparison of the classifiers' performance. By preprocessing the data set, the invention alleviates the sample-imbalance problem of the data set while improving its quality; by continually optimizing the classifiers, their classification performance is improved, and selecting the higher-performing classifier to output the classification result improves the detection precision of intrusion detection and reduces its false alarm rate.
Description
Technical Field
The invention relates to the technical field of network security, in particular to an intrusion detection method, a storage medium and equipment based on XGboost and LGBM.
Background
With the increasing popularity of the internet in modern life, a large number of devices have become interconnected and interact over networks, and with them a large number of device security problems have arisen, so the security of cyberspace has received much attention. Intrusion Detection Systems (IDS), used to efficiently detect various malicious attacks on a network, are among the most critical systems for maintaining cyberspace security. From a Machine Learning (ML) perspective, an IDS can be defined as a system that classifies network traffic; the simplest model is a binary classifier that distinguishes normal from malicious traffic, thereby detecting intrusions. With recent advances in ML research, many studies have shown that ML algorithms can be designed to implement an IDS.
Machine learning methods can obtain good results when a data set is properly preprocessed and classified. Although there has been research in the field of traffic anomaly detection, machine-learning approaches still face two problems: first, few current works propose a truly effective solution to the sample-imbalance problem in intrusion detection; second, the detection precision and false alarm rate often fail to meet product requirements, so such methods see little practical application.
Therefore, how to balance the samples and improve detection accuracy is a problem to be solved in network intrusion detection.
Disclosure of Invention
In order to overcome the technical defects, the invention provides an intrusion detection method, a storage medium and equipment based on XGboost and LGBM, which can improve the accuracy of network intrusion detection.
In order to solve the problems, the invention is realized according to the following technical scheme:
in a first aspect, the present invention provides an intrusion detection method based on XGBoost and LGBM, comprising the steps of:
acquiring a data set, and performing data preprocessing on the data set;
performing feature selection on the preprocessed data set according to the information gain and FCBF algorithm;
the XGboost classifier and the LGBM classifier are adopted to classify the data of the data set;
and optimizing the XGBoost classifier and the LGBM classifier, and selecting the classifier with the better performance to output the classification result according to a comparison of the classifiers' performance.
As an improvement of the above solution, the acquiring a data set and the preprocessing the data set include the steps of:
deleting null values, incorrectly formatted values and repeated values in the data set, so that each repeated value retains only one valid record;
dividing data in a data set into K clusters by adopting a K-Means clustering algorithm, and randomly selecting data from each cluster as an instantiated subset;
the data set is standardized according to an Altman Z-score model;
and carrying out data balancing on the data set by adopting the DSSTE algorithm.
As an improvement of the above solution, the data balancing of the data set by using the DSSTE algorithm includes the steps of:
dividing an unbalanced training set into a near neighbor set and a far neighbor set by adopting an ENN algorithm, defining the near neighbor set as a difficult sample, and defining the far neighbor set as a simple sample;
compressing multiple samples in the difficult samples by adopting a K-Means clustering algorithm, and replacing clustering by a clustering center;
and amplifying few samples in the difficult samples, and combining the simple samples, the amplified difficult samples and the compressed difficult samples to form a new training set.
As an improvement of the scheme, the hyperparameter K of the K-Means clustering algorithm is optimized by a BO-GP algorithm.
As an improvement of the above solution, the feature selection of the preprocessed data set according to the information gain and FCBF algorithms includes the steps of:
calculating an IG value of each feature according to an information gain algorithm, standardizing the IG value of each feature into a value between 0 and 1, sequencing the IG values of all the features, sequentially selecting the features from large value to small value, stopping selection until a first threshold value is reached, and removing the unselected features;
and calculating the similarity of every two features by adopting an FCBF algorithm, comparing the two feature IG values subjected to similarity calculation if the similarity value is greater than a second threshold value, removing the features with low IG values, and repeating the step until the similarity of any two features in the data set is less than the second threshold value.
As an improvement of the scheme, the first threshold and the second threshold are optimized by adopting a BO-GP algorithm.
As an improvement of the above scheme, the classifying the data by using the XGBoost and LGBM classifiers includes:
dividing the data set into a training set and a testing set, wherein the training set comprises 70% of data samples of the data set, and the testing set comprises 30% of data samples of the data set;
and iteratively training the model using ten-fold cross-validation, wherein in each iteration 90% of the original training set is used for model training and the remaining 10% serves as a validation set for model testing.
As an improvement of the above scheme, the optimizing of the XGBoost and LGBM classifiers and the outputting of the classification result by the better-performing classifier according to the comparison of classifier performance comprise the steps of:
carrying out hyper-parameter optimization on the XGboost classifier and the LGBM classifier by adopting a BO-TPE algorithm;
and calculating the accuracy of the LGBM and XGBoost classifiers and the time each takes to reach its optimal accuracy, then comparing them: when the accuracy of the XGBoost classifier is greater than that of the LGBM classifier and the XGBoost classifier reaches its optimal accuracy in less time, selecting the XGBoost classifier to output the classification result; otherwise, selecting the LGBM classifier to output the classification result.
In a second aspect, the present invention provides a computer-readable storage medium having at least one instruction, at least one program, code set, or set of instructions stored therein, which is loaded and executed by a processor to implement the XGBoost and LGBM-based intrusion detection method according to the first aspect.
In a third aspect, the present invention provides an apparatus comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, code set, or instruction set, which is loaded and executed by the processor to implement the XGBoost and LGBM-based intrusion detection method according to the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
by preprocessing the data of the data set, the problem of unbalance of the data set sample can be relieved, and meanwhile, the quality of the data set is improved; through constantly optimizing the classifier, the classification processing performance of the classifier is improved, and the classifier with higher performance is selected to output the classification result, so that the detection precision of intrusion detection can be improved, and the abnormal false alarm rate of intrusion detection can be reduced.
Drawings
Embodiments of the invention are described in further detail below with reference to the attached drawing figures, wherein:
fig. 1 is a schematic flow diagram of an intrusion detection method based on XGBoost and LGBM in an embodiment;
FIG. 2 is a schematic flow chart illustrating step S100 in one embodiment;
FIG. 3 is a flowchart illustrating the step S140 according to an embodiment;
FIG. 4 is a flowchart illustrating the step S200 according to an embodiment;
FIG. 5 is a flowchart illustrating the step S300 according to an embodiment;
fig. 6 is a schematic flowchart of step S400 in one embodiment.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
It should be noted that the sequence numbers mentioned herein, such as S100, S200, etc., merely distinguish the steps and do not mean that the steps must be executed strictly in numerical order.
In one embodiment, as shown in fig. 1, there is provided an XGBoost and LGBM-based intrusion detection method, including the following steps:
s100: acquiring a data set, and performing data preprocessing on the data set;
In traffic intrusion detection, the object of analysis is the data in a data set. Such data may be incomplete (e.g., uncertain or missing attribute values), noisy, or inconsistent (e.g., one attribute named differently in different tables), and may have mismatched dimensions and imbalanced samples. Analyzing unprocessed data directly yields results that are not necessarily accurate and may be inefficient, so the data must first be preprocessed to improve its quality and thereby the efficiency and quality of the analysis. Preprocessing methods include data standardization, mapping to a uniform [0,1] distribution, data normalization, data binarization, nonlinear transformation, feature encoding, handling of missing values, and so on. Here, the data set is first sampled and then subjected to sample balancing, standardization, normalization, binarization and similar processing targeted at the data set's problems, so as to improve the quality of the data.
S200: performing feature selection on the preprocessed data set according to the information gain and FCBF algorithm;
After preprocessing, high-quality data is obtained; feature engineering is then needed to select meaningful features to feed into the machine-learning algorithm and model for training. Feature selection on the data set uses information gain together with FCBF. Information gain is computed from information entropy and represents the degree to which uncertainty is eliminated; features are ranked by the size of their information gain. Information content and probability are monotonically inversely related: the smaller the probability, the larger the information content. The FCBF (Fast Correlation-Based Filter) algorithm performs feature selection by fast filtering based on Symmetrical Uncertainty (SU): within a pair of redundant features it keeps the one more relevant to the target and eliminates the less relevant one, and it uses the more relevant features to screen the others, thereby reducing time complexity.
S300: classifying data of the data set by adopting an XGboost classifier and an LGBM classifier;
the XGboost classifier and the LGBM classifier have excellent classification processing performance, and the data are classified by using the XGboost classifier and the LGBM classifier, wherein initial hyper-parameters of the XGboost classifier and the LGBM classifier are set as default values of the classifier, the default values of five hyper-parameters of the XGboost are respectively 0.3, 1, 100, 6, and the default values of six hyper-parameters of the XGboost are respectively 0.1, 100, 255, 3, 31 and 1.0.
S400: and optimizing the XGboost classifier and the LGBM classifier, and outputting classification results by the classifier with higher selectivity according to the comparison result of the performance of the classifier.
Specifically, to obtain a better classification output, the hyper-parameters of the XGBoost and LGBM classifiers are continually optimized so that both achieve better classification performance. Evaluation indexes are then computed from each classifier's classification processing, the indexes of the two classifiers are compared, and the better of the two is selected as the classifier that outputs the classification result.
In one embodiment, as shown in FIG. 2, the acquiring of the data set and the preprocessing of the data set comprise the following steps:
S110: deleting null values, incorrectly formatted values and repeated values in the data set, so that each repeated value retains only one valid record;
Specifically, the preprocessing deletes null values and incorrectly formatted values from the data set, then traverses the data set, keeping one valid record for each repeated value and discarding the remaining duplicates.
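The cleaning step above can be sketched in plain Python; the record format and the helper name `clean_records` are hypothetical, chosen only for illustration:

```python
def clean_records(records):
    """Drop records with missing or badly formatted fields, then de-duplicate,
    keeping only the first occurrence of each repeated record."""
    seen = set()
    cleaned = []
    for rec in records:
        # Delete null values.
        if any(v is None for v in rec.values()):
            continue
        # Delete incorrectly formatted values (here: non-numeric fields).
        try:
            normalized = tuple(float(v) for v in rec.values())
        except (TypeError, ValueError):
            continue
        # Keep one valid record per repeated value; reject the rest.
        if normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append(rec)
    return cleaned
```

A real pipeline would typically do the same with `pandas` (`dropna`, `drop_duplicates`), but the logic is identical.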
S120: dividing data in the data set into K clusters by adopting a K-Means clustering algorithm, and randomly selecting data from each cluster as an instantiated subset;
Specifically, useless data is removed using K-Means clustering, which divides the data in the data set into K clusters based on Euclidean, Manhattan and Mahalanobis distances; 10% of the data is then randomly selected from each cluster as an instantiated subset. K-Means aims to minimize the sum of squared distances between all data points and the centroids of their respective clusters, expressed as:

J = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} ||x_i − u_k||²

where (x_1, ..., x_n) is the data matrix; u_k, also known as the centroid of cluster C_k, is the mean of all samples in C_k; and n_k is the total number of sample points in cluster C_k.
The Euclidean distance is defined as follows:

d(x, y) = sqrt( Σ_{i=1}^{n} (x(i) − y(i))² )

where an n-dimensional Euclidean space is a set of points, each of which may be represented as (x(1), x(2), ..., x(n)), the real number x(i) (i = 1, 2, ..., n) being the i-th coordinate of x.
The Manhattan distance is defined as follows:

d = |x_1 − x_2| + |y_1 − y_2|

where (x_i, y_i) are the coordinate values of the points.
The Mahalanobis distance between data points x and y is defined as follows:

d(x, y) = sqrt( (x − y)^T Σ^{-1} (x − y) )

where Σ is the covariance matrix of the multidimensional random variable.
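The three distance measures above can be sketched with NumPy (a minimal illustration, not the patented implementation; the function names are assumptions):

```python
import numpy as np

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum(d ** 2)))

def manhattan(x, y):
    """d(x, y) = sum_i |x_i - y_i|."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sum(np.abs(d)))

def mahalanobis(x, y, cov):
    """d(x, y) = sqrt((x - y)^T inv(cov) (x - y)); cov is the covariance matrix."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance, which makes a convenient sanity check.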
S130: the data set is standardized according to an Altman Z-score model;
Z-score standardization converts data of different magnitudes to a unified Z-score so that they can be compared. For a feature value x, the standardized formula is:

z = (x − μ) / σ

where x is the original feature value, and μ and σ are respectively the mean and standard deviation of the feature.
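A minimal sketch of the Z-score step, assuming a simple one-dimensional feature vector (the helper name is illustrative):

```python
import numpy as np

def z_score(x):
    """Z-score standardization: z = (x - mu) / sigma, so the result has
    zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return (x - mu) / sigma
```

In a full pipeline this would be applied per feature column, with the training-set mean and standard deviation reused on the test set.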
S140: carrying out data balancing on the data set by adopting the DSSTE algorithm.
The DSSTE (Difficult Set Sampling Technique) algorithm is used to address the imbalanced data set. DSSTE reduces the imbalance of the original training set and provides targeted data augmentation for the minority classes that need to be learned, enabling the classifier to better learn the class differences during training and improving classification performance. Steps S130 and S140 aim to reduce the weight of the data set without losing important data information.
Specifically, as shown in fig. 3, the data balancing of the data set by using the DSSTE algorithm includes the following steps:
s141: dividing an unbalanced training set into a near neighbor set and a far neighbor set by adopting an ENN algorithm, defining the near neighbor set as a difficult sample, and defining the far neighbor set as a simple sample;
s142: compressing multiple samples in the difficult samples by adopting a K-Means clustering algorithm, and replacing clustering by a clustering center;
the data is divided into K clusters by using a K-Means method, and then 10% of the data in each cluster is randomly extracted. By the above process, i.e., by replacing clusters with 10% (i.e., cluster centers) of data within the clusters, the amount of data is reduced.
S143: and amplifying few samples in the difficult samples, and combining the simple samples, the amplified difficult samples and the compressed difficult samples to form a new training set.
In one embodiment, the hyper-parameter K of the K-Means clustering algorithm is optimized by a BO-GP algorithm.
Specifically, the BO algorithm determines the next hyper-parameter configuration to evaluate based on previous evaluation results. In BO, all data points tested so far are fitted to an objective function using a surrogate model; GP is the surrogate model of the BO-GP algorithm, whose predictions obey a Gaussian distribution:

p(y | x, D) = N(y | μ(x), σ²(x))

where D is the hyper-parameter configuration space, y = f(x) is the objective function value for each hyper-parameter configuration, and μ and σ² are the mean and (co)variance.
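The GP surrogate's Gaussian posterior can be illustrated with a small NumPy sketch of the standard GP-regression equations. The 1-D RBF kernel and all names here are assumptions for illustration, not the patent's BO-GP code:

```python
import numpy as np

def rbf(a, b, length=1.0):
    """RBF kernel matrix k(a_i, b_j) = exp(-0.5 (a_i - b_j)^2 / length^2)."""
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return np.exp(-0.5 * (a - b) ** 2 / length ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and variance of a GP surrogate at the query points:
    mu = Ks^T (K + noise I)^-1 y,  var = diag(Kss - Ks^T (K + noise I)^-1 Ks)."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_query)
    Kss = rbf(x_query, x_query)
    K_inv = np.linalg.inv(K)
    mu = Ks.T @ K_inv @ np.asarray(y_train, dtype=float)
    var = np.diag(Kss - Ks.T @ K_inv @ Ks)
    return mu, var
```

At already-evaluated configurations the posterior mean reproduces the observed objective values and the variance shrinks toward zero, which is exactly what lets BO trade off exploration against exploitation.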
In one embodiment, as shown in fig. 4, the step S200 includes the following steps:
s210: calculating an IG value of each feature according to an information gain algorithm, standardizing the IG value of each feature into a value between 0 and 1, sequencing the IG values of all the features, sequentially selecting the features from large value to small value, stopping selection until a first threshold value is reached, and removing the unselected features;
Specifically, the IG value of each feature is calculated with the Information Gain algorithm, the IG values are normalized to values between 0 and 1 and sorted, features are selected in turn from high to low, selection stops once the first threshold α is reached, and all unselected features (whose normalized IG values sum to less than 1 − α) are removed. The first threshold α is optimized with verification accuracy as the objective function. The IG value is expressed as follows:
IG(T|X)=H(T)-H(T|X)
where H (T) is the entropy of the target variable T and H (T | X) is the conditional entropy of T over X.
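The computation IG(T|X) = H(T) − H(T|X) can be sketched directly in plain Python (entropy here is in bits; the helper names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(T) in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(target, feature):
    """IG(T|X) = H(T) - H(T|X): how much knowing X reduces uncertainty in T."""
    n = len(target)
    cond = 0.0
    for x in set(feature):
        # H(T | X = x), weighted by P(X = x).
        subset = [t for t, f in zip(target, feature) if f == x]
        cond += len(subset) / n * entropy(subset)
    return entropy(target) - cond
```

A perfectly predictive feature yields IG equal to H(T); an independent feature yields IG of zero, which is why ranking by IG surfaces the informative features first.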
S220: and calculating the similarity of every two features by adopting an FCBF algorithm, comparing the two feature IG values subjected to similarity calculation if the similarity value is greater than a second threshold value, removing the features with low IG values, and repeating the step until the similarity of any two features in the data set is less than the second threshold value.
Specifically, the FCBF algorithm computes the similarity of the features obtained by the information-gain processing in step S210: any two of those features are selected for similarity calculation, and if the similarity value SU is greater than the second threshold β, the feature with the lower IG value is deleted. Feature extraction is carried out with the KPCA algorithm, and the number of extracted features and the kernel attribute of KPCA are obtained by BO-GP optimization with verification accuracy as the objective function. Similarity calculation is performed over all features until the similarity of any two features in the data set is smaller than the second threshold, completing feature selection. The similarity value SU is computed as:

SU(X, Y) = 2 · IG(X|Y) / (H(X) + H(Y))

where SU(X, Y) indicates the similarity between feature X and feature Y.
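Symmetrical Uncertainty as used by FCBF can be sketched as follows; SU(X, Y) = 2·IG(X|Y)/(H(X)+H(Y)) is the standard FCBF definition, and the helper names are illustrative:

```python
import math
from collections import Counter

def _entropy(v):
    """Shannon entropy in bits."""
    n = len(v)
    return -sum((c / n) * math.log2(c / n) for c in Counter(v).values())

def _info_gain(x, y):
    """IG(X|Y) = H(X) - H(X|Y)."""
    n = len(x)
    cond = 0.0
    for val in set(y):
        sub = [a for a, b in zip(x, y) if b == val]
        cond += len(sub) / n * _entropy(sub)
    return _entropy(x) - cond

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)); 1 = fully redundant pair,
    0 = independent pair."""
    hx, hy = _entropy(x), _entropy(y)
    if hx + hy == 0:
        return 0.0
    return 2.0 * _info_gain(x, y) / (hx + hy)
```

Normalizing by H(X) + H(Y) keeps SU in [0, 1] and removes the bias of raw information gain toward many-valued features, which is why FCBF uses SU rather than IG for redundancy checks.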
In one embodiment, the first threshold α and the second threshold β are optimized by using a BO-GP algorithm.
In one embodiment, as shown in fig. 5, the step S300 includes the following steps:
s310: dividing the data set into a training set and a testing set, wherein the training set comprises 70% of data samples of the data set, and the testing set comprises 30% of data samples of the data set;
S320: iteratively training the model using ten-fold cross-validation, wherein in each iteration 90% of the original training set is used for model training and the remaining 10% serves as a validation set for model testing.
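Steps S310 and S320 can be sketched as simple index bookkeeping (a simplified illustration without the shuffling or stratification that a real pipeline would add; names are illustrative):

```python
def split_and_folds(n_samples, train_frac=0.7, n_folds=10):
    """70/30 train/test split, then ten-fold cross-validation index sets
    over the training part: each fold uses 90% of the training set to fit
    and 10% to validate."""
    indices = list(range(n_samples))
    cut = int(n_samples * train_frac)
    train, test = indices[:cut], indices[cut:]
    folds = []
    for k in range(n_folds):
        val = train[k::n_folds]                      # every n_folds-th sample
        val_set = set(val)
        fit = [i for i in train if i not in val_set]  # the remaining 90%
        folds.append((fit, val))
    return train, test, folds
```

In practice one would use `sklearn.model_selection.train_test_split` and `KFold`, but the index arithmetic is the same.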
The classification model XGBoost is an optimized version of the Gradient Boosting Machine (GBM), improving speed and prediction performance. Its objective function is as follows:

Obj(θ) = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_{k} Ω(f_k)

The objective function includes two parts: the first part is the loss function and the second part is a regularization term. Obj(θ) denotes the objective function, n is the number of predictions, l(y_i, ŷ_i) is the training error of the i-th sample, and Ω is the regularization function.
LightGBM combines Gradient-based One-Side Sampling (GOSS) and the Exclusive Feature Bundling (EFB) algorithm to shorten training time and improve on the training performance of the XGBoost algorithm.
In one embodiment, as shown in fig. 6, the step S400 includes the following steps:
s410: carrying out hyper-parameter optimization on the XGboost classifier and the LGBM classifier by adopting a BO-TPE algorithm;
Specifically, the BO-TPE algorithm is used to optimize the XGBoost hyper-parameters "learning_rate", "colsample_bytree", "subsample", "n_estimators" and "max_depth", and the LGBM hyper-parameters "learning_rate", "n_estimators", "max_bin", "num_leaves", "max_depth" and "feature_fraction". The optimal values of "learning_rate", "colsample_bytree", "subsample", "n_estimators", "max_depth", "max_bin", "num_leaves" and "feature_fraction" are searched for in the ranges [0.02-0.2], [0.1-0.5], [0.8-2], [200-600], [3-7], [400-800], [10-50] and [0.1-0.9], respectively.
The BO-TPE hyper-parameter optimization algorithm models the following densities:

p(x | y) = l(x) if y < y*,  g(x) if y ≥ y*

where l(x) and g(x) represent the probability densities of the next hyper-parameter value lying in the well-performing and poorly-performing regions respectively; BO-TPE obtains the optimal hyper-parameter by maximizing the ratio l(x)/g(x), and y* is a threshold distinguishing relatively good results from bad ones.
S420: calculating the accuracy of the LGBM and XGBoost classifiers and the time each takes to reach its optimal accuracy, then comparing them: when the accuracy of the XGBoost classifier is greater than that of the LGBM classifier and the XGBoost classifier reaches its optimal accuracy in less time, selecting the XGBoost classifier to output the classification result; otherwise, selecting the LGBM classifier to output the classification result.
Specifically, the performance of the intrusion detection algorithm is comprehensively evaluated using Accuracy (AC), Precision (PR), Recall (RE) and F1-score (F1) as evaluation indexes, calculated as follows:

AC = (TP + TN) / (TP + TN + FP + FN)
PR = TP / (TP + FP)
RE = TP / (TP + FN)
F1 = 2 · PR · RE / (PR + RE)

where TP denotes samples predicted positive and truly positive, FP samples predicted positive but truly negative, TN samples predicted negative and truly negative, and FN samples predicted negative but truly positive.
The accuracy (ACC) of the hyper-parameter-optimized LGBM and XGBoost classifiers, and the time at which each reaches its optimal accuracy, are calculated and compared: when the accuracy of the XGBoost classifier is greater than that of the LGBM classifier and the XGBoost classifier reaches its optimal accuracy in less time, the optimized XGBoost classifier is selected to output the classification results; in all other cases the LGBM classifier is selected.
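The selection rule of step S420 can be sketched as follows. The accuracy formula and the rule (prefer XGBoost only when it is both more accurate and faster, otherwise fall back to LGBM) follow the text above; the function names are illustrative:

```python
def accuracy(tp, fp, tn, fn):
    """AC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)

def select_classifier(acc_xgb, time_xgb, acc_lgbm, time_lgbm):
    """Prefer XGBoost only if it is strictly more accurate AND reaches its
    optimal accuracy in strictly less time; in all other cases use LGBM."""
    if acc_xgb > acc_lgbm and time_xgb < time_lgbm:
        return "XGBoost"
    return "LGBM"
```

Note that the rule is conjunctive: an XGBoost model that is more accurate but slower still loses to LGBM, reflecting the patent's preference for LGBM in every other case.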
By preprocessing the data of the data set, the problem of unbalance of the data set sample can be relieved, and meanwhile, the quality of the data set is improved; through constantly optimizing the classifier, the classification processing performance of the classifier is improved, and the classifier with higher performance is selected to output the classification result, so that the detection precision of intrusion detection can be improved, and the abnormal false alarm rate of intrusion detection can be reduced.
In one embodiment, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, causes the processor to implement the XGBoost and LGBM-based intrusion detection method provided in the first aspect.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable storage media, which may include computer readable storage media (or non-transitory media) and communication media (or transitory media).
The term computer-readable storage medium includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer-readable storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as is well known to those skilled in the art.
For example, the computer readable storage medium may be an internal storage unit of the network management device in the foregoing embodiment, for example, a hard disk or a memory of the network management device. The computer readable storage medium may also be an external storage device of the network management device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are equipped on the network management device.
In one embodiment, an apparatus is provided that includes a processor and a memory to store a computer program; the processor is configured to execute the computer program and implement the XGBoost and LGBM-based intrusion detection method provided by the first aspect of the present invention when executing the computer program.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, digital Signal Processors (DSPs), application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. An intrusion detection method based on XGboost and LGBM is characterized by comprising the following steps:
acquiring a data set, and performing data preprocessing on the data set;
performing feature selection on the preprocessed data set according to the information gain and FCBF algorithm;
classifying data of the data set by adopting an XGboost classifier and an LGBM classifier;
and optimizing the XGBoost classifier and the LGBM classifier, and selecting, according to a comparison of the classifiers' performance, the better-performing classifier to output the classification result.
2. The XGBoost and LGBM-based intrusion detection method of claim 1, wherein said acquiring a data set and performing data preprocessing on the data set comprises the steps of:
deleting null values, incorrectly formatted values, and repeated values from the data set, so that only one valid copy of each record remains;
dividing data in a data set into K clusters by adopting a K-Means clustering algorithm, and randomly selecting data from each cluster as an instantiated subset;
standardizing the data set according to the Altman Z-score model;
and performing data balancing on the data set with the DSSTE algorithm.
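The cleaning and standardization steps of claim 2 can be illustrated with a minimal, stdlib-only Python sketch. The function names `deduplicate` and `z_score` are illustrative, not from the patent, and the K-Means subset sampling and DSSTE balancing steps are treated separately in the later claims.

```python
import math

def deduplicate(rows):
    """Drop null, malformed, and repeated records, keeping one valid copy of each."""
    seen, clean = set(), []
    for row in rows:
        # Skip null records and records containing null fields.
        if row is None or any(v is None for v in row):
            continue
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            clean.append(row)
    return clean

def z_score(column):
    """Standardize one feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column] if std else [0.0] * len(column)
```

A constant column has zero variance, so `z_score` maps it to all zeros rather than dividing by zero.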
3. The XGBoost and LGBM-based intrusion detection method of claim 2, wherein said data balancing the data set using DSSTE algorithm comprises the steps of:
dividing the imbalanced training set into a near-neighbor set and a far-neighbor set with the ENN algorithm, defining the near-neighbor set as difficult samples and the far-neighbor set as simple samples;
compressing the majority-class samples among the difficult samples with the K-Means clustering algorithm, replacing each cluster with its cluster center;
and augmenting the minority-class samples among the difficult samples, then combining the simple samples, the augmented difficult samples, and the compressed difficult samples into a new training set.
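The ENN split that opens claim 3 can be sketched as follows; this is a simplified, stdlib-only illustration (the name `enn_split` and the default `k=3` are assumptions, not from the patent): a sample whose k nearest neighbors mostly carry a different label is treated as difficult, otherwise as simple.

```python
import math

def enn_split(samples, labels, k=3):
    """ENN-style split: samples whose k nearest neighbours mostly disagree
    with their own label are 'difficult'; the rest are 'simple'."""
    difficult, simple = [], []
    for i, (x, y) in enumerate(zip(samples, labels)):
        # Indices of the k nearest other samples by Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(x, samples[j]))[:k]
        disagree = sum(labels[j] != y for j in neighbours)
        (difficult if disagree > k // 2 else simple).append((x, y))
    return difficult, simple
```

On a toy set with two well-separated clusters, only a mislabeled point sitting inside the opposite cluster lands in the difficult set.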
4. The XGboost and LGBM-based intrusion detection method according to claim 3, wherein the hyperparameter K of the K-Means clustering algorithm is optimized by a BO-GP algorithm.
5. The XGBoost and LGBM-based intrusion detection method of claim 1, wherein said feature selection on the preprocessed data set according to the information gain and FCBF algorithms comprises the steps of:
calculating the IG value of each feature with the information gain algorithm, normalizing each IG value to a value between 0 and 1, ranking the features by IG value, selecting features from the largest value downward until a first threshold is reached, and removing the unselected features;
and calculating the pairwise similarity of features with the FCBF algorithm; when a similarity value is greater than a second threshold, comparing the IG values of the two features and eliminating the feature with the lower IG value; and repeating this step until the similarity of every pair of features in the data set is less than the second threshold.
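The information-gain ranking step of claim 5 can be sketched in plain Python. This is an assumption-laden illustration: `select_features` and the default threshold are invented names, and the subsequent FCBF redundancy pass (pairwise symmetrical uncertainty against the second threshold) is only noted in the docstring rather than implemented.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a discrete value sequence."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def info_gain(feature, target):
    """IG(target; feature) = H(target) - H(target | feature)."""
    n = len(target)
    cond = 0.0
    for v in set(feature):
        subset = [target[i] for i, f in enumerate(feature) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(target) - cond

def select_features(features, target, first_threshold=0.5):
    """Rank features by IG normalized to [0, 1] and keep those at or above
    the first threshold; an FCBF redundancy pass would then drop, of any
    highly similar pair, the feature with the lower IG."""
    gains = {name: info_gain(col, target) for name, col in features.items()}
    top = max(gains.values()) or 1.0  # guard against all-zero gains
    return sorted((n for n, g in gains.items() if g / top >= first_threshold),
                  key=lambda n: -gains[n])
```

A feature that perfectly predicts the target has IG equal to the target's entropy, while pure noise has IG near zero, so only the former survives the threshold.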
6. The XGboost and LGBM-based intrusion detection method according to claim 5, wherein the first and second thresholds are optimized using a BO-GP algorithm.
7. The XGboost and LGBM-based intrusion detection method according to claim 1, wherein the classification of data using XGboost and LGBM classifiers comprises the steps of:
dividing the data set into a training set and a test set, wherein the training set contains 70% of the data samples of the data set and the test set contains the remaining 30%;
and iteratively training the model on the training set with ten-fold cross-validation, wherein in each iteration of the ten-fold cross-validation 90% of the original training set is used for model training and the remaining 10% serves as the validation set for model testing.
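The 70/30 split and ten-fold cross-validation of claim 7 can be sketched with the standard library alone; `split_train_test` and `ten_fold` are illustrative helper names, not from the patent, and a real pipeline would typically use a library splitter with stratification.

```python
import random

def split_train_test(data, train_frac=0.7, seed=0):
    """Shuffle and split the data set 70/30 into training and test sets."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    cut = int(len(data) * train_frac)
    return [data[i] for i in idx[:cut]], [data[i] for i in idx[cut:]]

def ten_fold(training_set):
    """Yield (fit, validate) pairs for ten-fold cross-validation: each fold
    holds out 10% of the training set and fits on the remaining 90%."""
    n = len(training_set)
    for k in range(10):
        lo, hi = k * n // 10, (k + 1) * n // 10
        yield training_set[:lo] + training_set[hi:], training_set[lo:hi]
```

Across the ten folds, every training sample appears in exactly one validation slice, so each sample is validated once and fitted on nine times.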
8. The XGBoost and LGBM-based intrusion detection method according to claim 1, wherein optimizing the XGBoost classifier and the LGBM classifier and selecting, according to the classifier performance comparison result, the better-performing classifier to output the classification result comprises:
carrying out hyper-parameter optimization on the XGboost classifier and the LGBM classifier by adopting a BO-TPE algorithm;
and calculating and comparing the accuracy of the LGBM classifier and the XGBoost classifier, together with the time each takes to reach its best accuracy; when the accuracy of the XGBoost classifier is greater than that of the LGBM classifier and the XGBoost classifier reaches its best accuracy in less time, selecting the XGBoost classifier to output the classification result; otherwise, selecting the LGBM classifier to output the classification result.
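The selection rule of claim 8 reduces to a simple two-criterion comparison; a minimal sketch (the function name and argument names are illustrative):

```python
def pick_classifier(xgb_accuracy, xgb_seconds, lgbm_accuracy, lgbm_seconds):
    """Choose XGBoost only when it is BOTH more accurate and faster to
    reach its best accuracy; in every other case fall back to LGBM."""
    if xgb_accuracy > lgbm_accuracy and xgb_seconds < lgbm_seconds:
        return "XGBoost"
    return "LGBM"
```

Note that the rule is conjunctive: a more accurate but slower XGBoost still loses to LGBM.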
9. A computer-readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement the XGBoost and LGBM based intrusion detection method according to any one of claims 1 to 8.
10. An apparatus comprising a processor and a memory having at least one instruction, at least one program, set of codes, or set of instructions stored therein, which is loaded and executed by the processor to implement the XGBoost and LGBM based intrusion detection method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211391189.2A CN115600194A (en) | 2022-11-08 | 2022-11-08 | Intrusion detection method, storage medium and device based on XGboost and LGBM |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115600194A true CN115600194A (en) | 2023-01-13 |
Family
ID=84853117
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211391189.2A Pending CN115600194A (en) | 2022-11-08 | 2022-11-08 | Intrusion detection method, storage medium and device based on XGboost and LGBM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115600194A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116319427A (en) * | 2023-05-22 | 2023-06-23 | 北京国信蓝盾科技有限公司 | Safety evaluation method, device, electronic equipment and medium based on equipment network |
CN117035241A (en) * | 2023-10-08 | 2023-11-10 | 广东省农业科学院植物保护研究所 | Intelligent winged insect trap management method, system and medium based on insect condition prediction |
CN117997652A (en) * | 2024-04-03 | 2024-05-07 | 江西师范大学 | Vehicle intrusion detection method and device based on ensemble learning |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108768946A (en) * | 2018-04-27 | 2018-11-06 | 中山大学 | A kind of Internet Intrusion Detection Model based on random forests algorithm |
CN113468555A (en) * | 2021-06-07 | 2021-10-01 | 厦门国际银行股份有限公司 | Method, system and device for identifying client access behavior |
CN114422262A (en) * | 2022-02-21 | 2022-04-29 | 上海应用技术大学 | Industrial control network intrusion detection model construction method based on automatic machine learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10262410B2 (en) | Methods and systems for inspecting goods | |
CN115600194A (en) | Intrusion detection method, storage medium and device based on XGboost and LGBM | |
CN110633725B (en) | Method and device for training classification model and classification method and device | |
US20070005556A1 (en) | Probabilistic techniques for detecting duplicate tuples | |
CN111695597B (en) | Credit fraud group identification method and system based on improved isolated forest algorithm | |
CN111612041A (en) | Abnormal user identification method and device, storage medium and electronic equipment | |
CN111507385B (en) | Extensible network attack behavior classification method | |
CN112437053B (en) | Intrusion detection method and device | |
CN112926592B (en) | Trademark retrieval method and device based on improved Fast algorithm | |
CN112036476A (en) | Data feature selection method and device based on two-classification service and computer equipment | |
CN112632609A (en) | Abnormality detection method, abnormality detection device, electronic apparatus, and storage medium | |
CN115801374A (en) | Network intrusion data classification method and device, electronic equipment and storage medium | |
CN115577357A (en) | Android malicious software detection method based on stacking integration technology | |
CN111753299A (en) | Unbalanced malicious software detection method based on packet integration | |
CN113569920B (en) | Second neighbor anomaly detection method based on automatic coding | |
CN113962324A (en) | Picture detection method and device, storage medium and electronic equipment | |
CN114897764A (en) | Pulmonary nodule false positive elimination method and device based on standardized channel attention | |
CN111428064B (en) | Small-area fingerprint image fast indexing method, device, equipment and storage medium | |
CN111598116B (en) | Data classification method, device, electronic equipment and readable storage medium | |
CN112418313B (en) | Big data online noise filtering system and method | |
CN115688101A (en) | Deep learning-based file classification method and device | |
CN115842645A (en) | UMAP-RF-based network attack traffic detection method and device and readable storage medium | |
KR101085066B1 (en) | An Associative Classification Method for detecting useful knowledge from huge multi-attributes dataset | |
CN111931229B (en) | Data identification method, device and storage medium | |
CN113656354A (en) | Log classification method, system, computer device and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||