CN115600194A - Intrusion detection method, storage medium and device based on XGboost and LGBM - Google Patents


Info

Publication number
CN115600194A
Authority
CN
China
Prior art keywords
classifier
lgbm
xgboost
data
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211391189.2A
Other languages
Chinese (zh)
Inventor
刘兰
吴亚峰
陈桂铭
胡峻涵
陈子力
林子萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University filed Critical Guangdong Polytechnic Normal University
Priority to CN202211391189.2A priority Critical patent/CN115600194A/en
Publication of CN115600194A publication Critical patent/CN115600194A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/55: Detecting local intrusion or implementing counter-measures
    • G06F 21/552: Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Abstract

The invention discloses an intrusion detection method, a storage medium and a device based on XGBoost and LGBM. The method comprises the following steps: acquiring a data set and preprocessing it; performing feature selection on the preprocessed data set using information gain and the FCBF algorithm; classifying the data with an XGBoost classifier and an LGBM classifier; and optimizing the XGBoost and LGBM classifiers and, based on a comparison of their performance, selecting the better-performing classifier to output the classification result. Preprocessing the data set alleviates the sample-imbalance problem and improves the quality of the data set; continually optimizing the classifiers improves their classification performance, and selecting the better-performing classifier to output the result improves the detection accuracy of intrusion detection and reduces its false-alarm rate.

Description

Intrusion detection method, storage medium and device based on XGboost and LGBM
Technical Field
The invention relates to the technical field of network security, in particular to an intrusion detection method, a storage medium and equipment based on XGboost and LGBM.
Background
With the increasing presence of the internet in modern life, a large number of devices have become interconnected and interact over networks; at the same time, a large number of device security problems have arisen, so the security of cyberspace has received much attention. Intrusion Detection Systems (IDS), used to efficiently detect various malicious attacks on a network, are among the most critical systems for maintaining cyberspace security. From a Machine Learning (ML) perspective, an IDS can be defined as a system that classifies network traffic; a simple model is a binary classifier that distinguishes normal from malicious network traffic, thereby detecting intrusion traffic. With recent advances in ML research, many studies have shown that ML algorithms can be designed to implement IDS.
When a data set is preprocessed and classified, machine learning methods can obtain good results. Although there has been some research in the field of traffic anomaly detection, machine-learning approaches still face problems: first, few existing works propose a truly effective solution to the sample-imbalance problem in intrusion detection; second, detection accuracy and false-alarm rates often cannot meet product requirements, so such methods see little practical application.
Therefore, how to balance the samples and improve detection accuracy is a problem to be solved in network intrusion detection.
Disclosure of Invention
To overcome these technical defects, the invention provides an intrusion detection method, a storage medium and a device based on XGBoost and LGBM that can improve the accuracy of network intrusion detection.
In order to solve the problems, the invention is realized according to the following technical scheme:
in a first aspect, the present invention provides an intrusion detection method based on XGBoost and LGBM, comprising the steps of:
acquiring a data set, and performing data preprocessing on the data set;
performing feature selection on the preprocessed data set according to the information gain and FCBF algorithm;
the XGboost classifier and the LGBM classifier are adopted to classify the data of the data set;
and optimizing the XGBoost classifier and the LGBM classifier, and selecting the better-performing classifier, according to a comparison of the classifiers' performance, to output the classification result.
As an improvement of the above solution, acquiring the data set and preprocessing it includes the steps of:
deleting null values, incorrectly formatted values and duplicate values from the data set, so that only one valid copy of each value is retained;
dividing data in a data set into K clusters by adopting a K-Means clustering algorithm, and randomly selecting data from each cluster as an instantiated subset;
the data set is standardized according to an Altman Z-score model;
and carrying out data balancing on the data set by adopting the DSSTE algorithm.
As an improvement of the above solution, the data balancing of the data set by using the DSSTE algorithm includes the steps of:
dividing the imbalanced training set into a near-neighbor set and a far-neighbor set with the ENN algorithm, defining the near-neighbor set as difficult samples and the far-neighbor set as simple samples;
compressing the majority-class samples among the difficult samples with a K-Means clustering algorithm, replacing each cluster with its cluster center;
and amplifying the minority-class samples among the difficult samples, then combining the simple samples, the amplified difficult samples and the compressed difficult samples into a new training set.
As an improvement of the scheme, the hyperparameter K of the K-Means clustering algorithm is optimized by a BO-GP algorithm.
As an improvement of the above solution, the feature selection of the preprocessed data set according to the information gain and FCBF algorithms includes the steps of:
calculating the IG value of each feature with the information gain algorithm, normalizing each IG value to a value between 0 and 1, sorting the IG values of all features, selecting features in descending order of IG value until a first threshold is reached, and removing the unselected features;
and calculating the pairwise similarity of the features with the FCBF algorithm: if a similarity value exceeds a second threshold, comparing the IG values of the two features and removing the one with the lower IG value, and repeating this step until the similarity between any two features in the data set is below the second threshold.
As an improvement of the scheme, the first threshold and the second threshold are optimized by adopting a BO-GP algorithm.
As an improvement of the above scheme, classifying the data using the XGBoost and LGBM classifiers includes the steps of:
dividing the data set into a training set and a testing set, wherein the training set comprises 70% of data samples of the data set, and the testing set comprises 30% of data samples of the data set;
and iteratively training the model on the training set with ten-fold cross-validation, wherein in each iteration of the ten-fold cross-validation 90% of the original training set is used for model training and the remaining 10% serves as a validation set for model testing.
As an improvement of the above scheme, optimizing the XGBoost and LGBM classifiers and selecting the better-performing classifier, according to a comparison of the classifiers' performance, to output the classification result includes the steps of:
carrying out hyper-parameter optimization on the XGboost classifier and the LGBM classifier by adopting a BO-TPE algorithm;
and calculating the accuracy of the LGBM and XGBoost classifiers and the time each takes to reach its best accuracy, then comparing them: when the XGBoost classifier's accuracy is higher than the LGBM classifier's and it reaches its best accuracy in less time, the XGBoost classifier is selected to output the classification result; otherwise, the LGBM classifier is selected.
In a second aspect, the present invention provides a computer-readable storage medium having at least one instruction, at least one program, code set, or set of instructions stored therein, which is loaded and executed by a processor to implement the XGBoost and LGBM-based intrusion detection method according to the first aspect.
In a third aspect, the present invention provides an apparatus comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, code set, or instruction set, which is loaded and executed by the processor to implement the XGBoost and LGBM-based intrusion detection method according to the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
by preprocessing the data set, the sample-imbalance problem of the data set is alleviated and the quality of the data set is improved; by continually optimizing the classifiers, their classification performance is improved, and selecting the better-performing classifier to output the classification result improves the detection accuracy of intrusion detection and reduces its false-alarm rate.
Drawings
Embodiments of the invention are described in further detail below with reference to the attached drawing figures, wherein:
FIG. 1 is a schematic flowchart of an intrusion detection method based on XGBoost and LGBM in one embodiment;
FIG. 2 is a schematic flowchart of step S100 in one embodiment;
FIG. 3 is a schematic flowchart of step S140 in one embodiment;
FIG. 4 is a schematic flowchart of step S200 in one embodiment;
FIG. 5 is a schematic flowchart of step S300 in one embodiment;
FIG. 6 is a schematic flowchart of step S400 in one embodiment.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
It should be noted that the sequence numbers mentioned herein, such as S100, S200, etc., are merely used to distinguish the steps and do not mean that the steps must be executed strictly in numerical order.
In one embodiment, as shown in fig. 1, there is provided an XGBoost and LGBM-based intrusion detection method, including the following steps:
s100: acquiring a data set, and performing data preprocessing on the data set;
in traffic intrusion detection analysis, the object of analysis is the data in a data set. That data may be incomplete (e.g., some attribute values are uncertain or missing), noisy and inconsistent (e.g., one attribute has different names in different tables), of different scales, and sample-imbalanced. Analyzing unprocessed data directly yields results that are not necessarily accurate, and may also be inefficient, so the data must first be preprocessed to improve its quality and thus the efficiency and quality of the analysis. Preprocessing methods include data standardization, mapping to a uniform distribution on [0, 1], data normalization, data binarization, nonlinear transformation, feature encoding, handling of missing values, and so on. Here, data preprocessing first samples the data set and then applies sample balancing, standardization, normalization, binarization and other processing targeted at the data set's problems, improving the quality of the data.
S200: performing feature selection on the preprocessed data set according to the information gain and FCBF algorithm;
after data preprocessing, high-quality data is obtained; feature engineering is then needed to select meaningful features to feed into the machine-learning algorithms and models for training. Feature selection on the data set uses information gain together with FCBF. Information gain is computed from information entropy and represents the degree to which uncertainty is eliminated; features are ranked by the size of their information gain. Information content decreases monotonically with probability: the smaller the probability, the larger the information content. The FCBF algorithm (Fast Correlation-Based Filter), a feature selection algorithm based on fast filtering with Symmetric Uncertainty (SU), keeps, within each redundant feature pair, the feature more relevant to the target and eliminates the less relevant one, then uses the more relevant features to screen the remaining ones, reducing time complexity.
S300: classifying data of the data set by adopting an XGboost classifier and an LGBM classifier;
the XGBoost classifier and the LGBM classifier both have excellent classification performance and are used to classify the data. Their initial hyper-parameters are set to the classifiers' default values: the defaults of the XGBoost hyper-parameters are 0.3, 1, 100 and 6, and the defaults of the six LGBM hyper-parameters are 0.1, 100, 255, 3, 31 and 1.0.
S400: and optimizing the XGBoost classifier and the LGBM classifier, and selecting the better-performing classifier, according to a comparison of the classifiers' performance, to output the classification result.
Specifically, to obtain better classification output, the hyper-parameters of the XGBoost and LGBM classifiers are continually optimized so that both classifiers reach strong classification performance; evaluation indexes are then computed from each classifier's classification results, the indexes of the two classifiers are compared, and the better one is selected as the classifier that outputs the classification result.
In one embodiment, as shown in FIG. 2, acquiring the data set and preprocessing it comprises the following steps:
s110: deleting null values, incorrectly formatted values and duplicate values from the data set, so that only one valid copy of each value is retained;
specifically, preprocessing deletes null values and incorrectly formatted values from the data set, and while traversing the data set retains one valid copy of each duplicated value and discards the rest.
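As a hypothetical sketch of step S110 (the patent specifies no record schema, so the field names below are invented), the cleaning pass might look like:

```python
# Hypothetical sketch of step S110. The patent does not specify a record
# schema, so the field names ("src", "bytes") are invented for illustration.

def clean_records(records):
    """Drop null and badly formatted 'bytes' fields, then de-duplicate rows."""
    seen = set()
    cleaned = []
    for row in records:
        value = row.get("bytes")
        if value is None:
            continue                       # null value
        try:
            row = dict(row, bytes=float(value))
        except (TypeError, ValueError):
            continue                       # incorrectly formatted value
        key = tuple(sorted(row.items()))
        if key in seen:
            continue                       # duplicate: keep only the first copy
        seen.add(key)
        cleaned.append(row)
    return cleaned

raw = [
    {"src": "10.0.0.1", "bytes": "120"},
    {"src": "10.0.0.1", "bytes": "120"},   # repeated value
    {"src": "10.0.0.2", "bytes": None},    # null value
    {"src": "10.0.0.3", "bytes": "oops"},  # incorrectly formatted value
]
cleaned = clean_records(raw)
```

Of the four raw rows above, only the first survives: its duplicate, the null row and the malformed row are all dropped.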
S120: dividing data in the data set into K clusters by adopting a K-Means clustering algorithm, and randomly selecting data from each cluster as an instantiated subset;
specifically, useless data is removed using K-Means clustering, which divides the data in the data set into K clusters based on the Euclidean, Manhattan and Mahalanobis distances; 10% of the data is then randomly selected from each cluster as an instantiated subset. K-Means aims to minimize the sum of squared distances between all data points and the centroids of their clusters, expressed as:
J = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} ||x_i − u_k||²
where (x_1, ..., x_n) is the data matrix; u_k, the centroid of cluster C_k, is the mean of all samples in C_k; and n_k is the total number of sample points in cluster C_k.
The Euclidean distance is defined as follows:
d(x, y) = sqrt(Σ_{i=1}^{n} (x(i) − y(i))²)
where an n-dimensional Euclidean space is a set of points, each of which can be represented as (x(1), x(2), ..., x(n)), in which the real number x(i) (i = 1, 2, ..., n) is called the i-th coordinate of x.
The Manhattan distance is defined as follows:
c = |x_1 − x_2| + |y_1 − y_2|
where (x_i, y_i) are the coordinate values of the points.
The Mahalanobis distance between data points x and y is defined as follows:
d(x, y) = sqrt((x − y)^T Σ⁻¹ (x − y))
where Σ is the covariance matrix of the multidimensional random variable.
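The three distances above can be sketched as follows; the covariance matrix for the Mahalanobis distance is taken to be the identity purely for illustration, in which case it reduces to the Euclidean distance:

```python
import math

# Minimal sketch of the three distances used by the K-Means sampling step.
# For the Mahalanobis distance the covariance matrix is taken as the identity
# purely for illustration, in which case it reduces to the Euclidean distance;
# a real implementation would estimate the covariance from the data.

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def mahalanobis_identity(x, y):
    # With Sigma = I, sqrt((x - y)^T Sigma^-1 (x - y)) equals euclidean(x, y).
    return euclidean(x, y)

p, q = (0.0, 0.0), (3.0, 4.0)   # euclidean 5.0, manhattan 7.0
```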
S130: the data set is standardized according to an Altman Z-score model;
Z-score standardization converts data of different magnitudes to a unified Z-score so they can be compared. A feature value x is standardized as follows:
z = (x − μ) / σ
where x is the original feature value, and μ and σ are the mean and standard deviation of that feature, respectively.
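A minimal sketch of the Z-score standardization of step S130, using the population standard deviation:

```python
import statistics

# Sketch of the Z-score standardization of step S130: each value of a feature
# is rescaled by the feature's mean and (population) standard deviation.

def z_scores(values):
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

scores = z_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # mean 5, stdev 2
```

The standardized values have mean 0; for this input the first value maps to −1.5 and the last to 2.0.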
S140: and carrying out data balancing on the data set by adopting the DSSTE algorithm.
The DSSTE (Difficult Set Sampling Technique) method addresses the imbalanced data set: the DSSTE algorithm reduces the imbalance of the original training set and provides targeted data augmentation for the minority classes to be learned, letting the classifier better learn the class differences during training and improving classification performance. Steps S130 and S140 aim to reduce the weight of the data set without losing important data information.
Specifically, as shown in fig. 3, the data balancing of the data set by using the DSSTE algorithm includes the following steps:
s141: dividing the imbalanced training set into a near-neighbor set and a far-neighbor set with the ENN algorithm, defining the near-neighbor set as difficult samples and the far-neighbor set as simple samples;
s142: compressing the majority-class samples among the difficult samples with a K-Means clustering algorithm, replacing each cluster with its cluster center;
the data is divided into K clusters with the K-Means method, and 10% of the data in each cluster is then randomly extracted. Replacing each cluster with this 10% of its data (i.e., the cluster centers) reduces the amount of data.
s143: and amplifying the minority-class samples among the difficult samples, then combining the simple samples, the amplified difficult samples and the compressed difficult samples into a new training set.
In one embodiment, the hyper-parameter K of the K-Means clustering algorithm is optimized by a BO-GP algorithm.
Specifically, the BO algorithm determines the next hyper-parameter configuration to evaluate based on the results of previous evaluations. In BO, a surrogate model is fitted to all data points of the objective function tested so far; GP is the surrogate model of the BO-GP algorithm, and its predictions follow a Gaussian distribution:
p(y | x, D) = N(μ(x), σ²(x))
where D is the hyper-parameter configuration space, y = f(x) is the objective function value for each hyper-parameter configuration, and μ and σ² are the mean and covariance.
In one embodiment, as shown in fig. 4, the step S200 includes the following steps:
s210: calculating the IG value of each feature with the information gain algorithm, normalizing each IG value to a value between 0 and 1, sorting the IG values of all features, selecting features in descending order of IG value until a first threshold is reached, and removing the unselected features;
specifically, the IG value of each feature is calculated with the Information Gain algorithm, the IG values are normalized to values between 0 and 1 and sorted, and features are selected from high to low until a first threshold α is reached; the remaining features, whose normalized IG values sum to less than 1 − α, are all removed. The first threshold α is optimized with validation accuracy as the objective function. The IG value is expressed as follows:
IG(T|X) = H(T) − H(T|X)
where H(T) is the entropy of the target variable T and H(T|X) is the conditional entropy of T given X.
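The IG computation can be sketched for a discrete feature against a discrete target as follows:

```python
import math
from collections import Counter

# Sketch of the information-gain computation IG(T|X) = H(T) - H(T|X) for a
# discrete feature X against a discrete target T.

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, target):
    n = len(target)
    conditional = 0.0
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(target) - conditional

# A feature that perfectly determines the target has IG equal to H(T) = 1 bit.
protocol = ["tcp", "tcp", "udp", "udp"]
label = ["normal", "normal", "attack", "attack"]
ig = information_gain(protocol, label)
```

A feature independent of the target (e.g. one alternating regardless of the label) would score an IG of 0.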
S220: and calculating the similarity of every two features by adopting an FCBF algorithm, comparing the two feature IG values subjected to similarity calculation if the similarity value is greater than a second threshold value, removing the features with low IG values, and repeating the step until the similarity of any two features in the data set is less than the second threshold value.
Specifically, the FCBF algorithm is used to calculate the similarity of the features obtained by the information gain processing in step S210, two of the features obtained in step S210 are arbitrarily selected for similarity calculation, and if the similarity value SU is greater than the second threshold value β, the feature with a low IG value is deleted. The feature extraction is carried out by adopting a KPCA algorithm, and the extracted feature quantity and the kernel attribute of the KPCA algorithm are obtained by optimizing BO-GP by using verification accuracy as a target function. And performing similarity calculation on all the features until the similarity of any two features in the data set is smaller than a second threshold value, and finishing feature selection. The similarity value SU has the following formula:
SU(X, Y) = 2 × IG(X|Y) / (H(X) + H(Y))
where SU(X, Y) indicates the similarity between feature X and feature Y.
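The SU computation can be sketched as follows for discrete features; fully redundant features score 1 and independent ones score 0, so thresholding SU flags redundant pairs:

```python
import math
from collections import Counter

# Sketch of the symmetric uncertainty behind the FCBF step:
# SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)). Fully redundant features score 1,
# independent features score 0, so thresholding SU flags redundant pairs.

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def info_gain(x, y):
    conditional = 0.0
    for value in set(y):
        subset = [a for a, b in zip(x, y) if b == value]
        conditional += len(subset) / len(x) * entropy(subset)
    return entropy(x) - conditional

def symmetric_uncertainty(x, y):
    return 2.0 * info_gain(x, y) / (entropy(x) + entropy(y))

f1 = [0, 0, 1, 1]
f2 = [1, 1, 0, 0]   # a relabeling of f1: fully redundant with it
f3 = [0, 1, 0, 1]   # independent of f1
```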
In one embodiment, the first threshold α and the second threshold β are optimized by using a BO-GP algorithm.
In one embodiment, as shown in fig. 5, the step S300 includes the following steps:
s310: dividing the data set into a training set and a testing set, wherein the training set comprises 70% of data samples of the data set, and the testing set comprises 30% of data samples of the data set;
s320: and iteratively training the model on the training set with ten-fold cross-validation, wherein in each iteration of the ten-fold cross-validation 90% of the original training set is used for model training and the remaining 10% serves as a validation set for model testing.
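The 70/30 split plus ten-fold index generation described in steps S310 and S320 can be sketched as:

```python
# Sketch of the split in steps S310-S320: a 70/30 train/test split, then
# ten-fold cross-validation index generation inside the training set (each
# iteration trains on 90% of the training data and validates on the rest).

def ten_fold_indices(n):
    """Yield (train_idx, val_idx) pairs over n training samples."""
    fold = n // 10
    indices = list(range(n))
    for k in range(10):
        val = indices[k * fold:(k + 1) * fold]
        train = indices[:k * fold] + indices[(k + 1) * fold:]
        yield train, val

n_total = 1000
n_train = int(n_total * 0.7)   # 700 samples for training, 300 for testing
folds = list(ten_fold_indices(n_train))
```

Each of the ten folds validates on a disjoint tenth of the training set, and together the validation folds cover it exactly once.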
The classification model XGBoost is an optimized version of the Gradient Boosting Machine (GBM) with improved speed and prediction performance. Its objective function is as follows:
Obj(θ) = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_k Ω(f_k)
The objective function includes two parts: the first part is the loss function and the second part is a regularization term. Obj(θ) denotes the objective function, n is the number of predictions, l(y_i, ŷ_i) is the training error of the i-th sample, and Ω is the regularization function.
The LightGBM classifier combines Gradient-based One-Side Sampling (GOSS) and the Exclusive Feature Bundling (EFB) algorithm to shorten training time and improve on the training performance of the XGBoost algorithm.
In one embodiment, as shown in fig. 6, the step S400 includes the following steps:
s410: carrying out hyper-parameter optimization on the XGboost classifier and the LGBM classifier by adopting a BO-TPE algorithm;
specifically, the BO-TPE algorithm is used for optimizing hyper-parameters of XGboost, such as "learning _ rate", "sampling _ byte", "subsample", "n _ estimators", "max _ depth", and the hyper-parameters of LGBM algorithm, such as "learning _ rate", "n _ estimators", "max _ bin", "num _ leaves", "max _ depth", and "feature _ fraction". Wherein the optimal values of "learning _ rate", "sampling _ byte", "subsample", "n _ estimators", "max _ depth", "max _ bin", "num _ leaves", "feature _ fraction" are respectively found in the following ranges [0.02-0.2], [0.1-0.5], [0.8-2], [200-600], [3-7], [400-800], [10-50], [0.1-0.9 ].
The BO-TPE hyper-parameter optimization algorithm models the following densities:
p(x | y) = l(x) if y < y*, and g(x) if y ≥ y*
where l(x) and g(x) represent the probability densities of the next hyper-parameter value lying in the well-performing and poorly-performing regions, respectively; BO-TPE obtains the optimal hyper-parameters by maximizing the ratio l(x)/g(x), and y* is the threshold distinguishing relatively good results from bad ones.
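A toy illustration of the TPE selection rule (not Hyperopt's actual implementation): past trials are split at the threshold y*, crude histogram densities stand in for l(x) and g(x), and the next candidate is the one maximizing l(x)/g(x):

```python
# Toy illustration of the TPE selection rule (not Hyperopt's implementation):
# past trials are split at the score threshold y*, crude histogram densities
# stand in for l(x) and g(x), and the candidate maximizing l(x)/g(x) wins.

def tpe_pick(trials, candidates, y_star, bins=5, lo=0.0, hi=1.0):
    """trials: list of (x, loss) pairs; returns the most promising candidate."""
    def hist(xs):
        counts = [1.0] * bins                       # add-one smoothing
        for x in xs:
            counts[min(int((x - lo) / (hi - lo) * bins), bins - 1)] += 1.0
        total = sum(counts)
        return [c / total for c in counts]

    good = [x for x, y in trials if y < y_star]     # well-performing region
    bad = [x for x, y in trials if y >= y_star]     # poorly-performing region
    l, g = hist(good), hist(bad)

    def ratio(x):
        i = min(int((x - lo) / (hi - lo) * bins), bins - 1)
        return l[i] / g[i]

    return max(candidates, key=ratio)

# Good (low-loss) trials cluster near x = 0.1, bad ones near x = 0.9, so the
# candidate closest to the good region should be picked.
trials = [(0.10, 0.1), (0.15, 0.2), (0.12, 0.15), (0.90, 0.9), (0.85, 0.8)]
best = tpe_pick(trials, candidates=[0.1, 0.5, 0.9], y_star=0.5)
```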
S420: and calculating the accuracy of the LGBM classifier and the XGboost classifier and the time length of the optimal accuracy, comparing, when the accuracy of the XGboost classifier is greater than the accuracy of the LGBM classifier and the time length of the XGboost classifier with the optimal accuracy is less than that of the LGBM classifier, selecting the XGboost classifier to output the classification result, otherwise, selecting the LGBM classifier to output the classification result.
Specifically, the performance of the intrusion detection algorithm is evaluated comprehensively using Accuracy (AC), Precision (PR), Recall (RE) and F1-score (F1) as evaluation indexes, calculated as follows:
AC = (TP + TN) / (TP + TN + FP + FN)
PR = TP / (TP + FP)
RE = TP / (TP + FN)
F1 = (2 × PR × RE) / (PR + RE)
wherein: TP represents samples predicted to be positive and truly positive, FP represents samples predicted to be positive and truly negative, TN represents samples predicted to be negative and truly negative, and FN represents samples predicted to be negative and truly positive.
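The four indexes follow directly from the confusion-matrix counts defined above; the counts used in the call below are invented purely for illustration:

```python
# The four evaluation indexes computed directly from confusion-matrix counts;
# the counts in the call below are invented for illustration.

def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

ac, pr, re, f1 = metrics(tp=40, fp=10, tn=45, fn=5)
```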
After hyper-parameter optimization, the accuracy (ACC) of the LGBM and XGBoost classifiers and the time at which each reaches its best accuracy are calculated and compared: when the XGBoost classifier's accuracy is higher than the LGBM classifier's and it reaches its best accuracy earlier, the optimized XGBoost classifier is selected to output the classification results; in all other cases, the LGBM classifier is selected.
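The selection rule of step S420 reduces to a small function; the accuracy and timing values in any call would come from the evaluation above (the numbers below are invented):

```python
# Sketch of the selection rule of step S420: XGBoost's output is used only
# when it is both more accurate and reaches its best accuracy sooner;
# otherwise the LGBM classifier's output is used. All values are invented.

def select_classifier(xgb_acc, xgb_time, lgbm_acc, lgbm_time):
    if xgb_acc > lgbm_acc and xgb_time < lgbm_time:
        return "xgboost"
    return "lgbm"
```

Note the asymmetry of the rule: any tie, or a win on only one of the two criteria, falls through to LGBM.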
By preprocessing the data set, the sample-imbalance problem of the data set is alleviated and the quality of the data set is improved; by continually optimizing the classifiers, their classification performance is improved, and selecting the better-performing classifier to output the classification result improves the detection accuracy of intrusion detection and reduces its false-alarm rate.
In one embodiment, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, causes the processor to implement the XGBoost and LGBM-based intrusion detection method provided in the first aspect.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable storage media, which may include computer readable storage media (or non-transitory media) and communication media (or transitory media).
The term computer-readable storage medium includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer-readable storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as is well known to those skilled in the art.
For example, the computer readable storage medium may be an internal storage unit of the network management device in the foregoing embodiment, for example, a hard disk or a memory of the network management device. The computer readable storage medium may also be an external storage device of the network management device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are equipped on the network management device.
In one embodiment, an apparatus is provided that includes a processor and a memory to store a computer program; the processor is configured to execute the computer program and implement the XGBoost and LGBM-based intrusion detection method provided by the first aspect of the present invention when executing the computer program.
It should be understood that the Processor may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An intrusion detection method based on XGBoost and LGBM, characterized by comprising the following steps:
acquiring a data set and performing data preprocessing on the data set;
performing feature selection on the preprocessed data set according to the information gain and FCBF algorithms;
classifying data of the data set by adopting an XGBoost classifier and an LGBM classifier; and
optimizing the XGBoost classifier and the LGBM classifier, and selecting the better-performing classifier to output the classification result according to a comparison of the two classifiers' performance.
2. The XGBoost and LGBM-based intrusion detection method of claim 1, wherein said acquiring a data set and performing data preprocessing on the data set comprises the following steps:
deleting null values, incorrectly formatted values, and duplicate values in the data set, so that only one valid copy of each record is retained;
dividing the data in the data set into K clusters by adopting a K-Means clustering algorithm, and randomly selecting data from each cluster to form an instantiated subset;
standardizing the data set according to the Altman Z-score model; and
performing data balancing on the data set by adopting the DSSTE algorithm.
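As an illustrative, non-normative sketch of the preprocessing in this claim (the function names and toy data are hypothetical, not from the patent), the deduplication and Z-score standardization steps could look like:

```python
import math

def deduplicate(records):
    """Keep one valid copy of each record: drop rows containing null values
    and exact duplicate rows."""
    seen, cleaned = set(), []
    for row in records:
        if any(v is None for v in row):
            continue  # row has a null value: discard
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned

def zscore_standardize(column):
    """Standardize one feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column] if std else [0.0] * len(column)

data = [[1.0, 2.0], [1.0, 2.0], [3.0, None], [5.0, 4.0]]
clean = deduplicate(data)                      # duplicates and null rows removed
col0 = zscore_standardize([r[0] for r in clean])
```

The K-Means instantiated-subset and DSSTE balancing steps would follow the same pattern on the cleaned, standardized data.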
3. The XGBoost and LGBM-based intrusion detection method of claim 2, wherein said performing data balancing on the data set by adopting the DSSTE algorithm comprises the following steps:
dividing the unbalanced training set into a near-neighbor set and a far-neighbor set by adopting the ENN algorithm, defining the near-neighbor set as difficult samples and the far-neighbor set as easy samples;
compressing the majority-class samples among the difficult samples by adopting the K-Means clustering algorithm, replacing each cluster with its cluster center; and
augmenting the minority-class samples among the difficult samples, and combining the easy samples, the augmented difficult samples, and the compressed difficult samples to form a new training set.
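A minimal sketch of the compress-and-recombine logic above, assuming the ENN split and K-Means cluster assignments have already been computed (all names, the replication-based augmentation, and the toy data are illustrative stand-ins, not the patent's exact procedure):

```python
from statistics import mean

def compress_majority(hard_majority, assignments, k):
    """Replace each K-Means cluster of difficult majority samples with its
    cluster center. assignments[i] is the precomputed cluster index of
    hard_majority[i]."""
    centers = []
    for c in range(k):
        cluster = [x for x, a in zip(hard_majority, assignments) if a == c]
        if cluster:
            centers.append([mean(dim) for dim in zip(*cluster)])
    return centers

def augment_minority(hard_minority, factor=2):
    """Augment difficult minority samples; plain replication here, whereas
    DSSTE generates scaled variants of the difficult minority samples."""
    return hard_minority * factor

easy = [[0.0, 0.0]]
hard_maj = [[1.0, 1.0], [1.0, 3.0], [5.0, 5.0]]
hard_min = [[9.0, 9.0]]
# New training set: easy samples + compressed majority + augmented minority.
new_train = easy + compress_majority(hard_maj, [0, 0, 1], k=2) + augment_minority(hard_min)
```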
4. The XGBoost and LGBM-based intrusion detection method according to claim 3, wherein the hyperparameter K of the K-Means clustering algorithm is optimized by the BO-GP algorithm.
5. The XGBoost and LGBM-based intrusion detection method of claim 1, wherein said performing feature selection on the preprocessed data set according to the information gain and FCBF algorithms comprises the following steps:
calculating the IG value of each feature according to the information gain algorithm, normalizing each IG value to a value between 0 and 1, ranking the IG values of all features, selecting features in descending order of IG value, stopping the selection once a first threshold is reached, and removing the unselected features; and
calculating the similarity of every pair of features by adopting the FCBF algorithm; when the similarity value is greater than a second threshold, comparing the IG values of the two features and eliminating the feature with the lower IG value; and repeating this step until the similarity of any two features in the data set is less than the second threshold.
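The two steps above can be sketched as follows. The feature names, IG values, and the stub similarity function are hypothetical placeholders; a real implementation would compute IG from the data and use FCBF's symmetric-uncertainty measure:

```python
def select_by_ig(ig, threshold):
    """Normalize IG values to [0, 1], rank descending, and keep features
    whose normalized IG meets the first threshold."""
    hi, lo = max(ig.values()), min(ig.values())
    norm = {f: (v - lo) / (hi - lo) if hi > lo else 0.0 for f, v in ig.items()}
    kept = [f for f, v in sorted(norm.items(), key=lambda kv: -kv[1]) if v >= threshold]
    return kept, norm

def remove_redundant(features, norm_ig, similarity, threshold):
    """FCBF-style redundancy elimination: for any pair whose similarity
    exceeds the second threshold, drop the feature with the lower IG value;
    repeat until no pair exceeds the threshold."""
    kept = list(features)
    changed = True
    while changed:
        changed = False
        for i in range(len(kept)):
            for j in range(i + 1, len(kept)):
                if similarity(kept[i], kept[j]) > threshold:
                    kept.remove(kept[j] if norm_ig[kept[j]] <= norm_ig[kept[i]] else kept[i])
                    changed = True
                    break
            if changed:
                break
    return kept

ig = {"dur": 0.9, "bytes": 0.6, "pkts": 0.58, "flag": 0.1}
selected, norm = select_by_ig(ig, threshold=0.3)
sim = lambda a, b: 0.95 if {a, b} == {"bytes", "pkts"} else 0.1   # stub similarity
final = remove_redundant(selected, norm, sim, threshold=0.8)
```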
6. The XGBoost and LGBM-based intrusion detection method according to claim 5, wherein the first and second thresholds are optimized using the BO-GP algorithm.
7. The XGBoost and LGBM-based intrusion detection method according to claim 1, wherein said classifying data by adopting the XGBoost classifier and the LGBM classifier comprises the following steps:
dividing the data set into a training set and a test set, wherein the training set comprises 70% of the data samples of the data set and the test set comprises the remaining 30%; and
iteratively training the model on the training set by ten-fold cross-validation: in each iteration of the ten-fold cross-validation, 90% of the original training set is used for model training and the remaining 10% is used as a validation set for model testing.
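The 70/30 split and the ten-fold index generation can be sketched in a few lines (a deterministic toy version; a real pipeline would shuffle with a fixed seed and stratify by class):

```python
def train_test_split(samples, train_ratio=0.7):
    """Deterministic 70/30 split of a sample list."""
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]

def ten_fold_indices(n, folds=10):
    """Yield (train_idx, val_idx) pairs: each fold holds out 10% of the
    training set for validation and trains on the remaining 90%."""
    idx = list(range(n))
    size = n // folds
    for f in range(folds):
        val = idx[f * size:(f + 1) * size]
        train = idx[:f * size] + idx[(f + 1) * size:]
        yield train, val

data = list(range(100))
train, test = train_test_split(data)          # 70 training / 30 test samples
splits = list(ten_fold_indices(len(train)))   # 10 (train, validation) folds
```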
8. The XGBoost and LGBM-based intrusion detection method according to claim 1, wherein said optimizing the XGBoost classifier and the LGBM classifier and selecting the better-performing classifier to output the classification result according to the classifier performance comparison comprises:
performing hyperparameter optimization on the XGBoost classifier and the LGBM classifier by adopting the BO-TPE algorithm; and
calculating and comparing the accuracy of the LGBM classifier and the XGBoost classifier and the time each takes to reach its optimal accuracy; when the accuracy of the XGBoost classifier is greater than that of the LGBM classifier and the XGBoost classifier reaches its optimal accuracy in less time than the LGBM classifier, selecting the XGBoost classifier to output the classification result; otherwise, selecting the LGBM classifier to output the classification result.
9. A computer-readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement the XGBoost and LGBM based intrusion detection method according to any one of claims 1 to 8.
10. An apparatus comprising a processor and a memory having at least one instruction, at least one program, set of codes, or set of instructions stored therein, which is loaded and executed by the processor to implement the XGBoost and LGBM based intrusion detection method according to any one of claims 1 to 8.
CN202211391189.2A 2022-11-08 2022-11-08 Intrusion detection method, storage medium and device based on XGboost and LGBM Pending CN115600194A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211391189.2A CN115600194A (en) 2022-11-08 2022-11-08 Intrusion detection method, storage medium and device based on XGboost and LGBM


Publications (1)

Publication Number Publication Date
CN115600194A true CN115600194A (en) 2023-01-13

Family

ID=84853117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211391189.2A Pending CN115600194A (en) 2022-11-08 2022-11-08 Intrusion detection method, storage medium and device based on XGboost and LGBM

Country Status (1)

Country Link
CN (1) CN115600194A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108768946A (en) * 2018-04-27 2018-11-06 中山大学 A kind of Internet Intrusion Detection Model based on random forests algorithm
CN113468555A (en) * 2021-06-07 2021-10-01 厦门国际银行股份有限公司 Method, system and device for identifying client access behavior
CN114422262A (en) * 2022-02-21 2022-04-29 上海应用技术大学 Industrial control network intrusion detection model construction method based on automatic machine learning


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116319427A (en) * 2023-05-22 2023-06-23 北京国信蓝盾科技有限公司 Safety evaluation method, device, electronic equipment and medium based on equipment network
CN117035241A (en) * 2023-10-08 2023-11-10 广东省农业科学院植物保护研究所 Intelligent winged insect trap management method, system and medium based on insect condition prediction
CN117035241B (en) * 2023-10-08 2024-01-23 广东省农业科学院植物保护研究所 Intelligent winged insect trap management method, system and medium based on insect condition prediction

Similar Documents

Publication Publication Date Title
US10262410B2 (en) Methods and systems for inspecting goods
CN110633725B (en) Method and device for training classification model and classification method and device
CN115600194A (en) Intrusion detection method, storage medium and device based on XGboost and LGBM
US20070005556A1 (en) Probabilistic techniques for detecting duplicate tuples
CN111612041A (en) Abnormal user identification method and device, storage medium and electronic equipment
CN111695597B (en) Credit fraud group identification method and system based on improved isolated forest algorithm
CN111507385B (en) Extensible network attack behavior classification method
CN112437053B (en) Intrusion detection method and device
CN112632609A (en) Abnormality detection method, abnormality detection device, electronic apparatus, and storage medium
CN112036476A (en) Data feature selection method and device based on two-classification service and computer equipment
CN115801374A (en) Network intrusion data classification method and device, electronic equipment and storage medium
CN115577357A (en) Android malicious software detection method based on stacking integration technology
CN111753299A (en) Unbalanced malicious software detection method based on packet integration
CN112926592B (en) Trademark retrieval method and device based on improved Fast algorithm
CN113962324A (en) Picture detection method and device, storage medium and electronic equipment
CN114897764A (en) Pulmonary nodule false positive elimination method and device based on standardized channel attention
CN111428064B (en) Small-area fingerprint image fast indexing method, device, equipment and storage medium
CN112418313B (en) Big data online noise filtering system and method
CN115688101A (en) Deep learning-based file classification method and device
KR101085066B1 (en) An Associative Classification Method for detecting useful knowledge from huge multi-attributes dataset
CN111931229B (en) Data identification method, device and storage medium
CN113656354A (en) Log classification method, system, computer device and readable storage medium
CN111598116B (en) Data classification method, device, electronic equipment and readable storage medium
CN112836747A (en) Eye movement data outlier processing method and device, computer equipment and storage medium
CN111581640A (en) Malicious software detection method, device and equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination