CN116541698A - XGBoost-based network anomaly intrusion detection method and system - Google Patents
- Publication number
- CN116541698A (application CN202210089952.XA)
- Authority
- CN
- China
- Prior art keywords
- model
- xgboost
- training
- processor
- network access
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an XGBoost-based network anomaly intrusion detection method and system. The method comprises: preprocessing network access data, in which, within the intrusion detection flow, the text is encoded, inputs are padded to the length of the longest text string, and the data set is divided into a training set and a test set; constructing a model, in which an XGBoost model based on decision trees is established; and training the model, in which optimization training is performed on the model. The invention combines weak classifiers through ensemble learning. Compared with traditional ensemble learning methods, it adds a regularization term on the complexity of the tree, uses a second-order approximation of the loss function rather than a first-order (linear) one, and performs feature selection at the internal nodes, which improves detection performance and efficiency on network access traffic.
Description
Technical Field
The invention relates to the field of network security, in particular to a network anomaly intrusion detection method and system based on XGBoost.
Background
Judging whether network traffic is abnormal usually requires analyzing a large amount of log information. Analysis based on preset rules lags behind, because it depends on studying and summarizing network attack methods that have already occurred, so it cannot identify new abnormal attack activity in a timely and effective way. Using machine learning to detect abnormal network traffic is therefore an active research field.
However, network abnormal-traffic data is high-dimensional and nonlinear, which limits the effectiveness of traditional machine learning algorithms. Overcoming the curse of dimensionality and improving the efficiency of the algorithm model are therefore important research topics at present.
Disclosure of Invention
In order to solve the above problems, the present invention aims to provide a network anomaly intrusion detection method based on ensemble learning, which combines multiple models into a stronger model.
The technical scheme of the invention is as follows:
A network anomaly intrusion detection method based on XGBoost comprises the following steps:
S1: Preprocess the network access data.
In the intrusion detection process, the text is encoded, inputs are padded to the length of the longest text string, and the data set is divided into a training set and a test set;
S2: Build the model: an XGBoost model based on decision trees is constructed;
S3: Train the model: optimization training is performed on the model.
Further, in S1, the data preprocessing is specifically as follows:
S11: Text encoding of the network access data. Each character in the network access data is mapped to a corresponding number according to a code table. The input length required by the model is set; inputs shorter than this length are padded, and inputs longer than it are truncated. The data set is denoted D = {(X, Y)}, where X = (x1, x2, …, xn) represents the mapped network access data and Y = (y1, y2, …, yn) represents the corresponding class labels, with 0 denoting normal access and 1 denoting a malicious attack;
S12: The data set is split into a training set and a test set in a fixed proportion; the training set is used for model training, and the test set is used for model selection.
Further, in S2, the model construction includes the following steps:
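S21: Construct each independent decision tree. At each node i, the feature dimension $d_i$ of the input vector x is compared with a threshold $t_i$, and x is sent to the left or right branch according to the result; the leaf nodes of the tree hold the model's predictions. For example, the first node tests whether $x_1$ is below the threshold $t_1$;
if so, the next node tests whether $x_2$ is below the threshold $t_2$, and if so, x enters the left leaf node. The spatial region corresponding to that leaf node is:
$$R_1 = \{x : x_1 \le t_1,\ x_2 \le t_2\}$$
Axis-parallel splits thus link each region to a predicted output. The average response associated with region $R_1$ is
$$w_1 = \frac{\sum_{n=1}^{N} y_n\, \mathbb{I}(x_n \in R_1)}{\sum_{n=1}^{N} \mathbb{I}(x_n \in R_1)}$$
where $y_n$ is the class label of the n-th sample and N is the number of samples. The single decision tree model is:
$$f(x; \theta) = \sum_{j=1}^{J} w_j\, \mathbb{I}(x \in R_j)$$
where $R_j$ is the region corresponding to the j-th leaf node, $w_j$ is the predicted output of that leaf node,
$\theta = \{(R_j, w_j) : j = 1:J\}$, and J is the number of leaf nodes;
S22: Construct the XGBoost-based abnormal traffic detection model.
A regularization term is added to the original mean-squared-error objective:
$$L(f) = \sum_{i=1}^{N} \ell(y_i, f(x_i)) + \Omega(f), \qquad \Omega(f) = \gamma J + \frac{1}{2} \lambda \sum_{j=1}^{J} w_j^2$$
where $\ell$ is the per-sample loss, J is the number of leaf nodes, and $\gamma \ge 0$ and $\lambda \ge 0$ are the regularization coefficients;
at step m, the loss function is approximated to second order:
$$L_m(F_m) \approx \sum_{i=1}^{N} \left[ \ell(y_i, f_{m-1}(x_i)) + g_{im} F_m(x_i) + \frac{1}{2} h_{im} F_m(x_i)^2 \right] + \Omega(F_m)$$
where $g_{im}$ is the gradient and $h_{im}$ is the Hessian (second derivative) of the loss with respect to the prediction, both evaluated at $f_{m-1}$:
$$g_{im} = \left[ \frac{\partial \ell(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}, \qquad h_{im} = \left[ \frac{\partial^2 \ell(y_i, f(x_i))}{\partial f(x_i)^2} \right]_{f = f_{m-1}}$$
For a tree, $F(x) = w_{q(x)}$, where $q(x) \in \{1, \dots, J\}$ assigns the input x to a leaf node and $w \in \mathbb{R}^J$ is the vector of leaf weights;
finally, the XGBoost ensemble is:
$$f(x) = \sum_{m=1}^{M} W_m F_m(x)$$
where $F_m$ is the tree added at step m, M is the number of trees, and $W_m$ is the corresponding weight.
Further, in S3, the model training includes the following steps:
S31: Solve for the optimal parameters using the XGBoost loss function. Dropping the terms that do not depend on $F_m$, the loss at step m becomes
$$L_m(q, w) \approx \sum_{j=1}^{J} \left[ G_{jm} w_j + \frac{1}{2} (H_{jm} + \lambda) w_j^2 \right] + \gamma J, \qquad G_{jm} = \sum_{i \in I_j} g_{im}, \quad H_{jm} = \sum_{i \in I_j} h_{im}$$
where $I_j = \{i : q(x_i) = j\}$ is the set of indices of the data samples assigned to the j-th leaf node;
this is a quadratic problem in each $w_j$, with the optimal weights:
$$w_j^* = -\frac{G_{jm}}{H_{jm} + \lambda}$$
For different tree structures q, the loss function is:
$$L_m(q, w^*) = -\frac{1}{2} \sum_{j=1}^{J} \frac{G_{jm}^2}{H_{jm} + \lambda} + \gamma J$$
The loss function is minimized to obtain the optimal parameters.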
The invention also relates to a computer system comprising a collector, a memory, a processor and a computer program on the memory and running on the processor, the collector collecting information, the processor executing the computer program realizing the steps of the method.
The invention also relates to an electronic device comprising a memory, a processor and a computer program on the memory and executable on the processor, which processor implements the steps of the above method when executing the computer program.
The invention also relates to a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as described above.
Compared with the prior art, the invention has the following beneficial effects:
(1) Compared with traditional ensemble learning methods, the method adds a regularization term on the complexity of the tree, uses a second-order approximation of the loss function rather than a first-order (linear) one, and performs feature selection at the internal nodes, which improves detection performance and efficiency on network access traffic.
(2) The method provided by the invention is a supervised learning method. The tree model performs well, is more interpretable than a neural network model, and can evaluate the importance of features.
Drawings
FIG. 1 is a flow chart of a detection method according to an embodiment of the present invention;
FIG. 2 is an algorithmic flow of XGBoost of an embodiment of the present invention;
FIG. 3 is an experimental result of an embodiment of the present invention verifying the effectiveness of the XGBoost algorithm.
Detailed Description
The technical solutions in this embodiment will be clearly and completely described in conjunction with the embodiment of the present invention, and it is obvious that the described embodiment is only a part of examples of the present invention, not all examples. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
In this embodiment, the base classifier is a single classification and regression tree (CART). A decision tree recursively partitions the input space and defines a local model on each resulting region; the entire model can be represented as a tree. The predictions of a single tree are often less accurate than those of other models, because the tree is fitted with a greedy algorithm.
In combination with the idea of ensemble learning, the embodiment provides a relatively stable prediction classification algorithm by using an ensemble method based on XGBoost.
As shown in fig. 1, the method for detecting network anomaly intrusion based on XGBoost in this embodiment includes the following steps:
S1: Network access data preprocessing.
The network access data are text strings. In the intrusion detection procedure, the text must be encoded and padded to the length of the longest text string, and the data set is divided into a training set and a test set. Specifically:
S11: Text encoding of the network access data. Each character in the network access data is mapped to a corresponding number according to a code table. The input length required by the model is set; inputs shorter than this length are padded, and inputs longer than it are truncated. The data set is denoted D = {(X, Y)}, where X = (x1, x2, …, xn) represents the mapped network access data and Y = (y1, y2, …, yn) represents the corresponding class labels, with 0 denoting normal access and 1 denoting a malicious attack.
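The encoding step in S11 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the code table contents, the reserved ids `PAD_ID` and `UNK_ID`, the choice of right-padding, and the sample request strings are all assumptions introduced here.

```python
from typing import Dict, List

PAD_ID = 0  # assumed id reserved for padding
UNK_ID = 1  # assumed id for characters missing from the code table


def build_code_table(texts: List[str]) -> Dict[str, int]:
    """Assign each distinct character an integer id, after the reserved ids."""
    table: Dict[str, int] = {}
    for text in texts:
        for ch in text:
            if ch not in table:
                table[ch] = len(table) + 2
    return table


def encode(text: str, table: Dict[str, int], max_len: int) -> List[int]:
    """Map characters to ids, truncate to max_len, then right-pad with PAD_ID."""
    ids = [table.get(ch, UNK_ID) for ch in text[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Hypothetical access strings, used only to build a small code table.
table = build_code_table(["GET /a", "POST /b"])
short = encode("GET /a", table, max_len=8)      # shorter than 8: padded
long_ = encode("POST /b?x=1", table, max_len=8)  # longer than 8: truncated
print(short)
print(long_)
```

Every encoded record has the same length, so the records can be stacked into a fixed-width feature matrix for the tree model.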
S12: Data set splitting. To test the model's effect in practice, the data set is split into a training set and a test set in a fixed proportion. The training set is used for model training, and the test set is used for model selection.
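A minimal sketch of the split in S12 follows. The source only says "a fixed proportion"; splitting each class separately (stratification), the 0.8 default, and the fixed seed are assumptions made here so that the rarer attack class appears in both subsets.

```python
import random
from typing import List, Sequence, Tuple


def split_indices(
    y: Sequence[int], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    train_idx: List[int] = []
    test_idx: List[int] = []
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        cut = int(round(train_ratio * len(idx)))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return train_idx, test_idx


# Toy labels: 20 normal records (0) and 5 attack records (1).
labels = [0] * 20 + [1] * 5
train_idx, test_idx = split_indices(labels)
print(len(train_idx), len(test_idx))
```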
S2: Build the model. An XGBoost model based on decision trees is constructed as follows:
S21: Construct each independent decision tree. At each node i, the feature dimension $d_i$ of the input vector x is compared with a threshold $t_i$, and x is sent to the left or right branch according to the result; the leaf nodes of the tree hold the model's predictions. For example, the first node tests whether $x_1$ is below the threshold $t_1$;
if so, the next node tests whether $x_2$ is below the threshold $t_2$, and if so, x enters the left leaf node. The spatial region corresponding to that leaf node is:
$$R_1 = \{x : x_1 \le t_1,\ x_2 \le t_2\}$$
Axis-parallel splits thus link each region to a predicted output. The average response associated with region $R_1$ is
$$w_1 = \frac{\sum_{n=1}^{N} y_n\, \mathbb{I}(x_n \in R_1)}{\sum_{n=1}^{N} \mathbb{I}(x_n \in R_1)}$$
where $y_n$ is the class label of the n-th sample and N is the number of samples. The single decision tree model is:
$$f(x; \theta) = \sum_{j=1}^{J} w_j\, \mathbb{I}(x \in R_j)$$
where $R_j$ is the region corresponding to the j-th leaf node, $w_j$ is the predicted output of that leaf node, $\theta = \{(R_j, w_j) : j = 1:J\}$, and J is the number of leaf nodes.
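The single-tree model of S21 can be illustrated with a small sketch: each leaf is a region of the input space, its output is the average label of the training points inside it, and a prediction is the output of the one region that contains x. The thresholds `T1` and `T2`, the three regions, and the toy data are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Region = Callable[[Sequence[float]], bool]

T1, T2 = 3.0, 4.0  # assumed thresholds t1 and t2


def mean_response(X: List[Sequence[float]], y: List[int], region: Region) -> float:
    """Average label of the training points in the region (assumed non-empty)."""
    inside = [label for x, label in zip(X, y) if region(x)]
    return sum(inside) / len(inside)


def tree_predict(x: Sequence[float], leaves: List[Tuple[Region, float]]) -> float:
    """f(x) = sum_j w_j * I(x in R_j); exactly one region contains x."""
    return sum(w for region, w in leaves if region(x))


# Three leaf regions of a depth-2 tree; x[0] plays the role of x1, x[1] of x2.
regions: List[Region] = [
    lambda x: x[0] <= T1 and x[1] <= T2,  # R1 = {x : x1 <= t1, x2 <= t2}
    lambda x: x[0] <= T1 and x[1] > T2,
    lambda x: x[0] > T1,
]

X = [(1, 1), (2, 1), (1, 9), (5, 1), (6, 2)]
y = [1, 0, 0, 1, 1]
leaves = [(r, mean_response(X, y, r)) for r in regions]
print([w for _, w in leaves])
print(tree_predict((2, 2), leaves), tree_predict((7, 0), leaves))
```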
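S22: Construct the XGBoost-based abnormal traffic detection model.
A regularization term is added to the original mean-squared-error objective:
$$L(f) = \sum_{i=1}^{N} \ell(y_i, f(x_i)) + \Omega(f), \qquad \Omega(f) = \gamma J + \frac{1}{2} \lambda \sum_{j=1}^{J} w_j^2$$
where $\ell$ is the per-sample loss, J is the number of leaf nodes, and $\gamma \ge 0$ and $\lambda \ge 0$ are the regularization coefficients;
at step m, the loss function is approximated to second order:
$$L_m(F_m) \approx \sum_{i=1}^{N} \left[ \ell(y_i, f_{m-1}(x_i)) + g_{im} F_m(x_i) + \frac{1}{2} h_{im} F_m(x_i)^2 \right] + \Omega(F_m)$$
where $g_{im}$ is the gradient and $h_{im}$ is the Hessian (second derivative) of the loss with respect to the prediction, both evaluated at $f_{m-1}$:
$$g_{im} = \left[ \frac{\partial \ell(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}, \qquad h_{im} = \left[ \frac{\partial^2 \ell(y_i, f(x_i))}{\partial f(x_i)^2} \right]_{f = f_{m-1}}$$
For a tree, $F(x) = w_{q(x)}$, where $q(x) \in \{1, \dots, J\}$ assigns the input x to a leaf node and $w \in \mathbb{R}^J$ is the vector of leaf weights;
finally, the XGBoost ensemble is:
$$f(x) = \sum_{m=1}^{M} W_m F_m(x)$$
where $F_m$ is the tree added at step m, M is the number of trees, and $W_m$ is the corresponding weight.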
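S3: Model training, which includes the following steps:
S31: Solve for the optimal parameters using the XGBoost loss function. Let $g_{im}$ and $h_{im}$ denote the first and second derivatives of the loss with respect to the prediction for sample i at step m. Dropping the terms that do not depend on the new tree, the loss at step m becomes
$$L_m(q, w) \approx \sum_{j=1}^{J} \left[ G_{jm} w_j + \frac{1}{2} (H_{jm} + \lambda) w_j^2 \right] + \gamma J, \qquad G_{jm} = \sum_{i \in I_j} g_{im}, \quad H_{jm} = \sum_{i \in I_j} h_{im}$$
where $I_j = \{i : q(x_i) = j\}$ is the set of indices of the data samples assigned to the j-th leaf node;
this is a quadratic problem in each $w_j$, with the optimal weights:
$$w_j^* = -\frac{G_{jm}}{H_{jm} + \lambda}$$
For different tree structures q, the loss function is:
$$L_m(q, w^*) = -\frac{1}{2} \sum_{j=1}^{J} \frac{G_{jm}^2}{H_{jm} + \lambda} + \gamma J$$
The loss function is minimized to obtain the optimal parameters.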
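The closed-form leaf weight and the structure score from S31 can be checked numerically. The sketch below assumes the squared-error loss l = 0.5 (y - f)^2, for which the gradient is f - y and the Hessian is 1; the values of `lam` (lambda) and `gamma` and the toy labels are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Leaf = Tuple[Sequence[float], Sequence[float]]  # (gradients, Hessians) in one leaf


def grad_hess_squared_error(
    y: Sequence[float], pred: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """For l = 0.5 * (y - f)^2: g = f - y and h = 1."""
    g = [p - t for t, p in zip(y, pred)]
    h = [1.0] * len(y)
    return g, h


def leaf_weight(g: Sequence[float], h: Sequence[float], lam: float) -> float:
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -sum(g) / (sum(h) + lam)


def structure_score(leaves: Sequence[Leaf], lam: float, gamma: float) -> float:
    """Loss of a tree structure: -0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * J."""
    total = sum(sum(g) ** 2 / (sum(h) + lam) for g, h in leaves)
    return -0.5 * total + gamma * len(leaves)


y = [1.0, 1.0, 0.0, 0.0]
g, h = grad_hess_squared_error(y, [0.5] * 4)

one_leaf = structure_score([(g, h)], lam=1.0, gamma=0.1)
two_leaves = structure_score([(g[:2], h[:2]), (g[2:], h[2:])], lam=1.0, gamma=0.1)
w_left = leaf_weight(g[:2], h[:2], lam=1.0)
print(one_leaf, two_leaves, w_left)
```

Separating the two classes into different leaves gives the lower score here, which is the comparison used to choose between candidate tree structures q.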
The present application also provides a program product, which comprises a computer program stored in a storage medium, from which at least one processor can read the computer program, and the method of the above embodiment can be implemented when the at least one processor executes the computer program.
The embodiment of the application also provides a chip for running the instructions, and the chip is used for executing the method of the embodiment.
The present embodiments also provide a storage medium having instructions stored therein, which when run on a computer, cause the computer to perform the method of the embodiments as described above.
The processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
To verify the effectiveness of XGBoost-based network anomaly intrusion detection, we used a public data set of access data collected from a bank, containing 21991 normal access records and 1097 abnormal access records. The data set was split 8:2 into a training set and a test set. The performance of the model is evaluated on the test set; the evaluation metric is the area under the ROC curve (AUC).
Step 1: Experimental configuration and data preparation
Experimental configuration:
The experiment was run on a notebook computer with a 4-core Intel(R) i7-4720HQ CPU @ 2601 MHz and 16 GB of memory. The public data set used is a bank's network access data set, containing 21991 normal access records and 1097 attack access records.
Data preparation:
an example of one network access of a bank dataset is as follows:
Step 2: Data preprocessing
First, regular expressions are used to extract the body of each access record: the text between st@rt and INFO is removed, only the body between INFO and END is retained, and an <EOS> token is appended to mark the end of each network access text.
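The filtering step can be sketched with Python's `re` module. The source does not show the actual record layout, so the `RECORD` string below is hypothetical; only the `st@rt`, `INFO`, and `END` markers and the `<EOS>` suffix come from the description above.

```python
import re

# Hypothetical record; the real log layout is not given in the source.
RECORD = "st@rt 2022-01-25 10.0.0.1 INFO GET /login?user=admin END"

# Skip everything from st@rt up to INFO, then capture the body up to END.
BODY_PATTERN = re.compile(r"st@rt.*?INFO(.*?)END", re.DOTALL)


def extract_body(record: str) -> str:
    """Keep only the text between INFO and END and append the <EOS> marker."""
    match = BODY_PATTERN.search(record)
    if match is None:
        raise ValueError("record does not contain st@rt ... INFO ... END")
    return match.group(1).strip() + "<EOS>"


body = extract_body(RECORD)
print(body)
```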
The text after regular-expression matching is:
The characters are then converted into numbers using a dictionary table; the result of the text conversion is:
Finally, the data set is labeled and split 8:2 into a training set and a validation set.
Step 3: model construction and training
An XGBoost model object is created and configured with 100 simple decision trees. The training set is fed into the model, and the loss function is minimized to obtain the optimal parameters.
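To make the training loop concrete, the following is a simplified pure-Python sketch of boosting with 100 depth-1 trees (stumps). It is not the XGBoost library and not the patent's code: it assumes the squared-error loss (gradient f - y, Hessian 1), omits the gamma penalty, and introduces a shrinkage factor `lr` that the source does not mention. In practice one would more likely use the `xgboost` package, for example its `XGBClassifier` with `n_estimators=100`.

```python
from typing import List, Optional, Sequence, Tuple

Stump = Tuple[int, float, float, float]  # (feature, threshold, left w, right w)


def fit_stump(
    X: Sequence[Sequence[float]], g: Sequence[float], h: Sequence[float], lam: float
) -> Optional[Stump]:
    """Pick the split minimizing -0.5 * (G_L^2/(H_L+lam) + G_R^2/(H_R+lam))."""
    best: Optional[Stump] = None
    best_score = float("inf")
    for d in range(len(X[0])):
        for t in sorted({row[d] for row in X}):
            left = [i for i, row in enumerate(X) if row[d] <= t]
            right = [i for i, row in enumerate(X) if row[d] > t]
            if not left or not right:
                continue
            gl, hl = sum(g[i] for i in left), sum(h[i] for i in left)
            gr, hr = sum(g[i] for i in right), sum(h[i] for i in right)
            score = -0.5 * (gl * gl / (hl + lam) + gr * gr / (hr + lam))
            if score < best_score:
                best_score = score
                best = (d, t, -gl / (hl + lam), -gr / (hr + lam))
    return best


def boost(X, y, n_trees: int = 100, lam: float = 1.0, lr: float = 0.3) -> List[Stump]:
    """Add one stump per round, each fitted to the current gradients."""
    stumps: List[Stump] = []
    preds = [0.0] * len(y)
    for _ in range(n_trees):
        g = [p - t for p, t in zip(preds, y)]  # squared-error gradient
        h = [1.0] * len(y)                     # squared-error Hessian
        stump = fit_stump(X, g, h, lam)
        if stump is None:  # no valid split exists
            break
        stumps.append(stump)
        d, t, wl, wr = stump
        preds = [p + lr * (wl if row[d] <= t else wr) for p, row in zip(preds, X)]
    return stumps


def predict(stumps: List[Stump], x: Sequence[float], lr: float = 0.3) -> float:
    """Sum the shrunken outputs of all stumps for one input."""
    return sum(lr * (wl if x[d] <= t else wr) for d, t, wl, wr in stumps)


X = [[1, 0], [2, 0], [8, 1], [9, 1]]
y = [0, 0, 1, 1]
model = boost(X, y)
print(len(model), predict(model, [9, 1]), predict(model, [1, 0]))
```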
Step 4: model verification
The performance of the model is evaluated on the test set; the evaluation metric is the area under the ROC curve (AUC).
Experimental results:
The criterion used to evaluate the experimental results is the ROC curve, a common metric for binary classification, which is the problem being solved here. Each classification result falls into one of four cases:
1. True Positive (TP): the classifier judges the access to be abnormal, and it is in fact abnormal.
2. False Positive (FP): the classifier judges the access to be abnormal, but it is in fact normal.
3. True Negative (TN): the classifier judges the access to be normal, and it is in fact normal.
4. False Negative (FN): the classifier judges the access to be normal, but it is in fact abnormal.
Given a threshold η, two rates can be calculated: the true positive rate TPR = TP / (TP + FN), the proportion of abnormal accesses correctly judged abnormal, and the false positive rate FPR = FP / (FP + TN), the proportion of normal accesses wrongly judged abnormal.
Each threshold η corresponds to one (FPR, TPR) coordinate; the series of points obtained by varying η forms the ROC curve. AUC is the area under the ROC curve.
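The construction of the ROC curve and the AUC can be sketched as follows. The labels and scores are toy values invented for illustration, not results from the bank data set, and the trapezoidal rule is one common way to compute the area.

```python
from typing import List, Sequence, Tuple


def roc_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Sweep the threshold over the distinct scores; return (FPR, TPR) points."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for eta in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= eta and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= eta and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: 1 = abnormal access, score = predicted probability of abnormal.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
print(points)
print(auc(points))
```

An AUC of 1.0 means every abnormal access is scored above every normal one; 0.5 corresponds to random ranking.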
The ROC curve and AUC value of the ensemble-learning-based network anomaly intrusion detection on the bank data set are shown in FIG. 3.
The shape of the ROC curve and the AUC value of 0.9957 show that the method identifies abnormal network attacks very accurately.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.
Claims (7)
1. A network anomaly intrusion detection method based on XGBoost, characterized in that the method comprises the following steps:
S1: preprocessing the network access data:
in the intrusion detection process, the text is encoded, inputs are padded to the length of the longest text string, and the data set is divided into a training set and a test set;
S2: building the model: an XGBoost model based on decision trees is constructed;
S3: training the model: optimization training is performed on the model.
2. The method according to claim 1, characterized in that: in S1, the data preprocessing is specifically as follows:
S11: text encoding of the network access data, wherein each character in the network access data is mapped to a corresponding number according to a code table; the input length required by the model is set, inputs shorter than this length are padded, and inputs longer than it are truncated; the data set is denoted D = {(X, Y)}, where X = (x1, x2, …, xn) represents the mapped network access data and Y = (y1, y2, …, yn) represents the corresponding class labels, with 0 denoting normal access and 1 denoting a malicious attack;
S12: the data set is split into a training set and a test set in a fixed proportion; the training set is used for model training, and the test set is used for model selection.
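3. The method according to claim 1, characterized in that: in S2, the model construction includes the following steps: S21: constructing each independent decision tree, wherein at each node i the feature dimension $d_i$ of the input vector x is compared with a threshold $t_i$, and x is sent to the left or right branch according to the result; the leaf nodes of the tree hold the model's predictions; the first node tests whether $x_1$ is below the threshold $t_1$;
if so, the next node tests whether $x_2$ is below the threshold $t_2$, and if so, x enters the left leaf node;
the spatial region corresponding to that leaf node is:
$$R_1 = \{x : x_1 \le t_1,\ x_2 \le t_2\}$$
axis-parallel splits thus link each region to a predicted output; the average response associated with region $R_1$ is
$$w_1 = \frac{\sum_{n=1}^{N} y_n\, \mathbb{I}(x_n \in R_1)}{\sum_{n=1}^{N} \mathbb{I}(x_n \in R_1)}$$
where $y_n$ is the class label of the n-th sample and N is the number of samples; the single decision tree model is:
$$f(x; \theta) = \sum_{j=1}^{J} w_j\, \mathbb{I}(x \in R_j)$$
where $R_j$ is the region corresponding to the j-th leaf node, $w_j$ is the predicted output of that leaf node,
$\theta = \{(R_j, w_j) : j = 1:J\}$, and J is the number of leaf nodes;
S22: constructing the XGBoost-based abnormal traffic detection model:
a regularization term is added to the original mean-squared-error objective:
$$L(f) = \sum_{i=1}^{N} \ell(y_i, f(x_i)) + \Omega(f), \qquad \Omega(f) = \gamma J + \frac{1}{2} \lambda \sum_{j=1}^{J} w_j^2$$
where $\ell$ is the per-sample loss, J is the number of leaf nodes, and $\gamma \ge 0$ and $\lambda \ge 0$ are the regularization coefficients;
at step m, the loss function is approximated to second order:
$$L_m(F_m) \approx \sum_{i=1}^{N} \left[ \ell(y_i, f_{m-1}(x_i)) + g_{im} F_m(x_i) + \frac{1}{2} h_{im} F_m(x_i)^2 \right] + \Omega(F_m)$$
where $g_{im}$ is the gradient and $h_{im}$ is the Hessian (second derivative) of the loss with respect to the prediction, both evaluated at $f_{m-1}$:
$$g_{im} = \left[ \frac{\partial \ell(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}, \qquad h_{im} = \left[ \frac{\partial^2 \ell(y_i, f(x_i))}{\partial f(x_i)^2} \right]_{f = f_{m-1}}$$
for a tree, $F(x) = w_{q(x)}$, where $q(x) \in \{1, \dots, J\}$ assigns the input x to a leaf node and $w \in \mathbb{R}^J$ is the vector of leaf weights;
finally, the XGBoost ensemble is:
$$f(x) = \sum_{m=1}^{M} W_m F_m(x)$$
where $F_m$ is the tree added at step m, M is the number of trees, and $W_m$ is the corresponding weight.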
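4. The method according to claim 1, characterized in that: in S3, the model training includes the following steps:
S31: solving for the optimal parameters using the XGBoost loss function; with $g_{im}$ and $h_{im}$ denoting the first and second derivatives of the loss with respect to the prediction for sample i at step m, and dropping the terms that do not depend on the new tree, the loss at step m becomes
$$L_m(q, w) \approx \sum_{j=1}^{J} \left[ G_{jm} w_j + \frac{1}{2} (H_{jm} + \lambda) w_j^2 \right] + \gamma J, \qquad G_{jm} = \sum_{i \in I_j} g_{im}, \quad H_{jm} = \sum_{i \in I_j} h_{im}$$
where $I_j = \{i : q(x_i) = j\}$ is the set of indices of the data samples assigned to the j-th leaf node;
this is a quadratic problem in each $w_j$, with the optimal weights:
$$w_j^* = -\frac{G_{jm}}{H_{jm} + \lambda}$$
for different tree structures q, the loss function is:
$$L_m(q, w^*) = -\frac{1}{2} \sum_{j=1}^{J} \frac{G_{jm}^2}{H_{jm} + \lambda} + \gamma J$$
the loss function is minimized to obtain the optimal parameters.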
5. A computer system comprising a collector, a memory, a processor, and a computer program on the memory and executable on the processor, the collector collecting information, characterized in that: the processor, when executing the computer program, implements the steps of the method of any of the preceding claims 1 to 4.
6. An electronic device comprising a memory, a processor, and a computer program on the memory and executable on the processor, characterized by: the processor, when executing the computer program, implements the steps of the method of any of the preceding claims 1 to 4.
7. A non-transitory computer readable storage medium having a computer program stored thereon, characterized by: which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210089952.XA CN116541698A (en) | 2022-01-25 | 2022-01-25 | XGBoost-based network anomaly intrusion detection method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210089952.XA CN116541698A (en) | 2022-01-25 | 2022-01-25 | XGBoost-based network anomaly intrusion detection method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116541698A true CN116541698A (en) | 2023-08-04 |
Family
ID=87442328
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210089952.XA Pending CN116541698A (en) | 2022-01-25 | 2022-01-25 | XGBoost-based network anomaly intrusion detection method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116541698A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117081858A (en) * | 2023-10-16 | 2023-11-17 | 山东省计算中心(国家超级计算济南中心) | Intrusion behavior detection method, system, equipment and medium based on multi-decision tree |
CN117811795A (en) * | 2023-12-28 | 2024-04-02 | 苏州市职业大学(苏州开放大学) | Campus network safety protection system and method |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117081858A (en) * | 2023-10-16 | 2023-11-17 | 山东省计算中心(国家超级计算济南中心) | Intrusion behavior detection method, system, equipment and medium based on multi-decision tree |
CN117081858B (en) * | 2023-10-16 | 2024-01-19 | 山东省计算中心(国家超级计算济南中心) | Intrusion behavior detection method, system, equipment and medium based on multi-decision tree |
CN117811795A (en) * | 2023-12-28 | 2024-04-02 | 苏州市职业大学(苏州开放大学) | Campus network safety protection system and method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111967502B (en) | Network intrusion detection method based on conditional variation self-encoder | |
CN103870751B (en) | Method and system for intrusion detection | |
CN112491796B (en) | Intrusion detection and semantic decision tree quantitative interpretation method based on convolutional neural network | |
CN111881983B (en) | Data processing method and device based on classification model, electronic equipment and medium | |
CN106570513A (en) | Fault diagnosis method and apparatus for big data network system | |
CN116541698A (en) | XGBoost-based network anomaly intrusion detection method and system | |
CN116595463B (en) | Construction method of electricity larceny identification model, and electricity larceny behavior identification method and device | |
TW200849917A (en) | Detecting method of network invasion | |
CN114553545A (en) | Intrusion flow detection and identification method and system | |
WO2023116111A1 (en) | Disk fault prediction method and apparatus | |
CN111191720B (en) | Service scene identification method and device and electronic equipment | |
CN112633426A (en) | Method and device for processing data class imbalance, electronic equipment and storage medium | |
CN114239807A (en) | RFE-DAGMM-based high-dimensional data anomaly detection method | |
CN114037478A (en) | Advertisement abnormal flow detection method and system, electronic equipment and readable storage medium | |
CN113542241A (en) | Intrusion detection method and device based on CNN-BiGRU mixed model | |
CN116866054A (en) | Public information safety monitoring system and method thereof | |
Cheng et al. | Blocking bug prediction based on XGBoost with enhanced features | |
CN114448657B (en) | Distribution communication network security situation awareness and abnormal intrusion detection method | |
CN116628584A (en) | Power sensitive data processing method and device, electronic equipment and storage medium | |
CN117375896A (en) | Intrusion detection method and system based on multi-scale space-time feature residual fusion | |
CN116611003A (en) | Transformer fault diagnosis method, device and medium | |
CN116805245A (en) | Fraud detection method and system based on graph neural network and decoupling representation learning | |
CN114553468A (en) | Three-level network intrusion detection method based on feature intersection and ensemble learning | |
CN114095268A (en) | Method, terminal and storage medium for network intrusion detection | |
Chareka et al. | A study of fitness functions for data classification using grammatical evolution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||