CN116541698A - XGBoost-based network anomaly intrusion detection method and system

Info

Publication number: CN116541698A
Application number: CN202210089952.XA
Authority: CN
Inventors: 郭威; 谢林江; 毛正雄; 罗震宇; 何映军; 张振红; 杭菲璐; 陈何雄; 占梦来; 张军
Original assignee: Information Center of Yunnan Power Grid Co Ltd
Current assignee: Information Center of Yunnan Power Grid Co Ltd
Priority date: 2022-01-25
Filing date: 2022-01-25
Publication date: 2023-08-04

Abstract

The invention relates to a network anomaly intrusion detection method and system based on XGBoost. The method comprises: preprocessing network access data, in which, during the intrusion detection flow, the text is encoded, inputs are padded to the length of the longest text string, and the data set is divided into a training set and a testing set; model construction, in which an XGBoost model based on decision trees is built; and model training, in which the model is trained by optimization. The invention combines weak classifiers through ensemble learning. Compared with traditional ensemble learning methods, it adds a regularization term on the complexity of the tree, uses a second-order approximation of the loss function rather than a linear approximation, and performs feature selection at internal nodes, so that it achieves better detection performance and efficiency on network access traffic.

Description

XGBoost-based network anomaly intrusion detection method and system

Technical Field

The invention relates to the field of network security, in particular to a network anomaly intrusion detection method and system based on XGBoost.

Background

Identifying abnormal network traffic usually requires analyzing a large amount of log information. Analysis based on preset rules lags behind seriously, because the rules must be researched and summarized from network attacks that have already occurred, so new abnormal network attack activity cannot be distinguished in a timely and effective manner. Using machine learning methods to identify abnormal network traffic is therefore a currently active research field.

However, abnormal network traffic data are high-dimensional and nonlinear, which limits the effectiveness of traditional machine learning algorithms. Overcoming the curse of dimensionality and improving the computational efficiency of the algorithm model is therefore an important research topic at present.

Disclosure of Invention

In order to solve the above problems, the present invention aims to provide a network anomaly intrusion detection method based on ensemble learning, which combines multiple models into a stronger model.

The technical scheme of the invention is as follows:

A network anomaly intrusion detection method based on XGBoost comprises the following steps:

S1, preprocessing network access data:

In the intrusion detection process, the text is encoded, inputs are padded to the length of the longest text string, and the data set is divided into a training set and a testing set;

S2, model construction: building an XGBoost model based on decision trees;

S3, model training: performing optimization training on the model.

Further, in S1, the data preprocessing is specifically as follows:

S11, text encoding of the network access data: each character in the network access data is mapped to a corresponding number according to a code table; the input length required by the model is set, inputs shorter than this length are padded, and inputs longer than this length are truncated; the data set is denoted $D = \{(X, Y)\}$, where $X = (x_1, x_2, \dots, x_n)$ represents the mapped network access data and $Y = (y_1, y_2, \dots, y_n)$ represents the class labels corresponding to the network access data, with 0 representing normal access and 1 representing malicious attack;

S12, segmenting the data set into a training set and a testing set in a given proportion; the training set is used for model training, and the testing set is used for model selection.
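Steps S11 and S12 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the code table is built from the characters seen in the data, id 0 is reserved for padding and for unknown characters, and the sample requests, the input length of 16 and the 8:2 ratio are hypothetical values chosen for the example.

```python
import random

PAD_ID = 0  # assumed id reserved for padding (and for characters missing from the code table)


def build_code_table(texts):
    """Map every character seen in the corpus to an integer id, starting at 1."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: idx for idx, ch in enumerate(chars, start=1)}


def encode(text, code_table, max_len):
    """Map characters to ids, cut off the excess length, then pad the shortfall."""
    ids = [code_table.get(ch, PAD_ID) for ch in text][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def train_test_split(samples, labels, train_ratio=0.8, seed=0):
    """Shuffle the sample indices and split them into training and testing parts."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    cut = int(len(order) * train_ratio)
    train, test = order[:cut], order[cut:]
    return ([samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in test], [labels[i] for i in test])


# Hypothetical toy records; 0 = normal access, 1 = malicious attack.
texts = ["GET /index.html", "GET /login?user=admin' OR '1'='1", "POST /pay", "GET /a", "GET /b"]
labels = [0, 1, 0, 0, 0]

table = build_code_table(texts)
X = [encode(t, table, max_len=16) for t in texts]
X_train, y_train, X_test, y_test = train_test_split(X, labels)
```

Every encoded record has the same length, which is what a tabular learner such as a gradient-boosted tree model expects as input.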

Further, in S2, the model construction includes the following steps:

S21, constructing each independent decision tree: at each node $i$, the feature dimension $d_i$ of the input vector $x$ is compared with a threshold $t_i$, and $x$ is sent down the left or right branch according to the result of the comparison; the leaf nodes of the tree hold the prediction results of the model; for example, the first node judges whether $x_1$ is smaller than a threshold $t_1$;

if it is smaller, the next node judges whether $x_2$ is smaller than a threshold $t_2$, and if so, $x$ enters the left leaf node; the spatial region corresponding to this leaf node is:

$$R_1 = \{x : x_1 \le t_1,\ x_2 \le t_2\};$$

the regions are linked to prediction outputs through axis-parallel splits; the average response

$$w_1 = \frac{\sum_n y_n \, \mathbb{I}(x_n \in R_1)}{\sum_n \mathbb{I}(x_n \in R_1)}$$

is associated with each such region, where $y_n$ is the class label; the single decision tree model is:

$$f(x; \theta) = \sum_{j=1}^{J} w_j \, \mathbb{I}(x \in R_j)$$

where $R_j$ is the region corresponding to the $j$-th leaf node, $w_j$ is the predicted output of that leaf node,

$\theta = \{(R_j, w_j) : j = 1{:}J\}$, and $J$ is the number of leaf nodes;
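The single-tree model of S21 can be illustrated with a depth-2 tree in Python. This is a sketch under assumptions: the thresholds, the three-leaf layout and the toy samples are hypothetical, and each leaf output is taken to be the average label of the training samples falling in its region.

```python
def average_response(xs, ys, in_region):
    """Mean label of the training samples that fall in a region."""
    hits = [y for x, y in zip(xs, ys) if in_region(x)]
    return sum(hits) / len(hits) if hits else 0.0


def make_tree(t1, t2, xs, ys):
    """Depth-2 tree: test x[0] <= t1 first, then x[1] <= t2 on the left branch."""
    regions = [
        lambda x: x[0] <= t1 and x[1] <= t2,  # R_1
        lambda x: x[0] <= t1 and x[1] > t2,   # R_2
        lambda x: x[0] > t1,                  # R_3
    ]
    weights = [average_response(xs, ys, r) for r in regions]

    def predict(x):
        # f(x; theta) = sum_j w_j * I(x in R_j); exactly one indicator equals 1
        return sum(w * (1 if r(x) else 0) for r, w in zip(regions, weights))

    return predict, weights


# Hypothetical two-feature samples with labels 0 (normal) and 1 (attack).
xs = [(1, 1), (1, 2), (2, 5), (6, 1), (7, 3)]
ys = [0, 0, 1, 1, 1]
predict, weights = make_tree(t1=3, t2=3, xs=xs, ys=ys)
```

Because the regions partition the input space, the sum over leaves reduces to a lookup of the one leaf that contains the input.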

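S22, constructing the XGBoost-based abnormal traffic detection model:

on top of the original mean squared error, a regularization term is introduced, as shown in the following formula:

$$\mathcal{L}(f) = \sum_i \ell\big(y_i, f(x_i)\big) + \Omega(f), \qquad \Omega(f) = \gamma J + \frac{1}{2} \lambda \sum_{j=1}^{J} w_j^2$$

where $J$ is the number of leaf nodes, and $\gamma \ge 0$ and $\lambda \ge 0$ are regularization coefficients;

in step $m$, the loss function is as follows:

$$\mathcal{L}_m(F_m) \approx \sum_i \Big[\ell\big(y_i, f_{m-1}(x_i)\big) + g_{im} F_m(x_i) + \frac{1}{2} h_{im} F_m(x_i)^2\Big] + \Omega(F_m)$$

where $f_{m-1}$ is the model after step $m-1$, $F_m$ is the tree added in step $m$, $g_{im}$ is the gradient, and $h_{im}$ is the Hessian (second derivative) of the loss:

$$g_{im} = \left.\frac{\partial \ell(y_i, f(x_i))}{\partial f(x_i)}\right|_{f = f_{m-1}}, \qquad h_{im} = \left.\frac{\partial^2 \ell(y_i, f(x_i))}{\partial f(x_i)^2}\right|_{f = f_{m-1}}$$

$F(x) = w_{q(x)}$, where $q(x) \in \{1, \dots, J\}$ assigns $x$ to a leaf node and $w = (w_1, \dots, w_J)$ are the weights of the leaf nodes;

finally, XGBoost is integrated as follows:

$$f(x) = \sum_{m=1}^{M} w_m F_m(x)$$

where $M$ is the number of trees and $w_m$ is the corresponding weight.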
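The benefit of a second-order approximation over a linear one can be checked numerically. The sketch below assumes the binary logistic loss, which is a common choice for 0/1 labels but is not named in the text; for squared error the second-order expansion would be exact, so it would not show a difference. All function names are illustrative.

```python
import math


def logistic_loss(y, f):
    """Binary log loss for a label y in {0, 1} and a raw score f."""
    return math.log(1.0 + math.exp(f)) - y * f


def grad_hess(y, f):
    """First and second derivatives of the logistic loss with respect to f."""
    p = 1.0 / (1.0 + math.exp(-f))
    return p - y, p * (1.0 - p)


def first_order(y, f_prev, delta):
    """Linear Taylor approximation of the loss at f_prev + delta."""
    g, _ = grad_hess(y, f_prev)
    return logistic_loss(y, f_prev) + g * delta


def second_order(y, f_prev, delta):
    """Second-order Taylor approximation of the loss at f_prev + delta."""
    g, h = grad_hess(y, f_prev)
    return logistic_loss(y, f_prev) + g * delta + 0.5 * h * delta ** 2


exact = logistic_loss(1, 0.0 + 0.5)
err_first = abs(first_order(1, 0.0, 0.5) - exact)
err_second = abs(second_order(1, 0.0, 0.5) - exact)
```

For this step the second-order error is roughly two orders of magnitude smaller than the first-order error, which is why the per-sample Hessian is kept alongside the gradient.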

Further, in S3, the model training includes the following steps:

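S31, solving for the optimal parameters using the loss function of XGBoost:

$$\mathcal{L}_m(q, w) \approx \sum_i \Big[g_{im} w_{q(x_i)} + \frac{1}{2} h_{im} w_{q(x_i)}^2\Big] + \gamma J + \frac{1}{2} \lambda \sum_{j=1}^{J} w_j^2 = \sum_{j=1}^{J} \Big[G_{jm} w_j + \frac{1}{2} (H_{jm} + \lambda) w_j^2\Big] + \gamma J$$

where $g_{im}$ and $h_{im}$ are the gradient and Hessian of the loss for sample $i$ at step $m$, terms that do not depend on the new tree are omitted, $G_{jm} = \sum_{i \in I_j} g_{im}$, $H_{jm} = \sum_{i \in I_j} h_{im}$, and $I_j = \{i : q(x_i) = j\}$ is the set of indices of the data samples assigned to the $j$-th leaf node;

this is a quadratic programming problem, with the optimal weights as follows:

$$w_j^* = -\frac{G_{jm}}{H_{jm} + \lambda}$$

for different tree structures $q$, the loss function is:

$$\mathcal{L}_m(q, w^*) = -\frac{1}{2} \sum_{j=1}^{J} \frac{G_{jm}^2}{H_{jm} + \lambda} + \gamma J$$

minimizing this loss function yields the optimal parameters.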
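As a worked example of S31, the sketch below evaluates the standard XGBoost closed forms, the optimal leaf weight -G/(H + lambda) and the structure score -1/2 * sum of G^2/(H + lambda) plus gamma * J, for two candidate tree structures. The gradients and Hessians are those of a logistic loss at an initial score of 0, which is an assumption made for illustration, as are the values of lambda and gamma.

```python
def leaf_stats(g, h, assignment, n_leaves):
    """Sum gradients and Hessians over I_j = {i : q(x_i) = j} for each leaf j."""
    stats = [[0.0, 0.0] for _ in range(n_leaves)]
    for gi, hi, j in zip(g, h, assignment):
        stats[j][0] += gi
        stats[j][1] += hi
    return [tuple(s) for s in stats]


def leaf_weight(G, H, lam):
    """Optimal leaf weight w_j* = -G_j / (H_j + lambda)."""
    return -G / (H + lam)


def structure_score(stats, lam, gamma):
    """Loss of a tree structure at its optimal leaf weights."""
    return -0.5 * sum(G * G / (H + lam) for G, H in stats) + gamma * len(stats)


# Four samples with labels 1, 1, 0, 0 and current score 0:
# logistic gradient = p - y = 0.5 - y, Hessian = p * (1 - p) = 0.25.
g = [-0.5, -0.5, 0.5, 0.5]
h = [0.25, 0.25, 0.25, 0.25]
lam, gamma = 1.0, 0.1

one_leaf = leaf_stats(g, h, [0, 0, 0, 0], n_leaves=1)
two_leaves = leaf_stats(g, h, [0, 0, 1, 1], n_leaves=2)

score_one = structure_score(one_leaf, lam, gamma)
score_two = structure_score(two_leaves, lam, gamma)
weights_two = [leaf_weight(G, H, lam) for G, H in two_leaves]
```

The two-leaf structure separates the classes and scores lower than the single leaf even after paying the extra gamma for the additional leaf, so it would be the structure chosen when the loss is minimized over q.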

The invention also relates to a computer system comprising a collector, a memory, a processor, and a computer program stored on the memory and executable on the processor; the collector collects information, and the processor implements the steps of the above method when executing the computer program.

The invention also relates to an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor; the processor implements the steps of the above method when executing the computer program.

The invention also relates to a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as described above.

Compared with the prior art, the invention has the following beneficial effects:

(1) Compared with traditional ensemble learning methods, the method adds a regularization term on the complexity of the tree, uses a second-order approximation of the loss function rather than a linear approximation, and performs feature selection at internal nodes, so that it achieves better detection performance and efficiency on network access traffic.

(2) The method provided by the invention is a supervised learning method. The tree model performs well and, because it is more interpretable than a neural network model, it can be used to evaluate the importance of features.

Drawings

FIG. 1 is a flow chart of a detection method according to an embodiment of the present invention;

FIG. 2 is the algorithm flow of XGBoost according to an embodiment of the present invention;

FIG. 3 shows the experimental results verifying the effectiveness of the XGBoost algorithm according to an embodiment of the present invention.

Detailed Description

The technical solutions of this embodiment are described clearly and completely below in conjunction with the embodiment of the present invention. The described embodiment is obviously only one part of the embodiments of the present invention, not all of them. All other embodiments that those skilled in the art can obtain from the embodiments of the invention without inventive effort fall within the scope of the invention.

In this embodiment, the base classifier is built on a single classification and regression tree (CART). The decision tree recursively partitions the input space and defines a local model on each resulting region, so the entire model can be represented as a tree. A single tree model predicts less accurately than other models, which is a consequence of the greedy algorithm used to fit the tree.

Drawing on the idea of ensemble learning, this embodiment uses an XGBoost-based ensemble method to provide a relatively stable prediction and classification algorithm.

As shown in FIG. 1, the XGBoost-based network anomaly intrusion detection method of this embodiment includes the following steps:

S1: network access data preprocessing.

The network access data are text strings. In the intrusion detection process, the text must be encoded and the inputs padded to the length of the longest text string. The data set is also divided into a training set and a testing set. Specifically:

S11: text encoding of the network access data. Each character in the network access data is mapped to a corresponding number according to a code table; the input length required by the model is set, inputs shorter than this length are padded, and inputs longer than this length are truncated. The data set is denoted $D = \{(X, Y)\}$, where $X = (x_1, x_2, \dots, x_n)$ represents the mapped network access data and $Y = (y_1, y_2, \dots, y_n)$ represents the class labels corresponding to the network access data, with 0 representing normal access and 1 representing malicious attack.

S12: segmentation of the data set. To test the effect of the model in practice, the data set is split into a training set and a testing set in a given proportion. The training set is used for model training, and the testing set is used for model selection.

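S2: model construction. An XGBoost model based on decision trees is built, comprising the following steps:

S21: constructing each independent decision tree. At each node $i$, the feature dimension $d_i$ of the input vector $x$ is compared with a threshold $t_i$, and $x$ is sent down the left or right branch according to the result of the comparison; the leaf nodes of the tree hold the prediction results of the model. For example, the first node judges whether $x_1$ is smaller than a threshold $t_1$; if it is smaller, the next node judges whether $x_2$ is smaller than a threshold $t_2$, and if so, $x$ enters the left leaf node. The spatial region corresponding to this leaf node is:

$$R_1 = \{x : x_1 \le t_1,\ x_2 \le t_2\};$$

The regions are linked to prediction outputs through axis-parallel splits. The average response

$$w_1 = \frac{\sum_n y_n \, \mathbb{I}(x_n \in R_1)}{\sum_n \mathbb{I}(x_n \in R_1)}$$

is associated with each such region, where $y_n$ is the class label. The single decision tree model is:

$$f(x; \theta) = \sum_{j=1}^{J} w_j \, \mathbb{I}(x \in R_j)$$

where $R_j$ is the region corresponding to the $j$-th leaf node, $w_j$ is the predicted output of that leaf node, $\theta = \{(R_j, w_j) : j = 1{:}J\}$, and $J$ is the number of leaf nodes.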

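S22: constructing the XGBoost-based abnormal traffic detection model.

On top of the original mean squared error, a regularization term is introduced:

$$\mathcal{L}(f) = \sum_i \ell\big(y_i, f(x_i)\big) + \Omega(f), \qquad \Omega(f) = \gamma J + \frac{1}{2} \lambda \sum_{j=1}^{J} w_j^2$$

where $J$ is the number of leaf nodes, and $\gamma \ge 0$ and $\lambda \ge 0$ are regularization coefficients.

In step $m$, the loss function is as follows:

$$\mathcal{L}_m(F_m) \approx \sum_i \Big[\ell\big(y_i, f_{m-1}(x_i)\big) + g_{im} F_m(x_i) + \frac{1}{2} h_{im} F_m(x_i)^2\Big] + \Omega(F_m)$$

where $f_{m-1}$ is the model after step $m-1$, $F_m$ is the tree added in step $m$, $g_{im}$ is the gradient, and $h_{im}$ is the Hessian (second derivative) of the loss:

$$g_{im} = \left.\frac{\partial \ell(y_i, f(x_i))}{\partial f(x_i)}\right|_{f = f_{m-1}}, \qquad h_{im} = \left.\frac{\partial^2 \ell(y_i, f(x_i))}{\partial f(x_i)^2}\right|_{f = f_{m-1}}$$

$F(x) = w_{q(x)}$, where $q(x) \in \{1, \dots, J\}$ assigns $x$ to a leaf node and $w = (w_1, \dots, w_J)$ are the weights of the leaf nodes.

Finally, XGBoost is integrated as follows:

$$f(x) = \sum_{m=1}^{M} w_m F_m(x)$$

where $M$ is the number of trees and $w_m$ is the corresponding weight.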

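S3: model training, comprising the following steps:

S31: solving for the optimal parameters using the loss function of XGBoost:

$$\mathcal{L}_m(q, w) \approx \sum_i \Big[g_{im} w_{q(x_i)} + \frac{1}{2} h_{im} w_{q(x_i)}^2\Big] + \gamma J + \frac{1}{2} \lambda \sum_{j=1}^{J} w_j^2 = \sum_{j=1}^{J} \Big[G_{jm} w_j + \frac{1}{2} (H_{jm} + \lambda) w_j^2\Big] + \gamma J$$

where $g_{im}$ and $h_{im}$ are the gradient and Hessian of the loss for sample $i$ at step $m$, terms that do not depend on the new tree are omitted, $G_{jm} = \sum_{i \in I_j} g_{im}$, $H_{jm} = \sum_{i \in I_j} h_{im}$, and $I_j = \{i : q(x_i) = j\}$ is the set of indices of the data samples assigned to the $j$-th leaf node.

This is a quadratic programming problem, with the optimal weights as follows:

$$w_j^* = -\frac{G_{jm}}{H_{jm} + \lambda}$$

For different tree structures $q$, the loss function is:

$$\mathcal{L}_m(q, w^*) = -\frac{1}{2} \sum_{j=1}^{J} \frac{G_{jm}^2}{H_{jm} + \lambda} + \gamma J$$

Minimizing this loss function yields the optimal parameters.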

The present application also provides a program product, which comprises a computer program stored in a storage medium, from which at least one processor can read the computer program, and the method of the above embodiment can be implemented when the at least one processor executes the computer program.

The embodiment of the application also provides a chip for running the instructions, and the chip is used for executing the method of the embodiment.

The present embodiments also provide a storage medium having instructions stored therein, which when run on a computer, cause the computer to perform the method of the embodiments as described above.

The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

To verify the effectiveness of XGBoost-based network anomaly intrusion detection, a public data set of network access data collected from a bank is used, of which 21991 records are normal accesses and 1097 are abnormal accesses. The data set is split 8:2 into a training set and a testing set. Model performance is evaluated on the testing set; the evaluation metric is the area under the ROC curve (AUC).

Step 1: experimental configuration and data preparation

Experimental configuration:

The experiment runs on a notebook computer with a 4-core Intel(R) i7-4720HQ CPU @ 2601 MHz and 16 GB of memory. The public data set is a bank's network access data set containing 21991 normal access records and 1097 attack access records.

Data preparation:

An example of one network access record from the bank data set is as follows:

Step 2: data preprocessing

First, the body of each access is extracted using regular expressions: the text between st@rt and INFO is removed, only the body between INFO and END is kept, and an <EOS> token is appended to mark the end of each network access text.

The text after regular-expression matching is:

The characters are then converted into numbers using a dictionary table; the result of the text conversion is:

Finally, the data set is labeled and divided into a training set and a verification set at a ratio of 8:2.
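The regular-expression filtering and dictionary conversion of Step 2 might look like the following Python sketch. The record layout is an assumption, since the example record is not reproduced in the text: each record is taken to have the form `st@rt <metadata> INFO <body> END`. The sample record, the function names and the choice of id 1 for `<EOS>` are hypothetical.

```python
import re

EOS = "<EOS>"
# Assumed layout: "st@rt <metadata> INFO <body> END"; only <body> is kept.
BODY_PATTERN = re.compile(r"st@rt.*?INFO(.*?)END", re.DOTALL)


def extract_body(record):
    """Drop the text between st@rt and INFO, keep the body up to END, append <EOS>."""
    match = BODY_PATTERN.search(record)
    if match is None:
        return None  # record does not follow the assumed layout
    return match.group(1).strip() + EOS


def build_dictionary(bodies):
    """Give <EOS> id 1 and every distinct character the next free id."""
    vocab = {EOS: 1}
    for body in bodies:
        for ch in body[:-len(EOS)]:
            vocab.setdefault(ch, len(vocab) + 1)
    return vocab


def to_ids(body, vocab):
    """Convert the characters of a body to ids, ending with the <EOS> id."""
    return [vocab[ch] for ch in body[:-len(EOS)]] + [vocab[EOS]]


# Hypothetical record, not taken from the bank data set.
record = "st@rt 2021-01-01 10.0.0.1 INFO GET /account?id=1 END"
body = extract_body(record)
vocab = build_dictionary([body])
ids = to_ids(body, vocab)
```

The non-greedy pattern stops at the first `END`, so a body that itself contains the string `END` would be cut short; a real pipeline would need a delimiter rule that matches the actual log format.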

Step 3: model construction and training

An XGBoost model object is created and configured with 100 simple decision trees. The training set is fed into the model, and the loss function is minimized to obtain the optimal parameters.
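To make the training step concrete, the following sketch boosts 100 depth-1 trees (stumps) in plain Python using the second-order leaf-weight formula. It is not the XGBoost library and not the experiment's configuration: the logistic loss, the learning rate, the value of lambda and the one-feature toy data are all assumptions chosen for illustration.

```python
import math

LEARNING_RATE = 0.3  # assumed shrinkage applied to every tree
LAMBDA = 1.0         # assumed L2 regularization on leaf weights


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, g, h):
    """Choose the (feature, threshold) split with the lowest structure score."""
    best = None
    G, H = sum(g), sum(h)
    for d in range(len(X[0])):
        for t in sorted({x[d] for x in X}):
            GL = sum(gi for x, gi in zip(X, g) if x[d] <= t)
            HL = sum(hi for x, hi in zip(X, h) if x[d] <= t)
            GR, HR = G - GL, H - HL
            score = -0.5 * (GL * GL / (HL + LAMBDA) + GR * GR / (HR + LAMBDA))
            if best is None or score < best[0]:
                # store the split and the optimal weights of both leaves
                best = (score, d, t, -GL / (HL + LAMBDA), -GR / (HR + LAMBDA))
    return best[1:]


def boost(X, y, n_trees=100):
    """Fit n_trees stumps, each on the gradients and Hessians of the current scores."""
    trees, f = [], [0.0] * len(X)
    for _ in range(n_trees):
        p = [sigmoid(fi) for fi in f]
        g = [pi - yi for pi, yi in zip(p, y)]   # logistic gradient
        h = [pi * (1.0 - pi) for pi in p]       # logistic Hessian
        d, t, wl, wr = fit_stump(X, g, h)
        trees.append((d, t, wl, wr))
        f = [fi + LEARNING_RATE * (wl if x[d] <= t else wr) for fi, x in zip(f, X)]
    return trees


def predict_proba(trees, x):
    """Probability of the attack class from the summed, shrunken tree outputs."""
    return sigmoid(sum(LEARNING_RATE * (wl if x[d] <= t else wr) for d, t, wl, wr in trees))


# Hypothetical one-feature data: small values are normal (0), large values are attacks (1).
X = [(1,), (2,), (3,), (8,), (9,), (10,)]
y = [0, 0, 0, 1, 1, 1]
trees = boost(X, y)
```

Each round recomputes the gradients and Hessians at the current scores, so later trees concentrate on the samples the ensemble still gets wrong.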

Step 4: model verification

Model performance is evaluated on the testing set; the evaluation metric is the area under the ROC curve (AUC).

Experimental results:

The experimental results are evaluated with the ROC curve. The task is essentially a binary classification problem, and the ROC curve is a common metric for evaluating binary classifiers. A classification model has four possible outcomes:

1. True Positive (TP): the classifier judges the access to be abnormal, and it is actually abnormal.

2. False Positive (FP): the classifier judges the access to be abnormal, but it is actually normal.

3. True Negative (TN): the classifier judges the access to be normal, and it is actually normal.

4. False Negative (FN): the classifier judges the access to be normal, but it is actually abnormal.

Given a rejection threshold η, two rates can be calculated: the true positive rate TPR = TP / (TP + FN), the proportion of abnormal accesses correctly judged abnormal, and the false positive rate FPR = FP / (FP + TN), the proportion of normal accesses incorrectly judged abnormal.

Each threshold η corresponds to one (FPR, TPR) coordinate; the series of coordinate points over all thresholds forms the ROC curve. AUC is the area under the ROC curve.
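The construction of the ROC curve and the AUC can be sketched in plain Python as follows. The labels and scores are hypothetical, an access is judged abnormal when its score is at least the threshold, and the area is computed with the trapezoidal rule; the sketch assumes both classes are present.

```python
def roc_curve(labels, scores):
    """Return the (FPR, TPR) points obtained by sweeping the threshold over all scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for eta in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= eta and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= eta and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical test-set labels (1 = abnormal) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
points = roc_curve(labels, scores)
area = auc(points)
```

In this toy case eight of the nine (abnormal, normal) pairs are ranked correctly, and the computed area equals that fraction, which is the usual interpretation of AUC.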

The ROC curve and AUC value of the ensemble-learning-based network anomaly intrusion detection on the bank data set are shown in FIG. 3.

As the ROC curve shows, the AUC value is 0.9957, which indicates that the method identifies abnormal network attacks very accurately.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims

1. A network anomaly intrusion detection method based on XGBoost, characterized in that the method comprises the following steps:

S1, preprocessing network access data;

S2, model construction: building an XGBoost model based on decision trees;

S3, model training: performing optimization training on the model.

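2. The method according to claim 1, characterized in that: in S1, the data preprocessing is specifically as follows:

S11, text encoding of the network access data: each character in the network access data is mapped to a corresponding number according to a code table; the input length required by the model is set, inputs shorter than this length are padded, and inputs longer than this length are truncated; the data set is denoted $D = \{(X, Y)\}$, where $X = (x_1, x_2, \dots, x_n)$ represents the mapped network access data and $Y = (y_1, y_2, \dots, y_n)$ represents the class labels corresponding to the network access data, with 0 representing normal access and 1 representing malicious attack;

S12, segmenting the data set into a training set and a testing set in a given proportion; the training set is used for model training, and the testing set is used for model selection.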

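3. The method according to claim 1, characterized in that: in S2, the model construction includes the following steps: S21, constructing each independent decision tree: at each node $i$, the feature dimension $d_i$ of the input vector $x$ is compared with a threshold $t_i$, and $x$ is sent down the left or right branch according to the result of the comparison; the leaf nodes of the tree hold the prediction results of the model; the first node judges whether $x_1$ is smaller than a threshold $t_1$;

if it is smaller, the next node judges whether $x_2$ is smaller than a threshold $t_2$, and if so, $x$ enters the left leaf node;

the spatial region corresponding to this leaf node is:

$$R_1 = \{x : x_1 \le t_1,\ x_2 \le t_2\};$$

the regions are linked to prediction outputs through axis-parallel splits; the average response

$$w_1 = \frac{\sum_n y_n \, \mathbb{I}(x_n \in R_1)}{\sum_n \mathbb{I}(x_n \in R_1)}$$

is associated with each such region, where $y_n$ is the class label; the single decision tree model is:

$$f(x; \theta) = \sum_{j=1}^{J} w_j \, \mathbb{I}(x \in R_j)$$

where $R_j$ is the region corresponding to the $j$-th leaf node, $w_j$ is the predicted output of that leaf node, $\theta = \{(R_j, w_j) : j = 1{:}J\}$, and $J$ is the number of leaf nodes;

S22, constructing the XGBoost-based abnormal traffic detection model:

on top of the original mean squared error, a regularization term is introduced:

$$\mathcal{L}(f) = \sum_i \ell\big(y_i, f(x_i)\big) + \Omega(f), \qquad \Omega(f) = \gamma J + \frac{1}{2} \lambda \sum_{j=1}^{J} w_j^2$$

where $\gamma \ge 0$ and $\lambda \ge 0$ are regularization coefficients;

in step $m$, the loss function is as follows:

$$\mathcal{L}_m(F_m) \approx \sum_i \Big[\ell\big(y_i, f_{m-1}(x_i)\big) + g_{im} F_m(x_i) + \frac{1}{2} h_{im} F_m(x_i)^2\Big] + \Omega(F_m)$$

where $f_{m-1}$ is the model after step $m-1$, $F_m$ is the tree added in step $m$, $g_{im}$ is the gradient, and $h_{im}$ is the Hessian (second derivative) of the loss:

$$g_{im} = \left.\frac{\partial \ell(y_i, f(x_i))}{\partial f(x_i)}\right|_{f = f_{m-1}}, \qquad h_{im} = \left.\frac{\partial^2 \ell(y_i, f(x_i))}{\partial f(x_i)^2}\right|_{f = f_{m-1}}$$

$F(x) = w_{q(x)}$, where $q(x) \in \{1, \dots, J\}$ assigns $x$ to a leaf node and $w = (w_1, \dots, w_J)$ are the weights of the leaf nodes;

finally, XGBoost is integrated as follows:

$$f(x) = \sum_{m=1}^{M} w_m F_m(x)$$

where $M$ is the number of trees and $w_m$ is the corresponding weight.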

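4. The method according to claim 1, characterized in that: in S3, the model training includes the following steps:

S31, solving for the optimal parameters using the loss function of XGBoost:

$$\mathcal{L}_m(q, w) \approx \sum_i \Big[g_{im} w_{q(x_i)} + \frac{1}{2} h_{im} w_{q(x_i)}^2\Big] + \gamma J + \frac{1}{2} \lambda \sum_{j=1}^{J} w_j^2 = \sum_{j=1}^{J} \Big[G_{jm} w_j + \frac{1}{2} (H_{jm} + \lambda) w_j^2\Big] + \gamma J$$

where $g_{im}$ and $h_{im}$ are the gradient and Hessian of the loss for sample $i$ at step $m$, terms that do not depend on the new tree are omitted, $G_{jm} = \sum_{i \in I_j} g_{im}$, $H_{jm} = \sum_{i \in I_j} h_{im}$, and $I_j = \{i : q(x_i) = j\}$ is the set of indices of the data samples assigned to the $j$-th leaf node;

this is a quadratic programming problem, with the optimal weights as follows:

$$w_j^* = -\frac{G_{jm}}{H_{jm} + \lambda}$$

for different tree structures $q$, the loss function is:

$$\mathcal{L}_m(q, w^*) = -\frac{1}{2} \sum_{j=1}^{J} \frac{G_{jm}^2}{H_{jm} + \lambda} + \gamma J$$

minimizing this loss function yields the optimal parameters.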

5. A computer system comprising a collector, a memory, a processor, and a computer program on the memory and executable on the processor, the collector collecting information, characterized in that: the processor, when executing the computer program, implements the steps of the method of any of the preceding claims 1 to 4.

6. An electronic device comprising a memory, a processor, and a computer program on the memory and executable on the processor, characterized by: the processor, when executing the computer program, implements the steps of the method of any of the preceding claims 1 to 4.

7. A non-transitory computer readable storage medium having a computer program stored thereon, characterized by: which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.