CN116541698A - XGBoost-based network anomaly intrusion detection method and system - Google Patents

XGBoost-based network anomaly intrusion detection method and system

Info

Publication number
CN116541698A
Authority
CN
China
Prior art keywords
model
xgboost
training
processor
network access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210089952.XA
Other languages
Chinese (zh)
Inventor
郭威
谢林江
毛正雄
罗震宇
何映军
张振红
杭菲璐
陈何雄
占梦来
张军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Center of Yunnan Power Grid Co Ltd
Original Assignee
Information Center of Yunnan Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-01-25
Filing date
2022-01-25
Publication date
2023-08-04
Application filed by Information Center of Yunnan Power Grid Co Ltd filed Critical Information Center of Yunnan Power Grid Co Ltd
Priority to CN202210089952.XA priority Critical patent/CN116541698A/en
Publication of CN116541698A publication Critical patent/CN116541698A/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an XGBoost-based network anomaly intrusion detection method and system. The method includes: preprocessing network access data, in which the text is encoded and padded to the length of the longest text string during the intrusion detection flow and the dataset is divided into a training set and a test set; constructing a model, in which an XGBoost model based on decision trees is built; and training the model with optimization. The invention combines weak classifiers through ensemble learning. Compared with traditional ensemble learning methods, it adds a regularization term on tree complexity, uses a second-order approximation of the loss function instead of a first-order (linear) approximation, and performs feature selection at internal nodes, which improves detection performance and efficiency on network access traffic.

Description

XGBoost-based network anomaly intrusion detection method and system
Technical Field
The invention relates to the field of network security, in particular to a network anomaly intrusion detection method and system based on XGBoost.
Background
Identifying abnormal network traffic usually requires analyzing large amounts of log information. Analysis based on preset rules has a serious limitation: the rules must be derived by studying and summarizing network attack methods that have already occurred, so new abnormal network attack activity cannot be distinguished in a timely and effective way. Using machine learning to identify abnormal network traffic is therefore an active research field.
However, abnormal network traffic data are high-dimensional and nonlinear, which limits the effectiveness of traditional machine learning algorithms. Overcoming the curse of dimensionality and improving the efficiency of the algorithm model are therefore important current research topics.
Disclosure of Invention
In order to solve the above problems, the present invention aims to provide a network anomaly intrusion detection method based on ensemble learning, which combines multiple models into a stronger model.
The technical scheme of the invention is as follows:
A network anomaly intrusion detection method based on XGBoost comprises the following steps:
S1: preprocessing the network access data.
In the intrusion detection process, the text is encoded and padded to the length of the longest text string, and the dataset is divided into a training set and a test set;
S2: constructing the model, in which an XGBoost model based on decision trees is built;
S3: training the model, in which optimization training is performed on the model.
Further, in S1, the data preprocessing is specifically as follows:
S11: text encoding of the network access data. Each character in the network access data is mapped to a corresponding number according to a code table. The input length required by the model is fixed: inputs that are too short are padded, and inputs that are too long are truncated. The dataset is denoted D = {(X, Y)}, where X = (x_1, x_2, …, x_n) represents the mapped network access data and Y = (y_1, y_2, …, y_n) represents the corresponding class labels, with 0 denoting normal access and 1 denoting a malicious attack;
S12: splitting the dataset into a training set and a test set in a given proportion. The training set is used for model training, and the test set is used for model selection.
Further, in S2, the model construction includes the following steps:
S21: constructing each independent decision tree. At each node i, feature dimension d_i of the input vector x is compared with a threshold t_i, and x is passed to the left or right branch according to the result; the leaf nodes of the tree hold the model's predictions. For example, the first node tests whether x_1 is no greater than the threshold t_1; if so, the next node tests whether x_2 is no greater than the threshold t_2; if so, x enters the left leaf node. The region of the input space corresponding to this leaf node is:
$$R_1 = \{x : x_1 \le t_1,\ x_2 \le t_2\}$$
Axis-parallel splits thus associate each region with a prediction output. The mean response
$$w_j = \frac{\sum_{n} y_n \, \mathbb{I}(x_n \in R_j)}{\sum_{n} \mathbb{I}(x_n \in R_j)}$$
is associated with each region, where y_n is the class label. The single decision tree model is:
$$f(x; \theta) = \sum_{j=1}^{J} w_j \, \mathbb{I}(x \in R_j)$$
where R_j is the region corresponding to the j-th leaf node, w_j is the predicted output of that leaf node, θ = {(R_j, w_j) : j = 1, …, J}, and J is the number of leaf nodes;
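S22: constructing the XGBoost-based abnormal traffic detection model.
A regularization term is added to the original mean-squared-error loss:
$$\mathcal{L}(f) = \sum_{i} \ell(y_i, f(x_i)) + \Omega(f), \qquad \Omega(f) = \gamma J + \frac{1}{2}\lambda \sum_{j=1}^{J} w_j^2$$
where J is the number of leaf nodes, and γ ≥ 0 and λ ≥ 0 are regularization coefficients.
At step m, the loss function is approximated to second order as:
$$\mathcal{L}_m(F_m) \approx \sum_{i}\left[\ell(y_i, f_{m-1}(x_i)) + g_{im} F_m(x_i) + \frac{1}{2} h_{im} F_m(x_i)^2\right] + \Omega(F_m)$$
where g_im is the gradient and h_im is the Hessian (second derivative) of the loss:
$$g_{im} = \left.\frac{\partial \ell(y_i, f(x_i))}{\partial f(x_i)}\right|_{f = f_{m-1}}, \qquad h_{im} = \left.\frac{\partial^2 \ell(y_i, f(x_i))}{\partial f(x_i)^2}\right|_{f = f_{m-1}}$$
Each tree is written as F(x) = w_{q(x)}, where q maps the input x to the index of a leaf node in {1, …, J} and w ∈ R^J holds the leaf-node weights.
Finally, the XGBoost ensemble is:
$$f(x) = \sum_{m} W_m F_m(x)$$
where W_m is the weight of the m-th tree.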
Further, in S3, the model training includes the following steps:
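S31: solving for the optimal parameters using the XGBoost loss function. Substituting F(x) = w_{q(x)} into the second-order approximation and dropping the terms that do not depend on the new tree gives:
$$\mathcal{L}_m(q, w) \approx \sum_{j=1}^{J}\left[G_{jm} w_j + \frac{1}{2}(H_{jm} + \lambda) w_j^2\right] + \gamma J, \qquad G_{jm} = \sum_{i \in I_j} g_{im}, \quad H_{jm} = \sum_{i \in I_j} h_{im}$$
where g_im and h_im are the first and second derivatives of the loss at step m, and I_j = {i : q(x_i) = j} is the set of indices of the data samples assigned to the j-th leaf node.
This objective is quadratic in each w_j, so the optimal weights are:
$$w_j^* = -\frac{G_{jm}}{H_{jm} + \lambda}$$
For a given tree structure q, the loss function then becomes:
$$\mathcal{L}_m(q, w^*) = -\frac{1}{2}\sum_{j=1}^{J}\frac{G_{jm}^2}{H_{jm} + \lambda} + \gamma J$$
The loss function is minimized to obtain the optimal parameters.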
The invention also relates to a computer system comprising a collector, a memory, a processor and a computer program on the memory and running on the processor, the collector collecting information, the processor executing the computer program realizing the steps of the method.
The invention also relates to an electronic device comprising a memory, a processor and a computer program on the memory and executable on the processor, which processor implements the steps of the above method when executing the computer program.
The invention also relates to a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as described above.
Compared with the prior art, the invention has the following beneficial effects:
(1) Compared with traditional ensemble learning methods, the method adds a regularization term on tree complexity, uses a second-order approximation of the loss function instead of a first-order (linear) approximation, and performs feature selection at internal nodes, which improves the detection performance and efficiency of the present invention on network access traffic.
(2) The method provided by the invention is a supervised learning method. The tree model performs well, is more interpretable than a neural network model, and can evaluate the importance of features.
Drawings
FIG. 1 is a flow chart of a detection method according to an embodiment of the present invention;
FIG. 2 is an algorithmic flow of XGBoost of an embodiment of the present invention;
FIG. 3 is an experimental result of an embodiment of the present invention verifying the effectiveness of the XGBoost algorithm.
Detailed Description
The technical solutions in this embodiment will be clearly and completely described in conjunction with the embodiment of the present invention, and it is obvious that the described embodiment is only a part of examples of the present invention, not all examples. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
In this embodiment, the base classifier is built on a single classification and regression tree (CART). The decision tree recursively partitions the input space and defines a local model on each resulting region, so the entire model can be represented as a tree. A single tree model tends to predict less accurately than other models, which results from the greedy algorithm used to fit the tree.
In combination with the idea of ensemble learning, the embodiment provides a relatively stable prediction classification algorithm by using an ensemble method based on XGBoost.
As shown in fig. 1, the method for detecting network anomaly intrusion based on XGBoost in this embodiment includes the following steps:
S1: network access data preprocessing.
The network access data are text strings. In the intrusion detection procedure, the text must be encoded and padded to the length of the longest text string, and the dataset is divided into a training set and a test set, specifically as follows:
S11: text encoding of the network access data. Each character in the network access data is mapped to a corresponding number according to a code table. The input length required by the model is fixed: inputs that are too short are padded, and inputs that are too long are truncated. The dataset is denoted D = {(X, Y)}, where X = (x_1, x_2, …, x_n) represents the mapped network access data and Y = (y_1, y_2, …, y_n) represents the corresponding class labels, with 0 denoting normal access and 1 denoting a malicious attack.
S12: splitting the dataset. To test the model's effect in practice, the dataset is split into a training set and a test set in a given proportion. The training set is used for model training, and the test set is used for model selection.
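Steps S11 and S12 can be illustrated with a minimal sketch in plain Python. The sample request strings, the choice of 0 as the padding id, the mapping of unknown characters to the padding id, and the helper names `build_code_table`, `encode`, and `split_dataset` are all assumptions made for illustration; the patent does not specify the code table or the shuffling procedure.

```python
import random

PAD_ID = 0  # assumed id reserved for padding (not specified in the source)


def build_code_table(texts):
    """Map every distinct character to a positive integer id."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: idx + 1 for idx, ch in enumerate(chars)}


def encode(text, code_table, max_len):
    """Encode one string: truncate long inputs, pad short ones with PAD_ID.

    Characters missing from the code table are mapped to PAD_ID (an assumption).
    """
    ids = [code_table.get(ch, PAD_ID) for ch in text[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


def split_dataset(samples, labels, train_ratio=0.8, seed=0):
    """Shuffle the indices and split (X, Y) into training and test portions."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    cut = int(len(order) * train_ratio)
    train, test = order[:cut], order[cut:]
    return ([samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in test], [labels[i] for i in test])


# Hypothetical access strings; label 0 = normal access, 1 = malicious attack.
texts = ["GET /index.html", "GET /a?id=1' OR '1'='1", "POST /login",
         "GET /img.png", "GET /../../etc/passwd"]
labels = [0, 1, 0, 0, 1]

table = build_code_table(texts)
max_len = max(len(t) for t in texts)  # pad to the longest text string
X = [encode(t, table, max_len) for t in texts]
X_train, y_train, X_test, y_test = split_dataset(X, labels)
```

Padding to the longest string keeps every row the same width, which is what a tabular model such as a tree ensemble expects; a fixed shorter `max_len` would instead truncate the longest requests.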
S2: constructing the model, in which an XGBoost model based on decision trees is built, comprising the following steps:
S21: constructing each independent decision tree. At each node i, feature dimension d_i of the input vector x is compared with a threshold t_i, and x is passed to the left or right branch according to the result; the leaf nodes of the tree hold the model's predictions. For example, the first node tests whether x_1 is no greater than the threshold t_1; if so, the next node tests whether x_2 is no greater than the threshold t_2; if so, x enters the left leaf node. The region of the input space corresponding to this leaf node is:
$$R_1 = \{x : x_1 \le t_1,\ x_2 \le t_2\}$$
Axis-parallel splits thus associate each region with a prediction output. The mean response
$$w_j = \frac{\sum_{n} y_n \, \mathbb{I}(x_n \in R_j)}{\sum_{n} \mathbb{I}(x_n \in R_j)}$$
is associated with each region, where y_n is the class label. The single decision tree model is:
$$f(x; \theta) = \sum_{j=1}^{J} w_j \, \mathbb{I}(x \in R_j)$$
where R_j is the region corresponding to the j-th leaf node, w_j is the predicted output of that leaf node, θ = {(R_j, w_j) : j = 1, …, J}, and J is the number of leaf nodes.
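The single-tree model of S21 can be sketched as follows, assuming a fixed depth-2 tree with thresholds t1 and t2 and three leaves. The toy data, the thresholds, and the function names are illustrative assumptions; in practice the splits are learned rather than given.

```python
def leaf_index(x, t1, t2):
    """Route x through a depth-2 tree and return the index of its leaf.

    Leaf 0 is R_1 = {x : x1 <= t1, x2 <= t2}; leaf 1 is x1 <= t1, x2 > t2;
    leaf 2 is x1 > t1.
    """
    if x[0] <= t1:
        return 0 if x[1] <= t2 else 1
    return 2


def fit_leaf_means(X, y, t1, t2, n_leaves=3):
    """Set w_j to the mean label of the training points that fall in R_j."""
    sums = [0.0] * n_leaves
    counts = [0] * n_leaves
    for x, label in zip(X, y):
        j = leaf_index(x, t1, t2)
        sums[j] += label
        counts[j] += 1
    # An empty region gets weight 0.0 (an assumption for this sketch).
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def predict(x, weights, t1, t2):
    """f(x; theta) = sum_j w_j * I(x in R_j); exactly one indicator is 1."""
    return weights[leaf_index(x, t1, t2)]


# Hypothetical two-feature samples with labels 0 (normal) and 1 (attack).
X = [[1, 1], [2, 1], [1, 5], [2, 6], [7, 0], [8, 3]]
y = [0, 0, 1, 0, 1, 1]
w = fit_leaf_means(X, y, t1=3, t2=2)
```

With these values the three leaf weights are 0.0, 0.5, and 1.0, so `predict([0, 0], w, 3, 2)` returns the mean label of region R_1.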
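S22: constructing the XGBoost-based abnormal traffic detection model.
A regularization term is added to the original mean-squared-error loss:
$$\mathcal{L}(f) = \sum_{i} \ell(y_i, f(x_i)) + \Omega(f), \qquad \Omega(f) = \gamma J + \frac{1}{2}\lambda \sum_{j=1}^{J} w_j^2$$
where J is the number of leaf nodes, and γ ≥ 0 and λ ≥ 0 are regularization coefficients.
At step m, the loss function is approximated to second order as:
$$\mathcal{L}_m(F_m) \approx \sum_{i}\left[\ell(y_i, f_{m-1}(x_i)) + g_{im} F_m(x_i) + \frac{1}{2} h_{im} F_m(x_i)^2\right] + \Omega(F_m)$$
where g_im is the gradient and h_im is the Hessian (second derivative) of the loss:
$$g_{im} = \left.\frac{\partial \ell(y_i, f(x_i))}{\partial f(x_i)}\right|_{f = f_{m-1}}, \qquad h_{im} = \left.\frac{\partial^2 \ell(y_i, f(x_i))}{\partial f(x_i)^2}\right|_{f = f_{m-1}}$$
Each tree is written as F(x) = w_{q(x)}, where q maps the input x to the index of a leaf node in {1, …, J} and w ∈ R^J holds the leaf-node weights.
Finally, the XGBoost ensemble is:
$$f(x) = \sum_{m} W_m F_m(x)$$
where W_m is the weight of the m-th tree.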
Further, in S3, the model training includes the following steps:
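S31: solving for the optimal parameters using the XGBoost loss function. Substituting F(x) = w_{q(x)} into the second-order approximation and dropping the terms that do not depend on the new tree gives:
$$\mathcal{L}_m(q, w) \approx \sum_{j=1}^{J}\left[G_{jm} w_j + \frac{1}{2}(H_{jm} + \lambda) w_j^2\right] + \gamma J, \qquad G_{jm} = \sum_{i \in I_j} g_{im}, \quad H_{jm} = \sum_{i \in I_j} h_{im}$$
where g_im and h_im are the first and second derivatives of the loss at step m, and I_j = {i : q(x_i) = j} is the set of indices of the data samples assigned to the j-th leaf node.
This objective is quadratic in each w_j, so the optimal weights are:
$$w_j^* = -\frac{G_{jm}}{H_{jm} + \lambda}$$
For a given tree structure q, the loss function then becomes:
$$\mathcal{L}_m(q, w^*) = -\frac{1}{2}\sum_{j=1}^{J}\frac{G_{jm}^2}{H_{jm} + \lambda} + \gamma J$$
The loss function is minimized to obtain the optimal parameters.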
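The closed-form leaf weights of S31 can be checked numerically with a small sketch. It assumes the squared-error loss 0.5 * (y - f)^2, for which the gradient is f - y and the Hessian is 1, and it takes the leaf assignment q as given; the toy values and the function names are hypothetical and are not taken from the patent.

```python
def grad_hess(y, f_prev):
    """Gradient and Hessian of 0.5 * (y - f)^2 with respect to f, at f_prev."""
    g = [fp - yi for yi, fp in zip(y, f_prev)]
    h = [1.0] * len(y)
    return g, h


def optimal_weights(g, h, leaf_of, n_leaves, lam):
    """Return w_j* = -G_j / (H_j + lambda), plus the per-leaf sums G and H.

    leaf_of[i] plays the role of q(x_i); assumes H_j + lam > 0 for every leaf.
    """
    G = [0.0] * n_leaves
    H = [0.0] * n_leaves
    for gi, hi, j in zip(g, h, leaf_of):
        G[j] += gi
        H[j] += hi
    w = [-Gj / (Hj + lam) for Gj, Hj in zip(G, H)]
    return w, G, H


def structure_score(G, H, lam, gamma):
    """Loss at the optimal weights: -0.5 * sum_j G_j^2 / (H_j + lam) + gamma * J."""
    fit = sum(Gj * Gj / (Hj + lam) for Gj, Hj in zip(G, H))
    return -0.5 * fit + gamma * len(G)


# Four samples, previous prediction 0.5 everywhere, two leaves.
y = [0.0, 0.0, 1.0, 1.0]
f_prev = [0.5, 0.5, 0.5, 0.5]
leaf_of = [0, 0, 1, 1]

g, h = grad_hess(y, f_prev)
w, G, H = optimal_weights(g, h, leaf_of, n_leaves=2, lam=1.0)
score = structure_score(G, H, lam=1.0, gamma=0.1)
```

Here the first leaf receives weight -1/3 and the second +1/3: the regularization coefficient lam shrinks each correction toward zero relative to the unregularized residual mean of ±0.5, and gamma adds a fixed cost per leaf to the score used to compare tree structures.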
The present application also provides a program product, which comprises a computer program stored in a storage medium, from which at least one processor can read the computer program, and the method of the above embodiment can be implemented when the at least one processor executes the computer program.
The embodiment of the application also provides a chip for running the instructions, and the chip is used for executing the method of the embodiment.
The present embodiments also provide a storage medium having instructions stored therein, which when run on a computer, cause the computer to perform the method of the embodiments as described above.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
To verify the effectiveness of XGBoost-based network anomaly intrusion detection, a public dataset of access data collected from a bank was used, containing 21991 normal access records and 1097 abnormal access records. The dataset was split 8:2 into a training set and a test set. Model performance was evaluated on the test set, with the area under the ROC curve (AUC) as the evaluation metric.
Step 1, experimental configuration and data preparation:
experimental configuration:
The experiment was run on a notebook computer with a 4-core Intel(R) i7-4720HQ CPU @ 2601 MHz and 16 GB of memory. The public dataset used is a bank's network access dataset containing 21991 normal access records and 1097 attack access records.
Data preparation:
an example of one network access of a bank dataset is as follows:
step 2: data preprocessing
First, regular expressions are used to filter the body of each access record: the text between st@rt and INFO is removed, only the body between INFO and END is retained, and an <EOS> marker is appended to the end of each network access text.
The text after regular matching is:
The characters are then converted into numbers using a dictionary table, wherein the result after the text conversion is as follows:
Finally, the dataset is labeled and split 8:2 into a training set and a validation set.
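The preprocessing in Step 2 can be sketched with Python's `re` module. The record layout below is hypothetical, because the example record is not reproduced in this text; only the INFO and END markers and the <EOS> token come from the description. The helper names and the choice to treat <EOS> as a single dictionary entry are assumptions.

```python
import re

# Non-greedy match of everything between the first INFO and the next END.
# A body that itself contains "END" would be cut short (sketch limitation).
BODY_PATTERN = re.compile(r"INFO(.*?)END", re.DOTALL)
EOS = "<EOS>"


def extract_body(record):
    """Keep only the text between INFO and END, then append the <EOS> marker."""
    match = BODY_PATTERN.search(record)
    if match is None:
        return None
    return match.group(1).strip() + EOS


def to_ids(body, vocab):
    """Convert a body produced by extract_body into integer ids.

    Each character is one token and the trailing <EOS> is one token; unseen
    tokens are added to vocab with the next free id, starting from 1.
    """
    tokens = list(body[:-len(EOS)]) + [EOS]
    return [vocab.setdefault(tok, len(vocab) + 1) for tok in tokens]


# Hypothetical record layout, for illustration only.
record = "st@rt 2021-01-01 10.0.0.1 INFO GET /a END"
body = extract_body(record)
vocab = {}
ids = to_ids(body, vocab)
```

Building the dictionary on the fly keeps the sketch short; a real pipeline would more likely fix the dictionary on the training set so that test-time ids stay consistent.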
Step 3: model construction and training
An XGBoost model object is created and configured with 100 simple decision trees. The training set is fed to the model, and the loss function is minimized to obtain the optimal parameters.
Step 4: model verification
Model performance is evaluated on the test set, with the area under the ROC curve (AUC) as the evaluation metric.
Experimental results:
The experimental results are evaluated with the ROC curve, a common metric for binary classification, which is the underlying problem here. A binary classifier produces four kinds of outcome:
1. True Positive (TP): the classifier judges the access to be abnormal, and it is actually abnormal.
2. False Positive (FP): the classifier judges the access to be abnormal, but it is actually normal.
3. True Negative (TN): the classifier judges the access to be normal, and it is actually normal.
4. False Negative (FN): the classifier judges the access to be normal, but it is actually abnormal.
Given a threshold η, two rates can be calculated: the true positive rate TPR = TP / (TP + FN), the proportion of abnormal accesses correctly judged abnormal, and the false positive rate FPR = FP / (FP + TN), the proportion of normal accesses incorrectly judged abnormal.
Each threshold η corresponds to one (FPR, TPR) coordinate, and the series of coordinate points over all thresholds forms the ROC curve. AUC is the area under the ROC curve.
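The construction of the ROC curve and the AUC can be reproduced with a short sketch in plain Python. The labels and scores below are made up for illustration and are unrelated to the bank dataset; the sketch assumes both classes are present and integrates with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the threshold eta.

    A sample is judged abnormal when its score is >= eta. Assumes labels
    contain at least one 1 (abnormal) and at least one 0 (normal).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for eta in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= eta and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= eta and l == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores: a higher score means "more likely abnormal".
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
```

For this toy input the area is 8/9, which matches the fraction of (abnormal, normal) pairs in which the abnormal sample receives the higher score.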
ROC curves and AUC values for network anomaly intrusion detection based on ensemble learning in a bank dataset are shown in fig. 3.
As the trend of the ROC curve shows, with an AUC value of 0.9957 the model identifies abnormal network attacks very accurately.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (7)

1. A network anomaly intrusion detection method based on XGBoost, characterized in that the method comprises the following steps:
S1: preprocessing the network access data.
In the intrusion detection process, the text is encoded and padded to the length of the longest text string, and the dataset is divided into a training set and a test set;
S2: constructing the model, in which an XGBoost model based on decision trees is built;
S3: training the model, in which optimization training is performed on the model.
2. The method according to claim 1, characterized in that: in S1, the data preprocessing is specifically as follows:
S11: text encoding of the network access data. Each character in the network access data is mapped to a corresponding number according to a code table. The input length required by the model is fixed: inputs that are too short are padded, and inputs that are too long are truncated. The dataset is denoted D = {(X, Y)}, where X = (x_1, x_2, …, x_n) represents the mapped network access data and Y = (y_1, y_2, …, y_n) represents the corresponding class labels, with 0 denoting normal access and 1 denoting a malicious attack;
S12: splitting the dataset into a training set and a test set in a given proportion. The training set is used for model training, and the test set is used for model selection.
3. The method according to claim 1, characterized in that: in S2, the model construction includes the following steps:
S21: constructing each independent decision tree. At each node i, feature dimension d_i of the input vector x is compared with a threshold t_i, and x is passed to the left or right branch according to the result; the leaf nodes of the tree hold the model's predictions. For example, the first node tests whether x_1 is no greater than the threshold t_1; if so, the next node tests whether x_2 is no greater than the threshold t_2; if so, x enters the left leaf node.
The region of the input space corresponding to this leaf node is:
$$R_1 = \{x : x_1 \le t_1,\ x_2 \le t_2\}$$
Axis-parallel splits thus associate each region with a prediction output. The mean response
$$w_j = \frac{\sum_{n} y_n \, \mathbb{I}(x_n \in R_j)}{\sum_{n} \mathbb{I}(x_n \in R_j)}$$
is associated with each region, where y_n is the class label. The single decision tree model is:
$$f(x; \theta) = \sum_{j=1}^{J} w_j \, \mathbb{I}(x \in R_j)$$
where R_j is the region corresponding to the j-th leaf node, w_j is the predicted output of that leaf node, θ = {(R_j, w_j) : j = 1, …, J}, and J is the number of leaf nodes;
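S22: constructing the XGBoost-based abnormal traffic detection model.
A regularization term is added to the original mean-squared-error loss:
$$\mathcal{L}(f) = \sum_{i} \ell(y_i, f(x_i)) + \Omega(f), \qquad \Omega(f) = \gamma J + \frac{1}{2}\lambda \sum_{j=1}^{J} w_j^2$$
where J is the number of leaf nodes, and γ ≥ 0 and λ ≥ 0 are regularization coefficients.
At step m, the loss function is approximated to second order as:
$$\mathcal{L}_m(F_m) \approx \sum_{i}\left[\ell(y_i, f_{m-1}(x_i)) + g_{im} F_m(x_i) + \frac{1}{2} h_{im} F_m(x_i)^2\right] + \Omega(F_m)$$
where g_im is the gradient and h_im is the Hessian (second derivative) of the loss:
$$g_{im} = \left.\frac{\partial \ell(y_i, f(x_i))}{\partial f(x_i)}\right|_{f = f_{m-1}}, \qquad h_{im} = \left.\frac{\partial^2 \ell(y_i, f(x_i))}{\partial f(x_i)^2}\right|_{f = f_{m-1}}$$
Each tree is written as F(x) = w_{q(x)}, where q maps the input x to the index of a leaf node in {1, …, J} and w ∈ R^J holds the leaf-node weights.
Finally, the XGBoost ensemble is:
$$f(x) = \sum_{m} W_m F_m(x)$$
where W_m is the weight of the m-th tree.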
4. The method according to claim 1, characterized in that: in S3, the model training includes the following steps:
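S31: solving for the optimal parameters using the XGBoost loss function. Substituting F(x) = w_{q(x)} into the second-order approximation and dropping the terms that do not depend on the new tree gives:
$$\mathcal{L}_m(q, w) \approx \sum_{j=1}^{J}\left[G_{jm} w_j + \frac{1}{2}(H_{jm} + \lambda) w_j^2\right] + \gamma J, \qquad G_{jm} = \sum_{i \in I_j} g_{im}, \quad H_{jm} = \sum_{i \in I_j} h_{im}$$
where g_im and h_im are the first and second derivatives of the loss at step m, and I_j = {i : q(x_i) = j} is the set of indices of the data samples assigned to the j-th leaf node.
This objective is quadratic in each w_j, so the optimal weights are:
$$w_j^* = -\frac{G_{jm}}{H_{jm} + \lambda}$$
For a given tree structure q, the loss function then becomes:
$$\mathcal{L}_m(q, w^*) = -\frac{1}{2}\sum_{j=1}^{J}\frac{G_{jm}^2}{H_{jm} + \lambda} + \gamma J$$
The loss function is minimized to obtain the optimal parameters.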
5. A computer system comprising a collector, a memory, a processor, and a computer program on the memory and executable on the processor, the collector collecting information, characterized in that: the processor, when executing the computer program, implements the steps of the method of any of the preceding claims 1 to 4.
6. An electronic device comprising a memory, a processor, and a computer program on the memory and executable on the processor, characterized by: the processor, when executing the computer program, implements the steps of the method of any of the preceding claims 1 to 4.
7. A non-transitory computer readable storage medium having a computer program stored thereon, characterized by: which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
CN202210089952.XA 2022-01-25 2022-01-25 XGBoost-based network anomaly intrusion detection method and system Pending CN116541698A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210089952.XA CN116541698A (en) 2022-01-25 2022-01-25 XGBoost-based network anomaly intrusion detection method and system

Publications (1)

Publication Number Publication Date
CN116541698A true CN116541698A (en) 2023-08-04

Family

ID=87442328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210089952.XA Pending CN116541698A (en) 2022-01-25 2022-01-25 XGBoost-based network anomaly intrusion detection method and system

Country Status (1)

Country Link
CN (1) CN116541698A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117081858A (en) * 2023-10-16 2023-11-17 山东省计算中心(国家超级计算济南中心) Intrusion behavior detection method, system, equipment and medium based on multi-decision tree
CN117081858B (en) * 2023-10-16 2024-01-19 山东省计算中心(国家超级计算济南中心) Intrusion behavior detection method, system, equipment and medium based on multi-decision tree
CN117811795A (en) * 2023-12-28 2024-04-02 苏州市职业大学(苏州开放大学) Campus network safety protection system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination