CN113904801B - Network intrusion detection method and system

Info

Publication number: CN113904801B
Application number: CN202111030340.5A
Authority: CN
Inventors: 徐凤振; 寿增; 汪明; 高明慧; 赵航; 卢楷; 马力; 张志军; 董昱; 许洪强; 周劼英; 詹雄; 张晓�; 李新鹏; 崔旭东; 何纪成; 王洋; 郭乃豪; 王浩; 赵宇
Original assignee: State Grid Corp of China SGCC; Beijing Kedong Electric Power Control System Co Ltd; State Grid Liaoning Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; Beijing Kedong Electric Power Control System Co Ltd; State Grid Liaoning Electric Power Co Ltd
Priority date: 2021-09-03
Filing date: 2021-09-03
Publication date: 2024-02-06
Anticipated expiration: 2041-09-03
Also published as: CN113904801A

Abstract

The invention discloses a network intrusion detection method comprising the following steps: organizing intercepted network data packets to obtain a network data set; performing feature engineering on the network data set to obtain a network detection data set; reducing the dimensionality of the network detection data set with a trained denoising self-coding neural network (denoising autoencoder, DAE) model; performing intrusion detection on the dimension-reduced network detection data with a trained XGBoost network intrusion detection model; constructing an intrusion database from the network data detected as intrusions and from publicly disclosed intrusion data; and periodically retraining the denoising self-coding neural network model and the XGBoost network intrusion detection model on the intrusion database, then performing intrusion detection on network data with the retrained models.

Description

Network intrusion detection method and system

Technical Field

The invention belongs to the technical field of network security, and particularly relates to a network intrusion detection method.

Background

With the advent of the Internet era, networks have penetrated every aspect of people's lives. Constantly evolving network intrusion techniques cause security problems such as personal information leakage, theft of confidential documents and account theft, resulting in immeasurable losses. How to build an effective network intrusion detection model has therefore received increasing attention from researchers.

In recent years, various machine learning algorithms have been applied to network intrusion detection, including classical algorithms such as KNN, decision trees and SVM. In practice, however, these algorithms suffer from low detection efficiency and a high false detection rate. One such method achieves a lower false detection rate but has the obvious drawback of a longer prediction time. To improve the detection rate and reduce the false detection rate, Shougarmy et al. prune features and samples during data processing according to the correlation between features and the similarity between samples of the same class, and then model with an SVM algorithm; however, directly discarding features and samples inevitably loses information, which limits detection accuracy. Chen Hong et al. reduce data dimensionality with a deep belief network (Deep Belief Networks, DBN) and perform intrusion detection with multiple gradient boosting trees, which works well on imbalanced intrusion data, but the processing is complex and the experiments take a long time. Zhang Yang et al. introduce the XGBoost algorithm into intrusion detection and obtain a good detection rate and a low false detection rate, but the approach has difficulty processing high-dimensional data and performs poorly on attack types with few samples.

Network data grows continuously, and a small number of intrusion behaviors are often hidden in a massive volume of normal network behavior. The network environment is also complex and influenced by many factors, so the data encountered in intrusion detection is large-scale, high-dimensional and imbalanced. Although existing research has made progress, it still has difficulty processing high-dimensional data and detects attack types with few samples poorly.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a network intrusion detection method which can accurately detect network intrusion data.

The technical problem addressed by the invention is solved by the following technical solution:

In a first aspect, a network intrusion detection method is provided, including: organizing the intercepted network data packets to obtain a network data set;

performing feature engineering processing on the network data set to obtain a network detection data set;

performing dimension reduction on the network detection data set by adopting a trained denoising self-coding neural network model;

performing intrusion detection on the network detection data after the dimension reduction by adopting a trained XGBoost network intrusion detection model;

constructing an intrusion database from the network data detected as intrusions and from publicly disclosed intrusion data;

and periodically retraining the denoising self-coding neural network model and the XGBoost network intrusion detection model on the intrusion database, and performing intrusion detection on network data with the retrained models.

With reference to the first aspect, further, the step of organizing the intercepted network data packets to obtain a network data set specifically includes: obtaining a network data set ND from the basic characteristic attributes of the TCP connection, the content attributes of the TCP connection, the time-based network traffic characteristic attributes and the host-based network traffic statistical characteristics in the intercepted network data packets.

With reference to the first aspect, further, the performing feature engineering processing on the network data set to obtain a network detection data set specifically includes:

digitizing the character-type data in the network data set using one-hot encoding, normalizing the numerical data in the network data set, and obtaining a network detection data set D from the digitized and normalized data.

With reference to the first aspect, further, the training process of the trained denoising self-coding neural network model includes:

manually labeling historical data from the same network environment, and forming a training data set from the labeled historical data and public network attack data;

digitizing the character-type data in the training data set using one-hot encoding, normalizing the numerical data in the training data set, and dividing the processed training data set into a training set T₁ and a test set T₂;

inputting the training set T₁ into the denoising self-coding neural network model to train the model;

inputting the test set T₂ into the trained denoising self-coding neural network model to test the model, and, if the model does not meet the standard, adjusting the model parameters and continuing training until the model meets the standard.

With reference to the first aspect, further, reducing the dimensionality of the network detection data set with the trained denoising self-coding neural network model includes:

inputting the training set T₁ and the test set T₂ into the encoder part of the trained denoising self-coding neural network model that meets the standard, and outputting the dimension-reduced training set T₁' and test set T₂';

inputting the network detection data set D into the encoder part of the trained denoising self-coding neural network model, the output of the encoder being the dimension-reduced network detection data set D'.

With reference to the first aspect, further, the training process of the trained XGBoost network intrusion detection model includes:

training the XGBoost network intrusion detection model on the dimension-reduced training set T₁';

testing the trained model on the dimension-reduced test set T₂', and, if the model does not meet the standard, adjusting the model parameters and continuing training until the test result meets the standard.

With reference to the first aspect, further, building the intrusion database includes:

inputting the dimension-reduced network detection data set D' into the XGBoost network intrusion detection model, putting network data detected as intrusions into a data set D_p, and putting network data detected as normal into a data set D_n;

constructing the intrusion database IDB from D_p and the network intrusion data disclosed by netlab.

In a second aspect, a network intrusion detection system is provided, comprising:

the data processing module is used for sorting the intercepted network data packets to obtain a network data set;

the intrusion detection module is used for reducing the dimension of the network detection data set by adopting a trained denoising self-coding neural network model;

The beneficial effects of the invention are as follows. Before the XGBoost model is used for detection, the network data is input into the DAE model to reduce its dimensionality. During training of the self-coding neural network model, some features of some samples are randomly masked or replaced before the samples are fed to the network, so that effective information can be extracted from massive, complex network data while its dimensionality is reduced, which improves the information extraction capability and robustness of the model. In addition, the invention adds a sample weighting (trade-off) factor to the loss function of the XGBoost model to influence how the trees are built, so that rare attack types carry more weight during modeling, which improves the model's ability to detect rare network intrusions. By combining the DAE's ability to extract information from large-scale, high-dimensional data with the ability of the weighted XGBoost model to detect imbalanced intrusion types, the invention builds a DAE-XGBoost intrusion detection model that addresses the difficulty of processing high-dimensional data and the poor detection of rare attack types in network intrusion detection.
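The idea of giving rare attack types more weight can be illustrated with a minimal sketch. The text here does not state how the factor is computed, so the inverse-frequency rule below (weight = N / (K × n_c), for N samples, K classes and n_c samples in class c) is purely an assumption for illustration, and the function name `inverse_frequency_weights` is hypothetical. Weights of this kind could, for example, be supplied as per-sample weights to a gradient boosting trainer.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return (per-sample weights, per-class weights).

    Assumption: weight = N / (K * n_c). This is an illustrative rule,
    not the factor defined by the patent.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    class_weight = {c: n_samples / (n_classes * n) for c, n in counts.items()}
    return [class_weight[y] for y in labels], class_weight

# Toy imbalanced label set: many normal records, very few U2R records.
labels = ["normal"] * 6 + ["dos"] * 3 + ["u2r"] * 1
weights, class_weight = inverse_frequency_weights(labels)
```

With this rule the rarest class receives the largest weight, and the weights sum to the number of samples, so the overall scale of the loss is unchanged.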

Drawings

Fig. 1 is a flowchart of a network intrusion detection method provided by the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order to better understand the present invention, the following describes related technologies in the technical solution of the present invention.

Example 1

The invention provides a network intrusion detection method, which comprises the following steps:

Step one, collecting network data with a network analysis tool, and selecting from the data packets the basic characteristic attributes of the TCP connection (TCP_B), the content attributes of the TCP connection (TCP_C), the time-based network traffic characteristic attributes (TCP_TF) and the host-based network traffic statistical characteristics (HOST_NF) to form a network data set ND. The specific content of each attribute is shown in Table 1:

Table 1 Details of attributes in network packets

Step two, performing data preprocessing and feature engineering on the collected network data set ND. The feature engineering is as follows: character-type features of the network data are converted by one-hot encoding into numerical vectors that a computer can process; for example, the protocol type takes the three values TCP, UDP and ICMP, which become [1,0,0], [0,1,0] and [0,0,1] respectively after one-hot encoding. Numerical features are normalized as shown in formula (1), which constrains the values to the range [0,1] to reduce the influence of differing feature scales on the modeling process. The result is the network detection data set D;

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

where x' is the normalized sample; x is the original sample; x_max is the maximum value of the sample; and x_min is the minimum value of the sample.
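The two feature engineering operations of step two can be sketched as follows. This is a minimal illustration in plain Python, not the embodiment's actual code: the helper names `one_hot` and `min_max_normalize` are hypothetical, the protocol order follows the example in the text, and mapping a constant column to zeros is an assumption made to avoid division by zero.

```python
def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over a fixed category order."""
    return [1 if value == c else 0 for c in categories]

def min_max_normalize(column):
    """Scale a numeric column into [0, 1]: x' = (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(column), max(column)
    if x_max == x_min:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in column]
    return [(x - x_min) / (x_max - x_min) for x in column]

PROTOCOLS = ["tcp", "udp", "icmp"]  # category order taken from the text's example
encoded = [one_hot(p, PROTOCOLS) for p in ["tcp", "udp", "icmp"]]
scaled = min_max_normalize([0, 50, 200])
```

In practice x_min and x_max would presumably be computed on the training data and reused for newly intercepted traffic, so that training and detection data share the same scale.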

Step three, using the network data collected by the network analyzer and network intrusion data published on the Internet, manually labeling all data as one of five types (normal, DoS attack, Probe attack, U2R attack and R2L attack) to form a pre-training data set T;

Step four, dividing the pre-training data set T in an 8:2 ratio into a training set T₁ and a test set T₂. T₁ is used to train the DAE model (denoising self-coding neural network model, i.e. a denoising autoencoder), and training stops when the loss function value falls below 0.0001. The test set T₂ is then used to test the trained DAE model: if the loss function value given by formula (6) is below 0.0001 on the test set, training is complete and the DAE model can be used in the network intrusion detection process; if it is above 0.0001, the number of neurons and the number of layers of the neural network are adjusted manually and the DAE model is retrained. After training, the training set T₁ and the test set T₂ are input into the encoder part of the trained DAE model, and the encoder outputs the low-dimensional training set T₁' and the low-dimensional test set T₂', which are used to train the XGBoost model.
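The training scheme of step four (corrupt the input, reconstruct the clean input, then reuse only the encoder for dimension reduction) can be sketched with a very small denoising autoencoder. Everything below is an illustrative assumption rather than the embodiment's network: a single hidden layer, sigmoid activations, a mean squared reconstruction loss standing in for formula (6), masking noise with probability p, per-sample gradient descent, and toy four-dimensional data. The class name `TinyDAE` is hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mask_noise(x, p, rng):
    """Randomly zero each feature with probability p (masking corruption)."""
    return [0.0 if rng.random() < p else v for v in x]

class TinyDAE:
    """One-hidden-layer denoising autoencoder trained with per-sample SGD."""

    def __init__(self, n_in, n_hidden, seed=0):
        self.rng = random.Random(seed)
        u = lambda: self.rng.uniform(-0.5, 0.5)
        self.W1 = [[u() for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.W2 = [[u() for _ in range(n_hidden)] for _ in range(n_in)]
        self.b2 = [0.0] * n_in

    def encode(self, x):
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.W1, self.b1)]

    def decode(self, h):
        return [sigmoid(sum(w * v for w, v in zip(row, h)) + b)
                for row, b in zip(self.W2, self.b2)]

    def loss(self, data):
        """Mean squared reconstruction error on clean (uncorrupted) inputs."""
        total = 0.0
        for x in data:
            y = self.decode(self.encode(x))
            total += sum((a - b) ** 2 for a, b in zip(y, x)) / len(x)
        return total / len(data)

    def fit(self, data, epochs=200, lr=0.5, p=0.2, target_loss=1e-4):
        for _ in range(epochs):
            for x in data:
                x_noisy = mask_noise(x, p, self.rng)   # corrupt the input
                h = self.encode(x_noisy)
                y = self.decode(h)
                # Back-propagate the error against the CLEAN input x
                # (constant factors of the MSE gradient are folded into lr).
                d2 = [(yi - xi) * yi * (1 - yi) for yi, xi in zip(y, x)]
                d1 = [sum(d2[k] * self.W2[k][j] for k in range(len(d2))) * hj * (1 - hj)
                      for j, hj in enumerate(h)]
                for k, dk in enumerate(d2):
                    self.W2[k] = [w - lr * dk * hj for w, hj in zip(self.W2[k], h)]
                    self.b2[k] -= lr * dk
                for j, dj in enumerate(d1):
                    self.W1[j] = [w - lr * dj * xi for w, xi in zip(self.W1[j], x_noisy)]
                    self.b1[j] -= lr * dj
            if self.loss(data) < target_loss:          # stop criterion from the text
                break
        return self

# Toy data: a dominant pattern and a rarer one, already scaled to [0, 1].
data = [[0.9, 0.1, 0.9, 0.1]] * 6 + [[0.1, 0.9, 0.1, 0.9]] * 2
dae = TinyDAE(n_in=4, n_hidden=2)
loss_before = dae.loss(data)
dae.fit(data)
loss_after = dae.loss(data)
low_dim = [dae.encode(x) for x in data]   # encoder output = reduced representation
```

Only `encode` is needed at detection time: the decoder exists to provide a training signal, which is why the text feeds T₁, T₂ and later D through the encoder part alone.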

Step five, training the XGBoost model on the low-dimensional training set T₁', and stopping training when the recall rate given by formula (16) is greater than or equal to 0.75. The intrusion detection performance of the XGBoost model is then tested on the low-dimensional test set T₂': if the recall rate in the test is greater than or equal to 0.75, training is complete and the XGBoost model can be used in the actual network intrusion detection process; if the recall rate is less than 0.75, the parameters of the XGBoost model are adjusted and the model is retrained.
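The train, test and adjust cycle of step five can be sketched as a simple loop over candidate parameter settings. This is a sketch under stated assumptions: recall is taken in its usual form TP / (TP + FN), the function `tune_until_recall` and the trainer interface (a function that takes parameters and returns a prediction callable) are hypothetical, and a toy threshold rule stands in for a real XGBoost model.

```python
def recall(y_true, y_pred, positive="attack"):
    """Recall = TP / (TP + FN) for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

def tune_until_recall(train_fn, candidates, x_test, y_test, threshold=0.75):
    """Try parameter settings in order; return the first one meeting the recall bar."""
    for params in candidates:
        model = train_fn(params)                     # hypothetical trainer interface
        score = recall(y_test, [model(x) for x in x_test])
        if score >= threshold:
            return params, score
    return None, 0.0

# Toy stand-in for a trained detector: flag a sample when its score exceeds a cut-off.
def toy_train(params):
    return lambda x: "attack" if x > params["cutoff"] else "normal"

x_test = [0.2, 0.4, 0.6, 0.8, 0.9]
y_test = ["normal", "attack", "attack", "attack", "attack"]
best_params, best_recall = tune_until_recall(
    toy_train, [{"cutoff": 0.7}, {"cutoff": 0.3}], x_test, y_test)
```

Here the first setting reaches a recall of only 0.5 and is rejected, while the second reaches 1.0 and is accepted, mirroring the "adjust the parameters and retrain" branch of the step.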

Step six, inputting the network detection data set D into the encoder part of the DAE model trained in step four; the encoder outputs the low-dimensional network detection data set D'.

Step seven, inputting the low-dimensional network detection data set D' into the XGBoost network intrusion detection model trained in step five; constructing a data set D_p and putting the network data detected as intrusions into D_p; constructing a data set D_n and putting the network data detected as normal into D_n.

Step eight, constructing an intrusion database IDB, and storing D_p and the network intrusion data disclosed by netlab in the IDB.
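Steps seven and eight amount to routing each detected record into one of two sets and then merging the intrusion set with public data. A minimal sketch follows; the function names, the label string "normal" and the in-memory lists are assumptions for illustration, since the text does not specify how D_p, D_n or the IDB are stored.

```python
def route_detections(records, predictions, normal_label="normal"):
    """Split records into D_p (detected as intrusion) and D_n (detected as normal)."""
    d_p, d_n = [], []
    for rec, pred in zip(records, predictions):
        (d_n if pred == normal_label else d_p).append(rec)
    return d_p, d_n

def build_idb(d_p, public_intrusions):
    """IDB = locally detected intrusions plus publicly disclosed intrusion records."""
    return list(d_p) + list(public_intrusions)

records = ["r1", "r2", "r3", "r4"]                   # placeholder record identifiers
predictions = ["normal", "dos", "normal", "probe"]   # placeholder model outputs
d_p, d_n = route_detections(records, predictions)
idb = build_idb(d_p, ["pub1"])
```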

Step nine, retraining the DAE and XGBoost models every two months using the whole IDB and 5000 records from D_n; retraining ends when the loss function value is below 0.0001 and the recall rate is greater than or equal to 0.75, so that the DAE and XGBoost models remain effective against new intrusion techniques.
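Assembling the retraining data of step nine can be sketched as below. The text only says that the whole IDB and 5000 items of D_n are used; drawing those items by uniform random sampling, and the function name `build_retraining_set`, are assumptions made for this illustration.

```python
import random

def build_retraining_set(idb, d_n, n_normal=5000, seed=0):
    """Combine every IDB record with up to n_normal records drawn from D_n.

    Assumption: the normal records are chosen by uniform random sampling.
    """
    rng = random.Random(seed)
    k = min(n_normal, len(d_n))          # D_n may hold fewer than n_normal records
    return list(idb) + rng.sample(list(d_n), k)

idb = [("intrusion", i) for i in range(30)]   # placeholder intrusion records
d_n = [("normal", i) for i in range(8000)]    # placeholder normal records
retrain = build_retraining_set(idb, d_n)
small = build_retraining_set(idb, d_n[:10])
```

Capping the number of normal records keeps the retraining set from being swamped by normal traffic, which is consistent with the document's concern about imbalanced data.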

The DAE model and the XGBoost model obtained through the preceding steps can be combined to detect network intrusions in a network environment. The invention aims to overcome the shortcomings of the prior art by providing a network intrusion detection model based on the DAE neural network and the XGBoost ensemble learning algorithm. It addresses the difficulty of processing high-dimensional data and the poor detection of imbalanced attack types in network intrusion detection, and it detects a wide range of intrusions effectively.

In the data preprocessing stage, the invention constructs a data structure of 26 fields. After data preprocessing and feature engineering, the network data collected by the network analysis tool is stored in a csv file in this structure. The structure is the following 26-tuple, which forms the features of each record in the network data set ND; the meaning of each field is given in Table 1:

<“duration”，“protocol_type”，“service”，“flag”，“src_bytes”，“dst_bytes”，“land”，“wrong_fragment”，“urgent”，“hot”，“num_failed_logins”，“logged_in”，“num_compromised”，“root_shell”，“su_attempted”，“num_root”，“num_file_creations”，“num_shells”，“num_access_files”，“num_outbound_cmds”，“is_hot_logins”，“is_guest_login”，“count”，“srv_count”，“dst_host_count”，“dst_host_srv_count”>

A record collected by the network analysis tool looks as follows:

<“2”,“tcp”,“smtp”,“SF”,“1684”,“363”,“0”,“0”,“0”,“0”,“0”,“1”,“0”,“0”,“0”,“0”,“0”,“0”,“0”,“0”,“0”,“0”,“1”,“1”,“104”,“66”>

After one-hot encoding and normalization, the record takes the following form in the data structure designed by the invention:

<“0.16”,“0,0,1”,“1,0(69)”,“1,0(10)”,“0.37”,“0.15”,“0”,“0”,“0”,“0”,“0”,“1”，“0”,“0”,“0”,“0”,“0”,“0”,“0”,“0”,“0”,“0”,“1”,“1”,“0.24”,“0.19”>

In the DAE dimension-reduction stage, the model parameters trained in advance in this embodiment are stored in a pkl file, so DAE dimension reduction during intrusion detection only requires loading these model parameters.

In the XGBoost intrusion detection stage, this embodiment stores the model parameters trained in advance in a pkl file. During intrusion detection, the file is loaded to restore the model, and the low-dimensional network detection data produced by DAE dimension reduction is then input into the XGBoost model to obtain the detection result. Data detected as intrusions is stored in D_p and, together with newly published network intrusion data from the Internet, saved in the intrusion database IDB; data detected as normal is stored in D_n.
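Storing trained parameters in a pkl file and restoring them at detection time can be sketched with Python's standard `pickle` module. The parameter dictionary, the file name `dae_params.pkl` and the helper names are hypothetical; the embodiment does not describe the layout of its pkl files.

```python
import os
import pickle
import tempfile

def save_params(params, path):
    """Serialize model parameters to a pkl file."""
    with open(path, "wb") as f:
        pickle.dump(params, f)

def load_params(path):
    """Restore model parameters from a pkl file (only load files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical parameter layout for a small encoder.
params = {"W1": [[0.1, -0.2], [0.3, 0.4]], "b1": [0.0, 0.0]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dae_params.pkl")
    save_params(params, path)
    restored = load_params(path)
```

Because unpickling can execute arbitrary code, model files used by an intrusion detection system should come only from a trusted location.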

In the periodic model update stage, the data in the intrusion database IDB and part of D_n are used to retrain the DAE model and the XGBoost model, which maintains their ability to detect new intrusion techniques.

The experimental results and analysis are given below:

The experimental environment is a computer with an Intel(R) Core(TM) i5-6300 @ 2.30 GHz CPU, 8 GB of memory, a 1 TB hard disk and the Windows 10 operating system. The experiments are run in Jupyter Notebook under Anaconda using the Python language.

The test data sets selected for the experiments are as follows:

The data set selected by the invention is the KDD99 data set. It originates from an intrusion detection evaluation project conducted in 1998 at MIT Lincoln Laboratory for the Defense Advanced Research Projects Agency (DARPA) of the United States Department of Defense, which simulated a variety of user types, network traffic and attack techniques to resemble a real network environment.

Performance analysis:

To measure how well each intrusion detection model classifies each type of attack, several evaluation indexes are used, namely precision, recall, false detection rate and F1 score, computed with the macro-average method commonly used for multi-class problems.
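Macro-averaged indexes of this kind can be computed as in the sketch below. The definitions are assumptions, because the patent's own formulas are not reproduced in this text: the per-class false detection rate is taken as FP / (FP + TN), and the macro F1 score is taken as the harmonic mean of macro precision and macro recall, a convention that reproduces the F1 values reported in Tables 4 and 5 from their precision and recall columns. The function name `macro_metrics` is hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1, macro false detection rate)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, false_rates = [], [], []
    for c in classes:
        # One-vs-rest confusion counts for class c.
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        tn = len(y_true) - tp - fp - fn
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        false_rates.append(fp / (fp + tn) if fp + tn else 0.0)  # assumed definition
    n = len(classes)
    macro_p = sum(precisions) / n
    macro_r = sum(recalls) / n
    macro_f1 = (2 * macro_p * macro_r / (macro_p + macro_r)
                if macro_p + macro_r else 0.0)
    return macro_p, macro_r, macro_f1, sum(false_rates) / n

y_true = ["normal", "normal", "dos", "dos", "probe", "probe"]
y_pred = ["normal", "dos", "dos", "dos", "probe", "normal"]
macro_p, macro_r, macro_f1, macro_fdr = macro_metrics(y_true, y_pred)
```

Because every class contributes equally to a macro average, poor performance on a rare class such as U2R lowers the score as much as poor performance on a common class, which is why this mode suits imbalanced intrusion data.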

To evaluate the model designed by the invention comprehensively, verification experiments are designed at two levels. The first is binary classification into normal and attack, in which all attack types other than normal are merged into a single attack class. The second is multi-class classification, in which the individual attack types are detected according to the labels in the data set.

The numbers of samples of each category in the training set and the test sets are shown in Table 2 and Table 3 respectively. To test how well the model generalizes to different data sets, four test sets were constructed for evaluation.

Table 2 Number of samples per attack category in the training set

Attack type	Normal	Dos	Probe	R2L	U2R	Total
Quantity	47278	191458	1607	500	180	241023

Table 3 Number of samples per attack category in the test sets

Test data	Normal	Dos	Probe	R2L	U2R	Total
DATA1	12634	5362	1032	26	35	19159
DATA2	2836	78326	262	323	6	81753
DATA3	32048	86343	569	148	32	119140
DATA4	2482	29969	637	129	27	33244

To compare detection performance, the invention designs four groups of comparison experiments: the DAE model combined with a random forest (RF), with k-nearest neighbors (knn) and with a support vector machine (SVM), and XGBoost used directly for detection without the DAE model. The evaluation indexes of each model are obtained.

(1) In the binary classification experiment, the average values of the evaluation indexes over the test data sets are shown in Table 4:

Table 4 Average evaluation indexes over the test sets for binary classification

Model	Precision	Recall	F1 score	False detection rate
DAE-knn	0.9795	0.9685	0.974	0.00214
DAE-SVM	0.9562	0.9488	0.9524	0.00162
DAE-RF	0.9626	0.9528	0.9577	0.00174
XGBoost	0.9842	0.9786	0.9813	0.00186
DAE-XGBoost	0.9921	0.9823	0.9871	0.0008

Taken together, the evaluation indexes in the table show that every method detects well in the binary classification case, and that the method designed by the invention improves only slightly on the comparison methods.

(2) For multi-class classification, the average evaluation indexes over the test sets are shown in Table 5:

Table 5 Average evaluation indexes over the test sets for multi-class classification

Model	Macro-P	Macro-R	Macro-F1	Macro false detection rate
DAE-knn	0.7594	0.7562	0.7578	0.0427
DAE-SVM	0.7847	0.7684	0.7765	0.0254
DAE-RF	0.8126	0.8042	0.8084	0.0204
XGBoost	0.8394	0.8094	0.8241	0.0214
DAE-XGBoost	0.8785	0.8572	0.8677	0.00809

As can be seen from the average evaluation indexes over the test sets in the table, when the attack types are unevenly distributed, the method designed by the invention detects the various attack types with higher precision and a lower false detection rate than the other methods.

Example 2

There is provided a network intrusion detection system comprising:

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims

1. A method for network intrusion detection, comprising:

organizing the intercepted network data packets to obtain a network data set;

periodically retraining the denoising self-coding neural network model and the XGBoost network intrusion detection model on the intrusion database, and performing intrusion detection on network data with the retrained models;

the training process of the trained denoising self-coding neural network model comprises the following steps:

digitizing the character-type data in the training data set using one-hot encoding, normalizing the numerical data in the training data set, and dividing the processed training data set into a training set T₁ and a test set T₂;

inputting the test set T₂ into the trained denoising self-coding neural network model to test the model, and, if the model does not meet the standard, adjusting the model parameters and continuing training until the model meets the standard;

the step of reducing the dimensionality of the network detection data set with the trained denoising self-coding neural network model comprises the following steps:

inputting the training set T₁ and the test set T₂ into the encoder part of the trained denoising self-coding neural network model that meets the standard, and outputting the dimension-reduced training set T₁' and test set T₂';

inputting the network detection data set D into the encoder part of the trained denoising self-coding neural network model, the encoder outputting the dimension-reduced network detection data set.

2. The network intrusion detection method according to claim 1, wherein the step of organizing the intercepted network data packets to obtain the network data set comprises: obtaining a network data set ND from the basic characteristic attributes of the TCP connection, the content attributes of the TCP connection, the time-based network traffic characteristic attributes and the host-based network traffic statistical characteristics in the intercepted network data packets.

3. A network intrusion detection method according to claim 2, wherein: the feature engineering processing is performed on the network data set to obtain a network detection data set specifically comprises the following steps:

4. The network intrusion detection method according to claim 1, wherein the training process of the trained XGBoost network intrusion detection model comprises:

testing the trained model on the dimension-reduced test set T₂', and, if the model does not meet the standard, adjusting the model parameters and continuing training until the test result meets the standard.

5. The network intrusion detection method according to claim 1, wherein building an intrusion database comprises:

constructing an intrusion database IDB from D_p and the network intrusion data disclosed by netlab, wherein netlab is a network laboratory.

6. The network intrusion detection method of claim 1 wherein the XGBoost network intrusion detection model incorporates a sample tradeoff factor.

7. A network intrusion detection system, comprising: