CN109299741B - Network attack type identification method based on multi-layer detection

Network attack type identification method based on multi-layer detection

Info

Publication number
CN109299741B
CN109299741B
Authority
CN
China
Prior art keywords
classification
data
data set
training
classification module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811146113.7A
Other languages
Chinese (zh)
Other versions
CN109299741A (en)
Inventor
胡昌振
吕坤
孙冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Publication of CN109299741A publication Critical patent/CN109299741A/en
Application granted granted Critical
Publication of CN109299741B publication Critical patent/CN109299741B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L 63/1416 Event detection, e.g. attack signature detection

Abstract

The invention relates to a network attack type identification method based on multilayer detection, and belongs to the technical field of information security. The specific operation steps are: step one, acquire original training data and preprocess them; step two, construct an integrated classification model; step three, train the integrated classification model; step four, preprocess the test data; step five, classify the test data. Compared with the prior art, the method has the following advantages. First, the SMOTE algorithm is adopted to up-sample minority-class samples and down-sample majority-class samples, solving the sample-imbalance problem in the data set. Second, an integrated model is adopted, improving detection accuracy and recall. Third, the fruit fly optimization algorithm (FOA) is combined with the Support Vector Machine (SVM) to realize optimal, adaptive selection of the parameters C and γ in the SVM.

Description

Network attack type identification method based on multi-layer detection
Technical Field
The invention relates to a network attack type identification method based on multilayer detection, and belongs to the technical field of information security.
Background
In cyberspace, the number and scale of network attacks have increased dramatically in recent years. The basic types of network attack include Denial of Service (DoS), unauthorized remote host access (Remote-to-Local, R2L), unauthorized super-user access (User-to-Root, U2R) and probing (Probing), each of which comprises several sub-attack types. To detect these attacks effectively, deploying efficient intrusion detection systems has become an urgent task.
The commonly used network attack detection methods are as follows. First, rule-based detection; its disadvantages are that new intrusions are difficult to detect, and editing the rules is time-consuming and highly dependent on the known intrusion knowledge base. Second, entropy-based detection relying on the distribution of network traffic features; its defect is that entropy expresses randomness, so abnormal traffic that does not disturb randomness cannot be detected. Third, detection based on machine learning, such as neural networks, support vector machines and clustering algorithms; these methods can detect new intrusions and are widely applied in current intrusion detection, but their results are strongly affected by data imbalance and by the parameters of the algorithm model.
Disclosure of Invention
The invention aims to solve the problems of unbalanced network attack detection data sets and low accuracy and recall rate of a network attack classification algorithm, and provides a network attack type identification method based on multilayer detection.
The invention is realized by the following technical scheme.
The invention provides a network attack type identification method based on multilayer detection, which comprises the following specific operation steps:
step one, acquiring original training data and preprocessing.
Step 1.1: and acquiring network attack data to form an original training data set. The network attack data includes numerical features and character discrete features. The character discrete type features include: protocol type, service type, and connection error identification.
Step 1.2: each piece of original training data in the original training data set is converted into a numerical type original training data feature vector. The method specifically comprises the following steps:
Step 1.2.1: extract the character discrete features from each piece of data and encode each of them as a one-hot vector, one one-hot vector per character discrete feature.
Step 1.2.2: constructing a numerical characteristic vector by using the value of the numerical characteristic in each piece of data;
step 1.2.3: and merging the numerical characteristic vector in the step 1.2.2 with all the one-hot vectors obtained in the step 1.2.1.
Through the operation of the steps, a numerical-type original training data feature vector is obtained corresponding to an original training data.
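For illustration, a minimal Python sketch of this preprocessing step is given below; the library (scikit-learn ≥ 1.2), function names and column lists are assumptions, as the patent does not prescribe an implementation.

```python
# A sketch of step 1.2, assuming scikit-learn >= 1.2 and hypothetical column names.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def to_feature_vectors(records, char_cols, num_cols):
    """One-hot encode the character discrete features and merge them
    with the numerical feature vector (steps 1.2.1-1.2.3)."""
    char_part = np.array([[r[c] for c in char_cols] for r in records])
    num_part = np.array([[r[c] for c in num_cols] for r in records], dtype=float)
    encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    onehot = encoder.fit_transform(char_part)   # one one-hot block per character feature
    return np.hstack([num_part, onehot]), encoder

# Hypothetical usage with the three character features named in step 1.1:
# X_raw, encoder = to_feature_vectors(rows, ["PROTOCOL_TYPE", "SERVICE", "FLAG"], numeric_cols)
```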
Step 1.3: the problem of unbalanced quantity of each type of data of the original training data set is solved through data down sampling and data up sampling. The method specifically comprises the following steps:
case 1: if the amount of data of a certain type (denoted by symbol a) in the original training data set is much larger than that of data of other types, the amount of a type a is reduced by using a data down-sampling method, specifically: a part of data is randomly extracted from the data of the type A to reduce the data of the type A.
Case 2: if the number of a certain type (denoted by symbol B) in the original training data set is much lower than the number of other types of data, the data up-sampling method is adopted to increase the number of B types of data.
The data up-sampling algorithm is the SMOTE (Synthetic Minority Oversampling Technique) algorithm.
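A minimal sketch of step 1.3 follows, assuming the imbalanced-learn library; the patent names only the SMOTE algorithm, not a library, so the calls below are an illustrative assumption.

```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

def rebalance(X, y, down_counts, up_counts, k_neighbors=5):
    """Case 1: randomly down-sample majority classes to the counts in down_counts.
    Case 2: up-sample minority classes to the counts in up_counts with SMOTE."""
    X, y = RandomUnderSampler(sampling_strategy=down_counts).fit_resample(X, y)
    X, y = SMOTE(sampling_strategy=up_counts, k_neighbors=k_neighbors).fit_resample(X, y)
    return X, y
```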
The original training data set after one-hot coding, data down-sampling and data up-sampling is called the basic training data set, denoted by the symbol X. The symbol x_ij denotes the j-th feature of the i-th piece of data in X, i ∈ [1, n], where n is the number of data items in X.
step 1.4: the data in the basic training data set X is normalized by equation (1).
x'_ij = (x_ij − AVG_j) / STD_j    (1)

where x'_ij is the value obtained after normalizing x_ij; AVG_j is the mean of the j-th feature over all data in the basic training data set X, computed by formula (2); and STD_j is the standard deviation of the j-th feature over all data in X, computed by formula (3).

AVG_j = (1/n) × Σ_{i=1}^{n} x_ij    (2)

STD_j = sqrt( (1/n) × Σ_{i=1}^{n} (x_ij − AVG_j)² )    (3)
After the basic training data set is preprocessed through the operation of the first step, a training data set is obtained and is represented by a symbol X'.
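A sketch of the standardization of formulas (1)-(3) is given below; note that step 4.3 later reuses the training-set statistics AVG_j and STD_j on the test data. The zero-variance guard is an added assumption, not part of the patent text.

```python
import numpy as np

def standardize_fit(X):
    """Formulas (2) and (3): per-feature mean AVG_j and standard deviation STD_j."""
    avg = X.mean(axis=0)
    std = X.std(axis=0)                 # population standard deviation, ddof = 0
    std[std == 0.0] = 1.0               # assumed guard against constant features
    return avg, std

def standardize_apply(X, avg, std):
    """Formula (1): x'_ij = (x_ij - AVG_j) / STD_j."""
    return (X - avg) / std
```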
And step two, constructing an integrated classification model.
The integrated classification model comprises a GBDT (Gradient Boosting Decision Tree) classifier, a KNN classifier and a stacking classifier.
The GBDT classifier learns with the Boosting idea by iteratively constructing Classification And Regression Trees (CART). Let f_{t−1}(x) denote the GBDT classifier obtained after iteration t−1, where t is a positive integer; let f_t(x) denote the classifier obtained after iteration t; let L(y, f_{t−1}(x)) and L(y, f_t(x)) denote the loss functions of those classifiers; and let h_t(x) denote the fitting function learned in round t. In round t of GBDT learning, the task is to find the h_t(x) for which L(y, f_t(x)) in formula (4) takes its minimum value; the minimizing h_t(x) is found by fitting the negative gradient of the loss function.

L(y, f_t(x)) = L(y, f_{t−1}(x) + h_t(x))    (4)
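As an illustration, scikit-learn's GradientBoostingClassifier implements this negative-gradient fitting of CART trees; the library choice and hyperparameters below are assumptions, not the patent's specification.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Each boosting round t fits a CART tree h_t(x) to the negative gradient of the
# loss L(y, f_{t-1}(x)), realizing the update in formula (4).
gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=3)
# gbdt.fit(X_train, y_binary)     # y_binary: DoS vs. other labels from step 3.1.1
```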
The KNN classifier is used for classifying DoS (Denial of Service) type data and predicting their subtypes. The parameter K of the KNN classifier is set to 3.
The stacking classifier is used for classifying non-DoS type data. It is divided into a primary classification model and a secondary classification model. The primary model has an upper layer and a lower layer. The upper layer is formed by connecting 3 xgboost (eXtreme Gradient Boosting) classification module groups, 1 SVM (Support Vector Machine) classification module group, 1 GBDT classification module group and 1 RF (Random Forest) classification module group in parallel. Each xgboost classification module group is formed by connecting m xgboost classification modules in parallel, each SVM classification module group by m SVM classification modules in parallel, each GBDT classification module group by m GBDT classification modules in parallel, and each RF classification module group by m RF classification modules in parallel; m is a manually set value, m ∈ [3, 8]. The lower layer of the primary model is a splicing and voting module.
And the output ends of the 3 xgboost classification module groups, the 1 SVM classification module group, the 1 GBDT classification module group and the 1 RF classification module group at the upper layer of the primary model are respectively connected with the input ends of the splicing and voting modules at the lower layer of the primary model. In the training phase, the splicing and voting module is used for: and combining the output results of each xgboost classification module group, SVM classification module group, GBDT classification module group and RF classification module group of the upper layer of the primary model to obtain a vector matrix called stacking vector matrix. In the testing stage, the splicing and voting module is used for: corresponding to a piece of test data, output results of each xgboost classification module group, SVM classification module group, GBDT classification module group and RF classification module group of the upper layer of the primary model are respectively voted, each classification module group obtains a classification result, and then the classification results are combined to obtain a 1 x 6 stacking feature vector.
The secondary model is an SVM classifier, and the input of the secondary model is the stacking feature vector generated by the primary model.
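The following structural sketch (an assumed, simplified API) shows how the test-stage path through the stacking classifier can be organized: a per-group majority vote over its m modules, the six votes merged into a 1 × 6 stacking vector, and the secondary SVM deciding on that vector. Class labels are assumed to be integer-encoded so the vector can feed the SVM.

```python
from collections import Counter
import numpy as np

class StackingClassifier:
    def __init__(self, module_groups, secondary_svm):
        # module_groups: 3 xgboost + 1 SVM + 1 GBDT + 1 RF groups, each a list
        # of m trained modules; secondary_svm: the trained secondary-model SVM.
        self.module_groups = module_groups
        self.secondary_svm = secondary_svm

    def predict_one(self, x):
        votes = []
        for group in self.module_groups:
            preds = [m.predict(x.reshape(1, -1))[0] for m in group]
            votes.append(Counter(preds).most_common(1)[0][0])       # per-group vote
        stacking_vec = np.array(votes, dtype=float).reshape(1, -1)  # 1 x 6 stacking vector
        return self.secondary_svm.predict(stacking_vec)[0]
```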
The SVM classifier adopts the fruit fly optimization algorithm (FOA) to optimally select the SVM kernel function parameter (denoted by the symbol γ) and the penalty parameter (denoted by the symbol C). The specific operation steps are as follows:
Step 2.1: initialize the SVM kernel parameter γ and penalty parameter C, γ ∈ [0.001, 5], C ∈ [0.001, 5]. Set the starting position of the fruit fly swarm as (C_begin, γ_begin), where C_begin = C and γ_begin = γ.
Step 2.2: the population size (denoted by the symbol popsize), the number of iterations (denoted by the symbol epoch), and the search distance (denoted by the symbol val) of the penalty parameter C are setCRepresentation) and the search distance of the kernel parameter y (denoted by the symbol val)γRepresentation). popsize E [8,15 ]],epoch≥5,valC∈[0.05,0.5],valγ∈[0.001,0.01]。
Step 2.3: calculating the position of the pth fruit fly at the next moment according to the formulas (6) to (7), and using the symbol (C)pp) Denotes p ∈ [1, popsize >]。
Cp=Cbegin+valC×ε (6)
γp=γbegin+valγ×ε (7)
Wherein ε is a random value in the range of [ -1,1 ].
Step 2.4: if the penalty parameter C is less than 0.001 at the moment, C is 0.001; if C >5, then C ═ 5. If γ <0.001, γ is 0.001; when γ is greater than 5, γ is 5.
Step 2.5: and (4) calculating the fitness function values of the positions of all the drosophila flies obtained in the step 2.3 according to a formula (8).
Fit(Cpp)=accuracy(Cpp) (8)
Wherein, Fit (C)pp) The fitness function value of the position of the pth fruit fly is shown; accuracy (C)pp) Representing the SVM classifier at parameter (C)qq) Upper cross validation generated accuracy, Cq=Cpq=γp
Step 2.6, finding the maximum value (using the maximum value) in the fitness function values corresponding to the positions of all fruit flies at the current momentSymbol FitmaxRepresentation), and FitmaxThe corresponding position is judged to be Fit at the momentmaxIf the fitness function value is higher than the fitness function value of the initial position, Fit is usedmaxThe corresponding position replaces the initial position while saving the FitmaxThen the next iteration is performed. If it is at that time FitmaxIf the fitness function value is lower than the fitness function value of the initial position, the step 2.3 to the step 2.6 are repeatedly executed until the iteration times reach the epoch times, and the operation is finished.
The connection relation of the integrated classification model is as follows: external data enters the integrated classification model through the input end of the GBDT classifier; the output end of the GBDT classifier is respectively connected with the input ends of the KNN classifier and the stacking classifier; and the output of the KNN classifier and the stacking classifier is used as the external output of the integrated classification model.
And step three, training an integrated classification model.
And training an integrated classification model on the basis of the operation of the step one and the operation of the step two. The method specifically comprises the following steps:
step 3.1: the GBDT classifier is trained. The method specifically comprises the following steps:
Step 3.1.1: label the data in the training data set X' by category, as one of 2 classes: DoS type and other type.
Step 3.1.2: the GBDT classifier is trained using the labeled training data set X'.
Through the operation of step 3.1, the trained GBDT classifier is obtained.
Step 3.2: and training the KNN classifier. The method specifically comprises the following steps:
Step 3.2.1: construct a DoS type data set from the data labeled as DoS type in the training data set X', denoted by the symbol X'_1.

Step 3.2.2: label the data in the DoS type data set X'_1 with fine-grained classes. The data in X'_1 are subdivided into: smurf attacks, neptune attacks, back attacks, teardrop attacks, pod attacks, and Other attacks.
Step 3.2.3: to DoS type data set X'1Performing data down-sampling processing according to the subdivision type to solve the DoS type data set X'1The quantity of each subdivision type data is unbalanced; the data set after data down-sampling, called KNN training data set, is represented by symbol X1And (4) showing.
Step 3.2.4: training dataset X using KNN1And training the KNN classifier.
Through the operation of step 3.2, the trained KNN classifier is obtained.
Step 3.3: training a stacking classifier. The method specifically comprises the following steps:
Step 3.3.1: construct a stacking training data set from the data labeled as other types in the training data set X', denoted by the symbol X_2, and then label its data with fine-grained classes. The data in X_2 are subdivided into: Normal, Probe, U2L (unauthorized super-user access) and R2L (Remote-to-Local, unauthorized remote host access).
Step 3.3.2: training data set X2Is divided into m subsets, called 1 st subset, 2 nd subset, … …, m subsets. The number of data per subset is denoted by the symbol M, which is a positive integer.
Step 3.3.3: the set of RF classification modules is trained. The method specifically comprises the following steps:
step 3.3.3.1: the temporary variable is denoted by the symbol h, h ∈ [1, m ]. The initial value of h is set to 1.
Step 3.3.3.2: training data set X2As verification data. Then, using stacking training data set X2As training data, to train an untrained RF classification module of the set of RF classification modules.
Step 3.3.3.3: and inputting the data of the h-th subset into the trained RF classification module in step 3.3.3.2 for classification, so as to obtain an M × 1 vector matrix.
Step 3.3.3.4: if h < m, the value of h is incremented by 1 and steps 3.3.3.2 through 3.3.3.4 are repeated. Otherwise, the operation of step 3.3.3.5 is performed.
Step 3.3.3.5: and merging the classification results of the 1 st subset to the m th subset obtained in the step 3.3.3.2 to obtain a classification result of the data of the stacking training data set in the RF classification module group, and sending the classification result to the splicing and voting module.
Through the operations of steps 3.3.3.1 to 3.3.3.5, training of the RF classification module group is completed, and the classification result of the stacking training data set X_2 in the RF classification module group is obtained. The same out-of-fold procedure recurs for the other module groups below; a generic sketch follows.
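A generic sketch of this per-group training loop, assuming scikit-learn's clone as a convenience for copying an untrained module:

```python
import numpy as np
from sklearn.base import clone

def train_module_group(base_module, subsets_X, subsets_y):
    """Train the m modules of one classification module group: the h-th module is
    validated on the h-th subset and trained on the remaining m-1 subsets."""
    m = len(subsets_X)
    modules, oof_parts = [], []
    for h in range(m):
        X_tr = np.vstack([subsets_X[k] for k in range(m) if k != h])
        y_tr = np.concatenate([subsets_y[k] for k in range(m) if k != h])
        module = clone(base_module).fit(X_tr, y_tr)
        modules.append(module)
        oof_parts.append(module.predict(subsets_X[h]))  # the M x 1 vector of step 3.3.3.3
    return modules, np.concatenate(oof_parts)           # merged result for the group
```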
Step 3.3.4: training the SVM classification module group. The method specifically comprises the following steps:
step 3.3.4.1: the temporary variable is denoted by the symbol h, h ∈ [1, m ]. The initial value of h is set to 1.
Step 3.3.4.2: training data set X2As verification data. Then, using stacking training data set X2The other data is used as training data to train an untrained SVM classification module in the SVM classification module group.
Step 3.3.4.3: and inputting the data of the h-th subset into the SVM classification module trained in the step 3.3.4.2 for classification, so as to obtain an M × 1 vector matrix.
Step 3.3.4.4: if h < m, the value of h is incremented by 1 and steps 3.3.4.2 through 3.3.4.4 are repeated. Otherwise, the operation of step 3.3.4.5 is performed.
Step 3.3.4.5: merging the classification results of the 1 st subset to the m th subset obtained in the step 3.3.4.2 to obtain a stacking training data set X2The classification result of the data in the SVM classification module group is sent to the splicing and voting module.
Through the operations from step 3.3.4.1 to step 3.3.4.5, training of the SVM classification module group is completed, and a classification result of data of a stacking training data set in the SVM classification module group is obtained.
Step 3.3.5: and training the GBDT classification module group. The method specifically comprises the following steps:
step 3.3.5.1: the temporary variable is denoted by the symbol h, h ∈ [1, m ]. The initial value of h is set to 1.
Step 3.3.5.2: training data set X2As verification data. Then, using stacking training data set X2As training data, training an untrained GBDT classification module in the GBDT classification module group.
Step 3.3.5.3: and inputting the data of the h-th subset into the GBDT classification module trained in the step 3.3.5.2 for classification, so as to obtain an M × 1 vector matrix.
Step 3.3.5.4: if h < m, the value of h is incremented by 1 and steps 3.3.5.2 through 3.3.5.4 are repeated. Otherwise, the operation of step 3.3.5.5 is performed.
Step 3.3.5.5: merging the classification results of the 1 st subset to the m th subset obtained in the step 3.3.5.2 to obtain a stacking training data set X2And (4) the classification result of the data in the GBDT classification module group is sent to the splicing and voting module.
Through the operations from step 3.3.5.1 to step 3.3.5.5, the training of the GBDT classification module group is completed, and the classification result of the data of the stacking training data set in the GBDT classification module group is obtained.
Step 3.3.6: and training an XGBOOST classification module group. The method specifically comprises the following steps:
step 3.3.6.1: the temporary variable is denoted by the symbol h, h ∈ [1, m ]. The initial value of h is set to 1.
Step 3.3.6.2: training data set X2As verification data. Then, using stacking training data set X2The other data of XGBOOST classification module group is used as training data to train an untrained XGBOOST classification module in the XGBOOST classification module group.
Step 3.3.6.3: and inputting the h-th subset data into the trained XGBOOST classification module in step 3.3.6.2 for classification to obtain an Mx 1 vector matrix.
Step 3.3.6.4: if h < m, the value of h is incremented by 1 and steps 3.3.6.2 through 3.3.6.4 are repeated. Otherwise, the operation of step 3.3.6.5 is performed.
Step 3.3.6.5: and merging the classification results of the 1 st subset to the m th subset obtained in the step 3.3.6.2 to obtain the classification result of the data of the stacking training data set in the XGBOOST classification module group, and sending the classification result to the splicing and voting module.
Through the operations of steps 3.3.6.1 to 3.3.6.5, training of the XGBOOST classification module group is completed, and the classification result of the stacking training data set X_2 in the XGBOOST classification module group is obtained.
Step 3.3.7: repeating the step 3.3.6 for 2 times to finish the training of the other 2 XGB OST classification module groups and obtain a stacking training data set X2The data in the other 2 XGBOOST classification module groups are classified into results and sent to the splicing and voting module.
Step 3.3.8: the splicing and voting module carries out the stacking training data set X obtained from the step 3.3.3 to the step 3.3.72Combining the classification results of all the classification module groups to obtain a vector matrix of P multiplied by 6, namely a stacking vector matrix; wherein P represents a stacking training data set X2The amount of data of (c).
Step 3.3.9: inputting the stacking vector matrix obtained in the step 3.3.8 into a secondary model SVM classifier of a stacking classifier, and performing training operation to obtain a trained stacking classifier.
And finishing the training of the stacking classifier through the operation of the steps to obtain a trained integrated classification model.
And step four, preprocessing the test data. The method specifically comprises the following steps:
step 4.1: and acquiring network attack data to form an original test data set. The network attack data includes numerical features and character discrete features. The character discrete type features are as follows: protocol type, service type, and connection error identification.
Step 4.2: each piece of original test data in the original test data set is converted into a numerical type original test data feature vector. The method specifically comprises the following steps:
Step 4.2.1: extract the character discrete features from each piece of data and encode each of them as a one-hot vector, one one-hot vector per character discrete feature.
Step 4.2.2: constructing a numerical characteristic vector by using the value of the numerical characteristic in each piece of data;
step 4.2.3: and merging the numerical characteristic vector in the step 4.2.2 with the one-hot vector obtained in the step 4.2.1.
Through the above operations, one numerical original test data feature vector is obtained for each piece of original test data.
The original test data set after one-hot coding is called the basic test data set, denoted by the symbol X_test; the symbol x_test,ij denotes the j-th feature of the i-th piece of data in X_test.
Step 4.3: the basic test data set X is given by the formula (5)testThe data in (1) is normalized.
Figure GDA0001883560090000081
Wherein, x'test,ijAs data xtest,ijData obtained after normalization processing; AVGjThe average value of the jth feature of all data in the basic training data set X obtained in the step 1.4 is obtained; STDjThe standard deviation of the jth feature of all data in the basic training data set X obtained in step 1.4.
After the operation of step four, the basic test data set is preprocessed to obtain the test data set, denoted by the symbol X'_test.
And step five, classifying the test data.
And inputting the test data obtained through the preprocessing in the fourth step into the integrated classification model trained in the third step for classification. The method comprises the following specific steps:
step 5.1: inputting a piece of test data obtained through the preprocessing in the step four into the GBDT classifier, and if the classification result is the DoS type, executing the operation in the step 5.2; if the classification result is of non-DoS type, the operation of step 5.3 is performed.
Step 5.2: and inputting the test data into a KNN classifier for classification to obtain and output a final classification result, and finishing the operation.
Step 5.3: and respectively inputting the test data into each RF classification module in the RF classification module group, and outputting the classification result to the splicing and voting module after classification operation. And the splicing and voting module votes the output result of the RF classification module group to determine the classification result.
Step 5.4: and respectively inputting the test data into each GBDT classification module in the GBDT classification module group, and outputting classification results to a splicing and voting module after classification operation. And the splicing and voting module votes the output result of the GBDT classification module group to determine the classification result.
Step 5.5: and respectively inputting the test data into each SVM classification module in the SVM classification module group, and outputting a classification result to a splicing and voting module after classification operation. And the splicing and voting module votes the output result of the SVM classification module group to determine the classification result.
Step 5.6: and respectively inputting the test data into each xgboost classification module in one xgboost classification module group, and outputting the classification result to a splicing and voting module after classification operation. And the splicing and voting module votes the output result of the xgboost classification module group to determine the classification result.
Step 5.7: and repeating the operation of the step 5.6 for 2 times to obtain the classification results of the other 2 xgboost classification module groups.
Step 5.8: and (5) combining the results of the steps from 5.3 to 5.7 to obtain a 1 × 6 stacking vector.
Step 5.9: and (4) inputting the 1 × 6 stacking vector obtained in the step 5.8 into a secondary model SVM classifier of the stacking classifier, performing classification operation to obtain and output a classification result of the test data, and finishing the operation.
Advantageous effects
Compared with the prior art, the network attack type identification method based on multilayer detection has the following advantages that:
the smote algorithm is adopted to carry out up-sampling on a few samples and carry out down-sampling on a plurality of samples, and the problem of unbalanced samples in a data set is solved.
Secondly, an integrated classification model is adopted, so that the detection accuracy and recall rate are improved.
And thirdly, the fruit fly optimization algorithm (FOA) is combined with the Support Vector Machine (SVM) to realize optimal, adaptive selection of the parameters C and γ in the SVM.
Drawings
Fig. 1 is an operation flowchart of a network attack type identification method based on multi-layer detection in an embodiment of the present invention.
FIG. 2 is a block diagram of an integrated classification model in accordance with an embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and an embodiment of the technical scheme.
The network attack type identification method based on multi-layer detection provided by the invention has the operation flow as shown in figure 1, and the specific operation steps are as follows.
Step one, acquiring original training data and preprocessing.
Step 1.1: and acquiring network attack data to form an original training data set. The experiment adopts a KDD99 data set, the data distribution in the original training data set is shown in Table 1, and the data distribution comprises Normal data, DoS data, PROBE data, U2L data, R2L data and five types of data. Wherein the distribution of the subtypes of DoS class data is shown in table 2. Each piece of normal data or attack data is composed of 41 features, as shown in table 3, in which the values of three discrete features, "pro col _ TYPE", "SERVICE", "FLAG" are character labels, and the values of the other features are numerical values.
TABLE 1 Data distribution of the KDD99 original training data set

Category                NORMAL   DoS      PROBE   U2L   R2L
Original training set   97278    391485   4107    52    1126

TABLE 2 DoS attack subtype data distribution in the KDD99 original training data set

Category                Back   neptune   pod   smurf    teardrop   Other
Original training set   2203   107201    264   280790   979        21
TABLE 3 The 41 feature components of the KDD99 data set (the table is reproduced as images in the original publication and is not shown here)
Step 1.2: each piece of original training data in the original training data set is converted into a numerical type original training data feature vector. The method specifically comprises the following steps:
Step 1.2.1: extract the three character discrete features "PROTOCOL_TYPE", "SERVICE" and "FLAG" from each piece of data and encode each of them as a one-hot vector, one one-hot vector per character discrete feature.
Step 1.2.2: constructing a numerical characteristic vector by using the value of the numerical characteristic in each piece of data;
step 1.2.3: and merging the numerical characteristic vector in the step 1.2.2 with all the one-hot vectors obtained in the step 1.2.1.
Through the operation of the steps, a numerical-type original training data feature vector is obtained corresponding to an original training data.
Step 1.3: the problem of unbalanced quantity of each type of data of the original training data set is solved through data down sampling and data up sampling. The method specifically comprises the following steps:
the Normal data type has a much larger number of samples and DoS types than the other types. 10000 pieces of data are randomly extracted from the Normal type data, and the number of the Normal type data is reduced. The DoS type data is composed of a plurality of subtypes, samples of smurf attack data and neptune attack data are far more than the number of other subtypes, the two subtypes of data are downsampled, the smurf samples are randomly drawn for 14000, and the neptune samples are randomly drawn for 8533.
The data amount of PROBE, U2L and R2L types in the original training data is less than that of DoS and Normal data, and the three types of data are up-sampled by adopting an SMOTE algorithm.
In the invention, the PROBE samples are up-sampled to 2 times their original number, with the neighbor count in the corresponding SMOTE algorithm set to 3; the R2L samples are up-sampled to 4 times, with the neighbor count set to 3; and the U2L samples are up-sampled to 40 times, with the neighbor count set to 10. After up-sampling there are 8214 PROBE samples, 4504 R2L samples and 2080 U2L samples; a sketch of these settings is given below.
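A sketch of these settings, assuming the imbalanced-learn library; the target counts reproduce the expansion ratios above.

```python
from imblearn.over_sampling import SMOTE

def upsample_minorities(X, y):
    # PROBE: 4107 -> 8214 (2x, k = 3); R2L: 1126 -> 4504 (4x, k = 3);
    # U2L: 52 -> 2080 (40x, k = 10), matching tables 1 and 4.
    for label, target, k in [("PROBE", 8214, 3), ("R2L", 4504, 3), ("U2L", 2080, 10)]:
        X, y = SMOTE(sampling_strategy={label: target}, k_neighbors=k).fit_resample(X, y)
    return X, y
```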
The original training data set after one-hot coding, data down-sampling and data up-sampling is called the basic training data set, denoted by the symbol X; x_ij denotes the j-th feature of the i-th piece of data in X, i ∈ [1, n], with n = 54798. The data distribution of the resulting basic training data set and of its DoS subtypes is shown in tables 4 and 5.
TABLE 4 Data distribution of the basic training data set X

Category             NORMAL   DoS     PROBE   U2L    R2L
Basic training set   10000    30000   8214    2080   4504

TABLE 5 DoS attack subtype data distribution in the basic training data set X

Category             Back   neptune   pod   smurf   teardrop   Other
Basic training set   2203   8533      264   14000   979        21
Step 1.4: the data in the basic training data set X is normalized by equation (1).
x'_ij = (x_ij − AVG_j) / STD_j    (1)

where x'_ij is the value obtained after normalizing x_ij; AVG_j is the mean of the j-th feature over all data in the basic training data set X, computed by formula (2); and STD_j is the standard deviation of the j-th feature over all data in X, computed by formula (3).

AVG_j = (1/n) × Σ_{i=1}^{n} x_ij    (2)

STD_j = sqrt( (1/n) × Σ_{i=1}^{n} (x_ij − AVG_j)² )    (3)
After the basic training data set is preprocessed through the operation of the first step, a training data set is obtained and is represented by a symbol X'.
And step two, constructing an integrated classification model.
The integrated classification model comprises a GBDT (Gradient Boosting Decision Tree) classifier, a KNN classifier and a stacking classifier.
The GBDT classifier learns with the Boosting idea by iteratively constructing Classification And Regression Trees (CART). Let f_{t−1}(x) denote the GBDT classifier obtained after iteration t−1, where t is a positive integer; let f_t(x) denote the classifier obtained after iteration t; let L(y, f_{t−1}(x)) and L(y, f_t(x)) denote the loss functions of those classifiers; and let h_t(x) denote the fitting function learned in round t. In round t of GBDT learning, the task is to find the h_t(x) for which L(y, f_t(x)) in formula (4) takes its minimum value; the minimizing h_t(x) is found by fitting the negative gradient of the loss function.

L(y, f_t(x)) = L(y, f_{t−1}(x) + h_t(x))    (4)
The KNN classifier is used for classifying DoS (Denial of Service) type data and predicting their subtypes. The parameter K of the KNN classifier is set to 3.
The stacking classifier is used for classifying non-DoS type data. It is divided into a primary classification model and a secondary classification model. The primary model has an upper layer and a lower layer; the upper layer is formed by connecting 3 xgboost classification module groups, 1 SVM classification module group, 1 GBDT classification module group and 1 RF classification module group in parallel. Each xgboost classification module group is formed by connecting m xgboost classification modules in parallel, each SVM classification module group by m SVM classification modules in parallel, each GBDT classification module group by m GBDT classification modules in parallel, and each RF classification module group by m RF classification modules in parallel. In this embodiment m is set to 5. The lower layer of the primary model is a splicing and voting module.
And the output ends of the 3 xgboost classification module groups, the 1 SVM classification module group, the 1 GBDT classification module group and the 1 RF classification module group at the upper layer of the primary model are respectively connected with the input ends of the splicing and voting modules at the lower layer of the primary model. In the training phase, the splicing and voting module is used for: and combining the output results of each xgboost classification module group, SVM classification module group, GBDT classification module group and RF classification module group of the upper layer of the primary model to obtain a vector matrix called stacking vector matrix.
The structure of the integrated classification model is shown in fig. 2.
In the testing stage, the splicing and voting module is used for: corresponding to a piece of test data, output results of each xgboost classification module group, SVM classification module group, GBDT classification module group and RF classification module group of the upper layer of the primary model are respectively voted, each classification module group obtains a classification result, and then the classification results are combined to obtain a 1 x 6 stacking feature vector.
The secondary model is an SVM classifier, and the input of the secondary model is the stacking feature vector generated by the primary model.
The SVM classifier adopts the fruit fly optimization algorithm (FOA) to optimally select the SVM kernel function parameter (denoted by the symbol γ) and the penalty parameter (denoted by the symbol C). The specific operation steps are as follows:
Step 2.1: initialize the SVM kernel parameter γ and penalty parameter C, γ ∈ [0.001, 5], C ∈ [0.001, 5]. In this embodiment γ is set to 0.01 and C to 0.5, i.e. the starting position of the fruit fly swarm is (C_begin, γ_begin) with C_begin = C = 0.5 and γ_begin = γ = 0.01.
Step 2.2: the population size (denoted by the symbol popsize), the number of iterations (denoted by the symbol epoch), and the search distance (denoted by the symbol val) of the penalty parameter C are setCRepresentation) and the search distance of the kernel parameter y (denoted by the symbol val)γRepresentation). popsize 10, epoch 5, valC=0.1,valγ=0.001。
Step 2.3: calculating the position of the pth fruit fly at the next moment according to the formulas (6) to (7), and using the symbol (C)pp) Denotes p ∈ [1, popsize >]。
Cp=Cbegin+valC×ε (6)
γp=γbegin+valγ×ε (7)
Wherein ε is a random value in the range of [ -1,1 ].
Step 2.4: if the penalty parameter C is less than 0.001 at the moment, C is 0.001; if C >5, then C ═ 5. If γ <0.001, γ is 0.001; when γ is greater than 5, γ is 5.
Step 2.5: and (4) calculating the fitness function values of the positions of all the drosophila flies obtained in the step 2.3 according to a formula (8).
Fit(Cpp)=accuracy(Cpp) (8)
Wherein, Fit (C)pp) The fitness function value of the position of the pth fruit fly is shown; accuracy (C)pp) Representing the SVM classifier at parameter (C)qq) Upper cross validation generated accuracy, Cq=Cpq=γp
Step 2.6, finding the maximum value (using the symbol Fit) in the fitness function values corresponding to the positions of all fruit flies at the current momentmaxRepresentation), and FitmaxThe corresponding position is judged to be Fit at the momentmaxIf the fitness function value is higher than the fitness function value of the initial position, Fit is usedmaxThe corresponding position replaces the initial position while saving the FitmaxThen the next iteration is performed. If it is at that time FitmaxIf the fitness function value is lower than the fitness function value of the initial position, the step 2.3 to the step 2.6 are repeatedly executed until the iteration times reach the epoch times, and the operation is finished.
The connection relation of the integrated classification model is as follows: external data enters the integrated classification model through the input end of the GBDT classifier; the output end of the GBDT classifier is respectively connected with the input ends of the KNN classifier and the stacking classifier; and the output of the KNN classifier and the stacking classifier is used as the external output of the integrated classification model.
And step three, training an integrated classification model.
And training an integrated classification model on the basis of the operation of the step one and the operation of the step two. The method specifically comprises the following steps:
step 3.1: the GBDT classifier is trained. The method specifically comprises the following steps:
Step 3.1.1: label the data in the training data set X' by category, as one of 2 classes: DoS (Denial of Service) type and other type. The data distribution is shown in table 6.
TABLE 6 Data distribution of the GBDT classifier training set

Data type   DoS     non-DoS (U2L, R2L, Normal, Probe)
Number      30000   24798
Step 3.1.2: the GBDT classifier is trained using the labeled training data set X'.
Through the operation of step 3.1, the trained GBDT classifier is obtained.
Step 3.2: and training the KNN classifier. The method specifically comprises the following steps:
Step 3.2.1: construct a DoS type data set from the data labeled as DoS type in the training data set X', denoted by the symbol X'_1.
Step 3.2.2: to DoS type data set X'1The data in (1) is marked for fine classification. The DoS type dataset, in symbol X'1The data in (1) are subdivided into: smurf attacks, neptune attacks, back attacks, teardrop attacks, pod attacks, and Other.
Step 3.2.3: to DoS type data set X'1Performing data down-sampling processing according to the subdivision type; the number of types of data of smurf and neptune is far more than that of other types of data, 5000 pieces of smurf data and 4000 pieces of neptune data are randomly extracted. The data set after data down-sampling, called KNN training data set, is represented by symbol X1And (4) showing. The data distribution is shown in table 7.
TABLE 7 Training data distribution of the KNN classifier

DoS subtype   Back   neptune   pod   smurf   teardrop   Other
Number        2203   4000      264   5000    979        21
Step 3.2.4: training dataset X using KNN1And training the KNN classifier.
Through the operation of step 3.2, the trained KNN classifier is obtained.
Step 3.3: training a stacking classifier. The method specifically comprises the following steps:
Step 3.3.1: construct a stacking training data set from the data labeled as other types in the training data set X', denoted by the symbol X_2, and then label its data with fine-grained classes. The data in X_2 are subdivided into: Normal, PROBE, U2L and R2L. The data distribution is shown in table 8.
TABLE 8 Training data distribution of the stacking model

Data type   NORMAL   PROBE   U2L    R2L
Number      10000    8214    2080   4504
Step 3.3.2: training data set X2Is divided evenly into 5 subsets, referred to as the 1 st subset, the 2 nd subset, … …, and the 5 th subset, respectively. The number of data per subset is denoted by the symbol M, which is a positive integer.
Step 3.3.3: the set of RF classification modules is trained. The method specifically comprises the following steps:
step 3.3.3.1: the temporary variable is denoted by the symbol t, t ∈ [1,5 ]. The initial value of t is set to 1.
Step 3.3.3.2: training data set X2As verification data, t e [1,5]]. Then, using stacking training data set X2As training data, to train an untrained RF classification module of the set of RF classification modules.
Step 3.3.3.3: and inputting the data of the t-th subset into the RF classification module trained in step 3.3.3.2 for classification, so as to obtain an mx 1 vector matrix.
Step 3.3.3.4: if t <5, the value of t is incremented by 1 and steps 3.3.3.2 through 3.3.3.4 are repeated. Otherwise, the operation of step 3.3.3.5 is performed.
Step 3.3.3.5: and merging the classification results of the 1 st subset to the 5 th subset obtained in the step 3.3.3.2 to obtain a classification result of the data of the stacking training data set in the RF classification module group, and sending the classification result to the splicing and voting module.
Through the operations of steps 3.3.3.1 to 3.3.3.5, training of the RF classification module group is completed, and the classification result of the stacking training data set X_2 in the RF classification module group is obtained.
Step 3.3.4: training the SVM classification module group. The method specifically comprises the following steps:
step 3.3.4.1: the temporary variable is denoted by the symbol t, t ∈ [1,5 ]. The initial value of t is set to 1.
Step 3.3.4.2: training data set X2As verification data, t e [1,5]]. Then, using stacking training data set X2The other data is used as training data to train an untrained SVM classification module in the SVM classification module group.
Step 3.3.4.3: and inputting the data of the t-th subset into the SVM classification module trained in the step 3.3.4.2 for classification, so as to obtain an M × 1 vector matrix.
Step 3.3.4.4: if t <5, the value of t is incremented by 1 and steps 3.3.4.2 through 3.3.4.4 are repeated. Otherwise, the operation of step 3.3.4.5 is performed.
Step 3.3.4.5: merging the classification results of the 1 st subset to the 5 th subset obtained in the step 3.3.4.2 to obtain a stacking training data set X2The classification result of the data in the SVM classification module group is sent to the splicing and voting module.
Through the operations from step 3.3.4.1 to step 3.3.4.5, training of the SVM classification module group is completed, and a classification result of data of a stacking training data set in the SVM classification module group is obtained.
Step 3.3.5: and training the GBDT classification module group. The method specifically comprises the following steps:
step 3.3.5.1: the temporary variable is denoted by the symbol t, t ∈ [1,5 ]. The initial value of t is set to 1.
Step 3.3.5.2: training data set X2As verification data, t e [1,5]]. Then, using stacking training data set X2As training data, training an untrained GBDT classification module in the GBDT classification module group.
Step 3.3.5.3: and inputting the data of the t-th subset into the GBDT classification module trained in the step 3.3.5.2 for classification, so as to obtain an M × 1 vector matrix.
Step 3.3.5.4: if t <5, the value of t is incremented by 1 and steps 3.3.5.2 through 3.3.5.4 are repeated. Otherwise, the operation of step 3.3.5.5 is performed.
Step 3.3.5.5: merging the classification results of the 1 st subset to the 5 th subset obtained in the step 3.3.5.2 to obtain a stacking training data set X2And (4) the classification result of the data in the GBDT classification module group is sent to the splicing and voting module.
Through the operations from step 3.3.5.1 to step 3.3.5.5, the training of the GBDT classification module group is completed, and the classification result of the data of the stacking training data set in the GBDT classification module group is obtained.
Step 3.3.6: and training an XGBOOST classification module group. The method specifically comprises the following steps:
step 3.3.6.1: the temporary variable is denoted by the symbol t, t ∈ [1,5 ]. The initial value of t is set to 1.
Step 3.3.6.2: training data set X2As verification data, t e [1,5]]. Then, using stacking training data set X2The other data of XGBOOST classification module group is used as training data to train an untrained XGBOOST classification module in the XGBOOST classification module group.
Step 3.3.6.3: and inputting the data of the t-th subset into the XGBOOST classification module trained in the step 3.3.6.2 for classification to obtain an Mx 1 vector matrix.
Step 3.3.6.4: if t <5, the value of t is incremented by 1 and steps 3.3.6.2 through 3.3.6.4 are repeated. Otherwise, the operation of step 3.3.6.5 is performed.
Step 3.3.6.5: and merging the classification results of the 1 st subset to the 5 th subset obtained in the step 3.3.6.2 to obtain the classification result of the data of the stacking training data set in the XGBOOST classification module group, and sending the classification result to the splicing and voting module.
Through the operations of steps 3.3.6.1 to 3.3.6.5, training of the XGBOOST classification module group is completed, and the classification result of the stacking training data set X_2 in the XGBOOST classification module group is obtained.
Step 3.3.7: repeating the step 3.3.6 for 2 times to finish the training of the other 2 XGB OST classification module groups and obtain a stacking training data set X2The data in the other 2 XGBOOST classification module groups are classified into results and sent to the splicing and voting module.
Step 3.3.8: the splicing and voting module carries out the stacking training data set X obtained from the step 3.3.3 to the step 3.3.72Combining the classification results of all the classification module groups to obtain a vector matrix of P multiplied by 6, namely a stacking vector matrix; wherein P represents a stacking training data set X2The amount of data of (c).
Step 3.3.9: inputting the stacking vector matrix obtained in the step 3.3.8 into a secondary model SVM classifier of a stacking classifier, and performing training operation to obtain a trained stacking classifier.
And finishing the training of the stacking classifier through the operation of the steps to obtain a trained integrated classification model.
And step four, preprocessing the test data.
Step 4.1: and acquiring network attack data to form an original test data set. As described in step 1.1, the experiment used a KDD99 dataset with 41 signature components per test data, as shown in table 3. The values of three discrete features "PROTOCOL _ TYPE", "SERVICE" and "FLAG" are character labels, and the values of the other features are numerical values. The data distribution and DoS subtype distribution in the original test data set are shown in tables 9 and 10.
TABLE 9 Data distribution of the original test data set

Type     NORMAL   DoS      PROBE   U2L   R2L
Number   60593    229853   4166    228   16189

TABLE 10 DoS attack subtype data distribution of the original test data set

Type     Back   neptune   pod   smurf    teardrop   Other
Number   1098   58001     87    164091   12         6564
Step 4.2: each piece of original test data in the original test data set is converted into a numerical type original test data feature vector. The method specifically comprises the following steps:
Step 4.2.1: extract the character discrete features from each piece of data and encode each of them as a one-hot vector, one one-hot vector per character discrete feature.
Step 4.2.2: constructing a numerical characteristic vector by using the value of the numerical characteristic in each piece of data;
step 4.2.3: and merging the numerical characteristic vector in the step 4.2.2 with the one-hot vector obtained in the step 4.2.1.
Through the above operations, one numerical original test data feature vector is obtained for each piece of original test data.
The original test data set after one-hot coding is called the basic test data set, denoted by the symbol X_test; the symbol x_test,ij denotes the j-th feature of the i-th piece of data in X_test.
Step 4.3: the basic test data set X is given by the formula (5)testThe data in (1) is normalized.
Figure GDA0001883560090000211
Wherein, x'test,ijAs data xtest,ijData obtained after normalization processing; AVGjThe average value of the jth feature of all data in the basic training data set X obtained in the step 1.4 is obtained; STDjThe standard deviation of the jth feature of all data in the basic training data set X obtained in step 1.4.
After the operation of step four, the basic test data set is preprocessed to obtain the test data set, denoted by the symbol X'_test.
And step five, classifying the test data.
And inputting the test data obtained through the preprocessing in the fourth step into the integrated classification model trained in the third step for classification. The method comprises the following specific steps:
Step 5.1: input each piece of test data preprocessed in step four into the GBDT classifier; if the classification result is the DoS type, perform the operation of step 5.2; if it is a non-DoS type, perform the operation of step 5.3.
Step 5.2: input the test data into the KNN classifier for classification, obtain and output the final classification result, and end the operation.
Step 5.3: input the test data into each RF classification module in the RF classification module group; after the classification operation, output the results to the splicing and voting module, which votes on the outputs of the RF classification module group to determine its classification result.
Step 5.4: input the test data into each GBDT classification module in the GBDT classification module group; after the classification operation, output the results to the splicing and voting module, which votes on the outputs of the GBDT classification module group to determine its classification result.
Step 5.5: input the test data into each SVM classification module in the SVM classification module group; after the classification operation, output the results to the splicing and voting module, which votes on the outputs of the SVM classification module group to determine its classification result.
Step 5.6: input the test data into each xgboost classification module in one xgboost classification module group; after the classification operation, output the results to the splicing and voting module, which votes on the outputs of the xgboost classification module group to determine its classification result.
Step 5.7: repeat the operation of step 5.6 twice to obtain the classification results of the other 2 xgboost classification module groups.
Step 5.8: merge the results of steps 5.3 to 5.7 into a 1 × 6 stacking vector.
Step 5.9: input the 1 × 6 stacking vector of step 5.8 into the secondary-model SVM classifier of the stacking classifier; the classification operation yields and outputs the classification result of the test data.
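The decision flow of steps 5.1-5.9 can be summarized in code. This is a sketch, not the authors' implementation: it assumes fitted scikit-learn-style components with integer-encoded class labels, and the names gbdt_gate, knn_dos, module_groups and secondary_svm are illustrative.

```python
import numpy as np
from collections import Counter

DOS_LABEL = 1  # hypothetical integer code for the DoS class

def classify_one(x, gbdt_gate, knn_dos, module_groups, secondary_svm):
    """Classify one preprocessed test vector x of shape (1, n_features).

    module_groups: the six groups of steps 5.3-5.7 (3 xgboost, 1 SVM,
    1 GBDT, 1 RF), each a list of m fitted classifiers.
    """
    if gbdt_gate.predict(x)[0] == DOS_LABEL:          # step 5.1
        return knn_dos.predict(x)[0]                  # step 5.2: DoS subtype
    stack_vec = []
    for group in module_groups:                       # steps 5.3-5.7
        votes = [clf.predict(x)[0] for clf in group]
        stack_vec.append(Counter(votes).most_common(1)[0][0])  # group vote
    stack_vec = np.asarray(stack_vec).reshape(1, -1)  # step 5.8: 1 x 6 vector
    return secondary_svm.predict(stack_vec)[0]        # step 5.9
```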
Finally, the prediction results are evaluated. The predictions on the test set obtained in step five are considered on two indexes, accuracy and recall. The results are shown in Tables 11 and 12.
TABLE 11  Prediction results for NORMAL, DoS, PROBE, U2L and R2L data in the test set

Type       NORMAL   DoS      PROBE    U2L      R2L
Accuracy   75.67%   99.89%   83.59%   7.54%    84.63%
Recall     99.23%   97.41%   93.11%   23.24%   10.91%
TABLE 12  Prediction results for the DoS attack subtype data

Type       smurf    neptune   back     pod      teardrop
Accuracy   99.99%   99.40%    68.49%   51.50%   29.27%
Recall     99.98%   99.85%    100%     99.85%   100%
Table 11 shows the accuracy and recall of this method's classification of the five data types NORMAL, PROBE, DoS, U2L and R2L. Table 12 shows the classification accuracy and recall of this method on the DoS attack subtype data. The experimental results show that, on a data set whose classes are extremely unbalanced and whose data distributions are inconsistent, the method achieves good accuracy and recall on the test set.
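For reproduction, the per-class figures of Tables 11 and 12 can be computed with scikit-learn; the labels below are toy stand-ins, and reading the tables' per-class accuracy as what scikit-learn reports as precision is an interpretation, not something stated in the original:

```python
from sklearn.metrics import classification_report

# Toy stand-ins: in practice y_true are the test-set labels and
# y_pred the outputs of step five.
y_true = ["NORMAL", "DoS", "DoS", "PROBE", "R2L", "NORMAL"]
y_pred = ["NORMAL", "DoS", "PROBE", "PROBE", "NORMAL", "NORMAL"]

# Per-class precision and recall for every class label.
print(classification_report(y_true, y_pred, zero_division=0))
```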

Claims (2)

1. A network attack type identification method based on multilayer detection is characterized in that: the specific operation steps are as follows:
step one, acquiring original training data and preprocessing the original training data;
step 1.1: acquiring network attack data to form an original training data set; the network attack data comprises numerical characteristics and character discrete characteristics; the character discrete type features include: protocol type, service type and connection error identification;
step 1.2: converting each piece of original training data in an original training data set into a numerical original training data feature vector; the method specifically comprises the following steps:
step 1.2.1: extracting character discrete type features from each piece of data, and respectively coding the character discrete type features in one-hot vector form, wherein one character discrete type feature corresponds to one one-hot vector;
step 1.2.2: constructing a numerical characteristic vector by using the value of the numerical characteristic in each piece of data;
step 1.2.3: merging the numerical characteristic vector in the step 1.2.2 with all the one-hot vectors obtained in the step 1.2.1;
obtaining a numerical type original training data feature vector corresponding to an original training data through the operation of the step;
step 1.3: the problem of unbalanced quantity of various types of data of an original training data set is solved through data down-sampling and data up-sampling;
an original training data set after one-hot coding, data down-sampling and data up-sampling is called a basic training data set and is represented by the symbol X; the symbol x_ij denotes the jth feature of the ith piece of data of the basic training data set X, i ∈ [1, n], where n is the number of pieces of data in the basic training data set X;
step 1.4: standardizing the data in the basic training data set X by formula (1);

x'_ij = (x_ij − AVG_j) / STD_j   (1)

wherein x'_ij is the data obtained after standardizing x_ij; AVG_j is the mean of the jth feature over all data in the basic training data set X, computed by formula (2); STD_j is the standard deviation of the jth feature over all data in the basic training data set X, computed by formula (3);

AVG_j = (1/n) Σ_{i=1}^{n} x_ij   (2)

STD_j = sqrt( (1/n) Σ_{i=1}^{n} (x_ij − AVG_j)² )   (3)
after the basic training data set is preprocessed through the operation of the first step, a training data set is obtained and is represented by a symbol X';
step two, constructing an integrated classification model;
the integrated classification model comprises a GBDT classifier, a KNN classifier and a stacking classifier;
the GBDT classifier learns by iteratively constructing classification-and-regression trees (CART) following the Boosting idea; the symbol f_{t-1}(x) denotes the GBDT classifier obtained in the (t-1)th iteration, where t is a positive integer; the symbol f_t(x) denotes the GBDT classifier obtained in the t-th iteration; the symbol L(y, f_{t-1}(x)) denotes the loss function of the GBDT classifier obtained in the (t-1)th iteration; the symbol L(y, f_t(x)) denotes the loss function of the GBDT classifier obtained in the t-th iteration; the symbol h_t(x) denotes the fitting function learned in the t-th round; in the learning process of the GBDT classifier, the t-th iteration finds the h_t(x) for which L(y, f_t(x)) in formula (4) takes its minimum value; the minimizing h_t(x) is found by fitting the negative gradient of the loss function;

L(y, f_t(x)) = L(y, f_{t-1}(x) + h_t(x))   (4)
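As a concrete illustration of formula (4): with squared-error loss the negative gradient at f_{t-1}(x) is simply the residual y − f_{t-1}(x), so each round fits h_t(x) to the current residuals. A toy sketch with scikit-learn regression trees (the regression form is chosen because the idea is easiest to see there; all values are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

f = np.full_like(y, y.mean())   # f_0(x): constant initial model
for t in range(50):             # t-th boosting round
    residual = y - f            # negative gradient of squared loss at f_{t-1}
    h = DecisionTreeRegressor(max_depth=2).fit(X, residual)  # fit h_t(x)
    f += h.predict(X)           # f_t(x) = f_{t-1}(x) + h_t(x), as in formula (4)
```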
the KNN classifier is used for classifying the DoS type data and predicting the subtype of the DoS type data; setting a parameter K of the KNN classifier to be 3;
the stacking classifier is used for classifying non-DoS type data; the stacking classifier is divided into a primary classification model and a secondary classification model; the primary classification model has an upper layer and a lower layer, the upper layer being formed by connecting 3 xgboost classification module groups, 1 SVM classification module group, 1 GBDT classification module group and 1 RF classification module group in parallel; each xgboost classification module group is formed by connecting m xgboost classification modules in parallel, each SVM classification module group by connecting m SVM classification modules in parallel, each GBDT classification module group by connecting m GBDT classification modules in parallel, and each RF classification module group by connecting m RF classification modules in parallel; m is a manually set value, m ∈ [3, 8]; the lower layer of the primary classification model is the splicing and voting module;
the output ends of the 3 xgboost classification module groups, the 1 SVM classification module group, the 1 GBDT classification module group and the 1 RF classification module group at the upper layer of the primary classification model are respectively connected with the input ends of the splicing and voting modules at the lower layer of the primary classification model; in the training phase, the splicing and voting module is used for: combining output results of each xgboost classification module group, SVM classification module group, GBDT classification module group and RF classification module group at the upper layer of the primary classification model to obtain a vector matrix called stacking vector matrix; in the testing stage, the splicing and voting module is used for: corresponding to a piece of test data, voting output results of each xgboost classification module group, SVM classification module group, GBDT classification module group and RF classification module group at the upper layer of the primary classification model respectively, obtaining a classification result by each classification module group, and then combining the classification results to obtain a 1 x 6 stacking feature vector;
the secondary classification model is an SVM classifier, optimized with the FOA (fruit fly optimization) algorithm; its input is the stacking feature vectors generated by the primary classification model; the specific method is as follows:
step 2.1: initialize the SVM kernel parameter γ and the penalty parameter C, γ ∈ [0.001, 5], C ∈ [0.001, 5]; set the starting position of the fruit fly to (C_begin, γ_begin), where C_begin = C and γ_begin = γ;
step 2.2: set the population size popsize, the number of iterations epoch, the search distance val_C of the penalty parameter C, and the search distance val_γ of the kernel parameter γ; where popsize ∈ [8, 15], epoch ≥ 5, val_C ∈ [0.05, 0.5], val_γ ∈ [0.001, 0.01];
step 2.3: compute the position of the pth fruit fly at the next moment by formulas (6)-(7), denoted (C_p, γ_p), p ∈ [1, popsize];

C_p = C_begin + val_C × ε   (6)
γ_p = γ_begin + val_γ × ε   (7)

wherein ε is a random value in the range [-1, 1];
step 2.4: if the penalty parameter C < 0.001, set C = 0.001; if C > 5, set C = 5; if γ < 0.001, set γ = 0.001; if γ > 5, set γ = 5;
step 2.5: compute the fitness function value of each fruit fly position obtained in step 2.3 by formula (8);

Fit(C_p, γ_p) = accuracy(C_p, γ_p)   (8)

wherein Fit(C_p, γ_p) is the fitness function value of the position of the pth fruit fly; accuracy(C_p, γ_p) is the accuracy produced by cross-validation of the SVM classifier with parameters (C_q, γ_q), where C_q = C_p and γ_q = γ_p;
step 2.6: find the maximum Fit_max among the fitness function values of all fruit fly positions at the current moment, and the position corresponding to Fit_max; if Fit_max is higher than the fitness function value of the initial position, replace the initial position with the position corresponding to Fit_max, save Fit_max, and proceed to the next iteration; if Fit_max is lower than the fitness function value of the initial position, repeat steps 2.3 to 2.6 until the number of iterations reaches epoch, then end the operation;
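A minimal sketch of this FOA search, assuming scikit-learn's SVC with 3-fold cross-validation accuracy as the fitness of formula (8); X_stack and y_stack stand for the stacking vectors and their labels, and the defaults are illustrative values within the ranges of step 2.2:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def foa_svm(X_stack, y_stack, popsize=10, epochs=5,
            val_C=0.1, val_gamma=0.005, C0=1.0, gamma0=0.1):
    """Fruit fly search over (C, gamma) following steps 2.1-2.6."""
    rng = np.random.default_rng(0)
    best_C, best_gamma, best_fit = C0, gamma0, -np.inf   # fly start position
    for _ in range(epochs):
        for _ in range(popsize):
            eps = rng.uniform(-1.0, 1.0)   # epsilon shared by formulas (6)-(7)
            C = float(np.clip(best_C + val_C * eps, 0.001, 5))       # step 2.4
            g = float(np.clip(best_gamma + val_gamma * eps, 0.001, 5))
            fit = cross_val_score(SVC(C=C, gamma=g),   # formula (8): CV accuracy
                                  X_stack, y_stack, cv=3).mean()
            if fit > best_fit:             # step 2.6: move to the best fly found
                best_fit, best_C, best_gamma = fit, C, g
    return SVC(C=best_C, gamma=best_gamma).fit(X_stack, y_stack)
```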
the connection relation of the integrated classification model is as follows: external data enters the integrated classification model through the input end of the GBDT classifier; the output end of the GBDT classifier is respectively connected with the input ends of the KNN classifier and the stacking classifier; the output of the KNN classifier and the stacking classifier is used as the external output of the integrated classification model;
step three, training an integrated classification model;
training an integrated classification model on the basis of the operation of the first step and the operation of the second step; the method specifically comprises the following steps:
step 3.1: training a GBDT classifier; the method specifically comprises the following steps:
step 3.1.1: put category labels on the data in the training data set X'; the data in the training data set X' are marked into 2 classes, the DoS type and the Other type;
step 3.1.2: training a GBDT classifier by using the marked training data set X';
obtaining a trained GBDT classifier through the operation of the step 3.1;
step 3.2: training a KNN classifier; the method specifically comprises the following steps:
step 3.2.1: construct a DoS type data set from the data marked as the DoS type in the training data set X', denoted by the symbol X'_1;
step 3.2.2: put fine-grained labels on the data in the DoS type data set X'_1; the data in the DoS type data set X'_1 are subdivided into: smurf attacks, neptune attacks, back attacks, teardrop attacks, pod attacks, and Other attacks;
step 3.2.3: apply data down-sampling to the DoS type data set X'_1 by subdivided type, to resolve the imbalance in the quantity of data of each subdivided type in the DoS type data set X'_1; the data set after down-sampling is called the KNN training data set, denoted by the symbol X_1;
step 3.2.4: train the KNN classifier with the KNN training data set X_1;
obtaining a trained KNN classifier through the operation of the step 3.2;
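Step 3.2 reduces to a few lines with scikit-learn; X1 and y1 below are toy stand-ins for the KNN training data set X_1 and its fine-grained subtype labels:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X1 = rng.random((60, 10))  # stand-in for the down-sampled DoS data set X_1
y1 = rng.choice(["smurf", "neptune", "back", "teardrop", "pod", "Other"], 60)

knn_dos = KNeighborsClassifier(n_neighbors=3).fit(X1, y1)  # K = 3, per step two
```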
step 3.3: training a stacking classifier; the method specifically comprises the following steps:
step 3.3.1: construct a stacking training data set from the data marked as the Other type in the training data set X', denoted by the symbol X_2; then put fine-grained labels on the data: the data in the stacking training data set X_2 are subdivided into: Normal, Probe, U2L, R2L;
step 3.3.2: evenly divide the data of the stacking training data set X_2 into m subsets, called the 1st subset, the 2nd subset, ..., the m-th subset; the quantity of data in each subset is denoted by the symbol M, M a positive integer;
step 3.3.3: training a group of RF classification modules; the method specifically comprises the following steps:
step 3.3.3.1: the temporary variable is represented by the symbol h, h ∈ [1, m ]; setting the initial value of h to 1;
step 3.3.3.2: take the h-th subset of the stacking training data set X_2 as validation data; then, using the remaining data of the stacking training data set X_2 as training data, train an untrained RF classification module in the RF classification module group;
step 3.3.3.3: inputting the data of the h-th subset into the RF classification module trained in the step 3.3.3.2 for classification to obtain an M x 1 vector matrix;
step 3.3.3.4: if h < m, increasing the value of h by 1, and repeating the steps 3.3.3.2 to 3.3.3.4; otherwise, the operation of step 3.3.3.5 is performed;
step 3.3.3.5: merge the classification results of the 1st through m-th subsets obtained in step 3.3.3.3 to obtain the classification result of the stacking training data set's data in the RF classification module group, and send it to the splicing and voting module;
through the operations of steps 3.3.3.1 to 3.3.3.5, the training of the RF classification module group is completed and the classification result of the data of the stacking training data set X_2 in the RF classification module group is obtained (a code sketch of this out-of-fold procedure is given after step three);
step 3.3.4: training an SVM classification module group; the method specifically comprises the following steps:
step 3.3.4.1: the temporary variable is represented by the symbol h, h ∈ [1, m ]; setting the initial value of h to 1;
step 3.3.4.2: take the h-th subset of the stacking training data set X_2 as validation data; then, using the remaining data of the stacking training data set X_2 as training data, train an untrained SVM classification module in the SVM classification module group;
step 3.3.4.3: input the data of the h-th subset into the SVM classification module trained in step 3.3.4.2 for classification, obtaining an M × 1 vector matrix;
step 3.3.4.4: if h < m, increasing the value of h by 1, and repeating the steps 3.3.4.2 to 3.3.4.4; otherwise, the operation of step 3.3.4.5 is performed;
step 3.3.4.5: merge the classification results of the 1st through m-th subsets obtained in step 3.3.4.3 to obtain the classification result of the stacking training data set X_2 in the SVM classification module group, and send it to the splicing and voting module;
completing training of the SVM classification module group through operations from step 3.3.4.1 to step 3.3.4.5, and obtaining a classification result of data of a stacking training data set in the SVM classification module group;
step 3.3.5: training a GBDT classification module group; the method specifically comprises the following steps:
step 3.3.5.1: the temporary variable is represented by the symbol h, h ∈ [1, m ]; setting the initial value of h to 1;
step 3.3.5.2: take the h-th subset of the stacking training data set X_2 as validation data; then, using the remaining data of the stacking training data set X_2 as training data, train an untrained GBDT classification module in the GBDT classification module group;
step 3.3.5.3: input the data of the h-th subset into the GBDT classification module trained in step 3.3.5.2 for classification, obtaining an M × 1 vector matrix;
step 3.3.5.4: if h < m, increasing the value of h by 1, and repeating the steps 3.3.5.2 to 3.3.5.4; otherwise, executing the operation of step 3.3.5.5;
step 3.3.5.5: merge the classification results of the 1st through m-th subsets obtained in step 3.3.5.3 to obtain the classification result of the stacking training data set X_2 in the GBDT classification module group, and send it to the splicing and voting module;
completing the training of the GBDT classification module group through the operations from step 3.3.5.1 to step 3.3.5.5, and obtaining the classification result of the data of the stacking training data set in the GBDT classification module group;
step 3.3.6: training an XGBOOST classification module group; the method specifically comprises the following steps:
step 3.3.6.1: the temporary variable is represented by the symbol h, h ∈ [1, m ]; setting the initial value of h to 1;
step 3.3.6.2: take the h-th subset of the stacking training data set X_2 as validation data; then, using the remaining data of the stacking training data set X_2 as training data, train an untrained XGBOOST classification module in the XGBOOST classification module group;
step 3.3.6.3: input the data of the h-th subset into the XGBOOST classification module trained in step 3.3.6.2 for classification, obtaining an M × 1 vector matrix;
step 3.3.6.4: if h < m, increasing the value of h by 1, and repeating the steps 3.3.6.2 to 3.3.6.4; otherwise, the operation of step 3.3.6.5 is performed;
step 3.3.6.5: merge the classification results of the 1st through m-th subsets obtained in step 3.3.6.3 to obtain the classification result of the stacking training data set's data in the XGBOOST classification module group, and send it to the splicing and voting module;
through the operations of steps 3.3.6.1 to 3.3.6.5, the training of the XGBOOST classification module group is completed and the classification result of the data of the stacking training data set X_2 in the XGBOOST classification module group is obtained;
step 3.3.7: repeat step 3.3.6 twice to complete the training of the other 2 XGBOOST classification module groups, obtain the classification results of the stacking training data set X_2 in the other 2 XGBOOST classification module groups, and send them to the splicing and voting module;
step 3.3.8: the splicing and voting module merges the classification results of the stacking training data set X_2 in all the classification module groups, obtained in steps 3.3.3 to 3.3.7, into a P × 6 vector matrix, i.e. the stacking vector matrix; wherein P denotes the number of pieces of data in the stacking training data set X_2;
step 3.3.9: inputting the stacking vector matrix obtained in the step 3.3.8 into a secondary classification model SVM classifier of a stacking classifier, and performing training operation to obtain a trained stacking classifier;
completing training of a stacking classifier through the operation of the steps to obtain a trained integrated classification model;
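The per-group procedure of steps 3.3.3 to 3.3.9 is out-of-fold stacking: each module group contributes one column of held-out predictions, the six columns form the P × 6 stacking vector matrix, and that matrix trains the secondary SVM. A compact sketch, assuming scikit-learn, the xgboost package, and labels integer-encoded as 0-3 for Normal/Probe/U2L/R2L:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from xgboost import XGBClassifier

def group_oof_column(make_clf, X2, y2, m=5):
    """Steps 3.3.x.1-3.3.x.5: train m modules, each predicting its
    held-out subset; merge the m M-by-1 results into one column."""
    col = np.empty(len(X2), dtype=int)
    for train_idx, hold_idx in KFold(n_splits=m).split(X2):
        clf = make_clf().fit(X2[train_idx], y2[train_idx])
        col[hold_idx] = clf.predict(X2[hold_idx])
    return col

def train_stacking(X2, y2, m=5):
    """Six groups (3 xgboost + SVM + GBDT + RF) -> P x 6 matrix -> SVM."""
    makers = [lambda s=s: XGBClassifier(random_state=s) for s in range(3)]
    makers += [lambda: SVC(), lambda: GradientBoostingClassifier(),
               lambda: RandomForestClassifier()]
    stack = np.column_stack([group_oof_column(mk, X2, y2, m) for mk in makers])
    return SVC().fit(stack, y2)  # secondary classification model of step 3.3.9
```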
step four, preprocessing the test data; the method specifically comprises the following steps:
step 4.1: acquiring network attack data to form an original test data set; the network attack data comprises numerical characteristics and character discrete characteristics; the character discrete type features are as follows: protocol type, service type and connection error identification;
step 4.2: converting each piece of original test data in the original test data set into a numerical original test data feature vector; the method specifically comprises the following steps:
step 4.2.1: extracting character discrete type features from each piece of data, and respectively coding the character discrete type features in one-hot vector form, wherein one character discrete type feature corresponds to one one-hot vector;
step 4.2.2: constructing a numerical characteristic vector by using the value of the numerical characteristic in each piece of data;
step 4.2.3: merging the numerical characteristic vector in the step 4.2.2 with the one-hot vector obtained in the step 4.2.1;
obtaining a numerical value type original test data characteristic vector corresponding to an original test data through the operation of the step;
the original test data set after one-hot coding is called the basic test data set, denoted by the symbol X_test; the symbol x_test,ij denotes the jth feature of the ith piece of data of the basic test data set X_test;
step 4.3: standardizing the data in the basic test data set X_test by formula (5);

x'_test,ij = (x_test,ij − AVG_j) / STD_j   (5)

wherein x'_test,ij is the data obtained after standardizing x_test,ij; AVG_j is the mean of the jth feature of all data in the basic training data set X obtained in step 1.4; STD_j is the standard deviation of the jth feature of all data in the basic training data set X obtained in step 1.4;
after the operations of step 4, the basic test data set has been preprocessed into the test data set, denoted by the symbol X'_test;
step five, classifying the test data;
inputting the test data obtained through the pretreatment in the fourth step into the integrated classification model trained in the third step for classification; the method comprises the following specific steps:
step 5.1: inputting a piece of test data obtained through the preprocessing in the step four into the GBDT classifier, and if the classification result is the DoS type, executing the operation in the step 5.2; if the classification result is the non-DoS type, executing the operation of the step 5.3;
step 5.2: inputting the test data into a KNN classifier for classification to obtain and output a final classification result, and finishing the operation;
step 5.3: respectively inputting the test data into each RF classification module in the RF classification module group, and outputting classification results to a splicing and voting module after classification operation; the splicing and voting module votes the output result of the RF classification module group to determine a classification result;
step 5.4: respectively inputting the test data into each GBDT classification module in the GBDT classification module group, and outputting classification results to a splicing and voting module after classification operation; the output result of the GBDT classification module group is voted by the splicing and voting module to determine a classification result;
step 5.5: respectively inputting the test data into each SVM classification module in the SVM classification module group, and outputting classification results to a splicing and voting module after classification operation; the splicing and voting module votes the output result of the SVM classification module group to determine a classification result;
step 5.6: respectively inputting the test data into each xgboost classification module in an xgboost classification module group, and outputting a classification result to a splicing and voting module after classification operation; the splicing and voting module votes the output result of the xgboost classification module group to determine a classification result;
step 5.7: repeating the operation of the step 5.6 for 2 times to obtain the classification results of the other 2 xgboost classification module groups;
step 5.8: combining the results of the steps 5.3 to 5.7 to obtain a 1 × 6 stacking vector;
step 5.9: inputting the 1 × 6 stacking vector obtained in the step 5.8 into a secondary classification model SVM classifier of the stacking classifier, performing classification operation to obtain and output a classification result of the test data, and ending the operation.
2. The network attack type identification method based on multi-layer detection as claimed in claim 1, characterized in that: in step 1.3, the problem of unbalanced quantity of each type of data in the original training data set is solved through data down-sampling and data up-sampling, specifically:
case 1: if the quantity of data of a certain type A in the original training data set is far greater than that of the other types, reduce the quantity of type A data by data down-sampling, specifically: randomly extract a part of the data from the type A data, so as to reduce the type A data;
case 2: if the number of a certain type B in the original training data set is far lower than that of other types of data, increasing the number of the type B data by adopting a data up-sampling method;
the data up-sampling algorithm is a SMOTE algorithm.
CN201811146113.7A 2018-06-15 2018-09-29 Network attack type identification method based on multi-layer detection Active CN109299741B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2018106201892 2018-06-15
CN201810620189 2018-06-15

Publications (2)

Publication Number Publication Date
CN109299741A CN109299741A (en) 2019-02-01
CN109299741B true CN109299741B (en) 2022-03-04

Family

ID=65165024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811146113.7A Active CN109299741B (en) 2018-06-15 2018-09-29 Network attack type identification method based on multi-layer detection

Country Status (1)

Country Link
CN (1) CN109299741B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109903840B (en) * 2019-02-28 2021-05-11 数坤(北京)网络科技有限公司 Model integration method and device
CN110213222B (en) * 2019-03-08 2021-12-10 东华大学 Network intrusion detection method based on machine learning
CN109994216A (en) * 2019-03-21 2019-07-09 上海市第六人民医院 A kind of ICD intelligent diagnostics coding method based on machine learning
CN110162558B (en) * 2019-04-01 2023-06-23 创新先进技术有限公司 Structured data processing method and device
CN110802601B (en) * 2019-11-29 2021-02-26 北京理工大学 Robot path planning method based on fruit fly optimization algorithm
CN111431849B (en) * 2020-02-18 2021-04-16 北京邮电大学 Network intrusion detection method and device
CN111680742A (en) * 2020-06-04 2020-09-18 甘肃电力科学研究院 Attack data labeling method applied to new energy plant station network security field
CN113408617A (en) * 2021-06-18 2021-09-17 湘潭大学 XGboost and Stacking model fusion-based non-invasive load identification method
CN113625319B (en) * 2021-06-22 2023-12-05 北京邮电大学 Non-line-of-sight signal detection method and device based on ensemble learning
CN113922985B (en) * 2021-09-03 2023-10-31 西南科技大学 Network intrusion detection method and system based on ensemble learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10733530B2 (en) * 2016-12-08 2020-08-04 Resurgo, Llc Machine learning model evaluation in cyber defense

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598813A (en) * 2014-12-09 2015-05-06 西安电子科技大学 Computer intrusion detection method based on integrated study and semi-supervised SVM
CN106973038A (en) * 2017-02-27 2017-07-21 同济大学 Network inbreak detection method based on genetic algorithm over-sampling SVMs
CN106899440A (en) * 2017-03-15 2017-06-27 苏州大学 A kind of network inbreak detection method and system towards cloud computing
CN108154178A (en) * 2017-12-25 2018-06-12 北京工业大学 Semi-supervised support attack detection method based on improved SVM-KNN algorithms

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on DDoS Attack Detection Methods and Mitigation Mechanisms Based on Software-Defined Networking; Li Hefei; China Master's Theses Full-text Database, Information Science and Technology; 2015-12-15; pp. 21-43 *


Similar Documents

Publication Publication Date Title
CN109299741B (en) Network attack type identification method based on multi-layer detection
CN107633255B (en) Rock lithology automatic identification and classification method under deep learning mode
CN110287983B (en) Single-classifier anomaly detection method based on maximum correlation entropy deep neural network
CN109492026B (en) Telecommunication fraud classification detection method based on improved active learning technology
CN110213222A (en) Network inbreak detection method based on machine learning
CN111181939A (en) Network intrusion detection method and device based on ensemble learning
CN109902740B (en) Re-learning industrial control intrusion detection method based on multi-algorithm fusion parallelism
CN108958217A (en) A kind of CAN bus message method for detecting abnormality based on deep learning
CN109446804B (en) Intrusion detection method based on multi-scale feature connection convolutional neural network
CN111062036A (en) Malicious software identification model construction method, malicious software identification medium and malicious software identification equipment
CN113542241B (en) Intrusion detection method and device based on CNN-BiGRU hybrid model
CN111507385B (en) Extensible network attack behavior classification method
CN109886284B (en) Fraud detection method and system based on hierarchical clustering
CN112560596B (en) Radar interference category identification method and system
CN113922985A (en) Network intrusion detection method and system based on ensemble learning
CN110245693B (en) Key information infrastructure asset identification method combined with mixed random forest
CN110414587A (en) Depth convolutional neural networks training method and system based on progressive learning
CN114492768A (en) Twin capsule network intrusion detection method based on small sample learning
CN115801374A (en) Network intrusion data classification method and device, electronic equipment and storage medium
CN109583519A (en) A kind of semisupervised classification method based on p-Laplacian figure convolutional neural networks
CN115577357A (en) Android malicious software detection method based on stacking integration technology
CN111178196B (en) Cell classification method, device and equipment
CN113067798A (en) ICS intrusion detection method and device, electronic equipment and storage medium
CN110581840B (en) Intrusion detection method based on double-layer heterogeneous integrated learner
CN109617864B (en) Website identification method and website identification system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant