CN112784881A - Network abnormal flow detection method, model and system - Google Patents

Network abnormal flow detection method, model and system

Info

Publication number
CN112784881A
CN112784881A
Authority
CN
China
Prior art keywords
model
target network
gru
information
characteristic
Prior art date
Legal status
Granted
Application number
CN202110013425.6A
Other languages
Chinese (zh)
Other versions
CN112784881B (en)
Inventor
史增树
杜怡曼
杨滨茂
麻文刚
Current Assignee
Beijing Southwest Jiaotong University Shengyang Technology Co ltd
Original Assignee
Beijing Southwest Jiaotong University Shengyang Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Southwest Jiaotong University Shengyang Technology Co ltd
Priority to CN202110013425.6A
Publication of CN112784881A
Application granted
Publication of CN112784881B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Abstract

The invention provides a network abnormal traffic detection method, model and system based on a residual gated recurrent unit (Re-GRU) and integrated dynamic Extreme Learning Machine (ELM) optimization. First, a feature optimization method based on the Fisher Score and the maximum information coefficient is established. Second, the original GRU candidate-hidden-state activation function is replaced with an unsaturated activation function, and a residual structure is introduced into the GRU candidate hidden state, which avoids the vanishing-gradient problem, makes the network more sensitive to gradient changes, and thereby relieves network degradation. The model is then further optimized into a bidirectional residual GRU structure, improving its ability to extract network traffic features. Finally, a two-step-game integrated dynamic ELM traffic detection method is proposed, in which the fully connected layers and the Dropout layer relieve overfitting so as to improve detection precision, and the detection result is output. Effectiveness is verified by building an experimental simulation model and comparing results under different parameters; compared with traditional detection methods, the method achieves better detection performance and accuracy when detecting abnormal network traffic.

Description

Network abnormal flow detection method, model and system
Technical Field
The invention belongs to the technical field of the Internet, and relates to a method and a model for detecting network abnormal traffic, as well as a method and a system for training the model.
Background
With the rapid development of network technology, network structures have become increasingly complex, so the risk of network intrusion and abnormal-traffic attacks keeps growing, and identifying the various kinds of network intrusion has become a problem of wide concern. The continuous growth of network scale, network rates and intrusion types confronts intrusion detection techniques with ever greater challenges. Therefore, designing a network intrusion detection method suited to complex and increasingly intelligent network environments, while improving detection precision, reducing the missed-report rate and improving overall detection performance, has become a core problem in the related fields.
Many detection methods have been proposed for different network environments. Among them, classification decision methods based on traffic feature extraction and optimization are the current mainstream; they comprise steps such as feature optimization, sample training and classification decision. Because network traffic data is large in volume and difficult to use directly for abnormal-traffic classification, data preprocessing and feature selection are generally performed after the data is obtained, and the anomaly-detection classification of network traffic is then carried out by combining classification techniques with continuous training of the model. Popular approaches to feature extraction and sample training are the deep neural network, the convolutional neural network and the recurrent neural network.
The recurrent neural network can capture long-range dependencies and is therefore widely used in fields such as data feature classification, sample training and classification, and machine translation. However, the conventional recurrent neural network becomes unstable due to the vanishing-gradient and exploding-gradient problems, and researchers proposed the Long Short-Term Memory (LSTM) network to mitigate them. Although LSTM is indeed effective, its gating is complex, which degrades detection results, and detection performance worsens as the number of network layers increases. Among existing methods, the Highway-Networks approach can relieve network degradation, but it increases the number of network parameters and the training time. The SRU network, which has attracted attention in recent years, also contains a Highway-Networks-like structure; by omitting temporal parameters in the recurrent unit, the SRU runs fast while supporting deeper network training, but the vanishing-gradient problem still appears when detecting in complex network environments. Furthermore, as discontinuous data streams arrive, the imbalance of the initial samples is difficult to hold constant. In current network traffic sample intervals, severe label imbalance and skewed imbalance degrees also occur; neighbourhood-sample resampling and undersampling make it difficult to guarantee the reliability of new samples, and the newly added data may cause overfitting. In particular, for multi-class label classification, decomposing the multi-class problem into several two-class problems leads to problems such as model redundancy and computational difficulty.
If the relationship between the model and the data cannot be established, the classifier is constrained by the many parameter categories, the optimal weights are difficult to obtain, and the stability of the model cannot be guaranteed.
Complex network traffic carries many types of sample labels, and the conventional Recurrent Neural Network (RNN) is prone to gradient vanishing and network degradation during network anomaly detection, resulting in low detection accuracy and a high false-negative rate. Designing a method that mitigates the vanishing gradient of the neural network is therefore of great significance for network traffic detection, making subsequent detection more effective and its performance superior.
Disclosure of Invention
To achieve the above purpose, the invention provides a network abnormal traffic detection method and model, together with a model training method and system, so that network traffic anomaly information can be monitored and the detection accuracy improved.
In view of this, a first aspect of the present invention provides a method for detecting abnormal network traffic, including:
acquiring target network traffic feature information according to target network traffic data, wherein the target network traffic feature information is obtained by preprocessing target original transceiving data, and the target original transceiving data belong to the target network traffic data;
determining target network traffic anomaly information corresponding to the target network traffic feature information through a target network abnormal traffic detection model, wherein the target network abnormal traffic detection model is generated by detection-accuracy training on traffic information;
and generating a network abnormal traffic detection result according to the target network traffic anomaly information.
Specifically, the preprocessing includes a feature screening step comprising sorting features by importance, deleting redundant information and maximizing information; the information-maximization processing uses a Support Vector Machine (SVM) model to predict the anomaly result of the target network traffic, and judges whether the target network traffic feature set is optimal according to that prediction result.
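The screening loop just described, ranking features by importance and growing the kept set while a classifier's accuracy still improves, can be sketched as follows. This is a minimal illustration, not the patented procedure: a nearest-centroid classifier stands in for the SVM, and the ranking uses a simple two-class Fisher score.

```python
import numpy as np

def fisher_rank(X, y):
    """Rank features by a simple two-class Fisher score (descending)."""
    a, b = X[y == 0], X[y == 1]
    num = (a.mean(0) - b.mean(0)) ** 2
    den = a.var(0) + b.var(0) + 1e-12
    return np.argsort(num / den)[::-1]

def centroid_accuracy(X, y):
    """Stand-in for the SVM check: nearest-centroid training accuracy."""
    c0, c1 = X[y == 0].mean(0), X[y == 1].mean(0)
    pred = (np.linalg.norm(X - c1, axis=1) < np.linalg.norm(X - c0, axis=1)).astype(int)
    return float((pred == y).mean())

def select_features(X, y):
    """Greedy forward selection over the ranking; keep a feature only if accuracy improves."""
    best_acc, best_set = 0.0, []
    for f in fisher_rank(X, y):
        trial = best_set + [int(f)]
        acc = centroid_accuracy(X[:, trial], y)
        if acc > best_acc:
            best_acc, best_set = acc, trial
    return best_set, best_acc

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                # features 2-4 are pure noise
y = (rng.random(200) < 0.5).astype(int)
X[y == 1, 0] += 3.0                          # features 0 and 1 are informative
X[y == 1, 1] += 3.0
chosen, acc = select_features(X, y)
```

The stopping rule (accuracy no longer improves) plays the role of the "is the feature set optimal" judgment in the text; a real implementation would use the SVM and a held-out set rather than training accuracy.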
Specifically, the detection-accuracy training comprises a feature extraction step, a data generation step and a classification step. The feature extraction step adopts a bidirectional residual GRU model in which the candidate-hidden-state activation function of the Gated Recurrent Unit (GRU) is changed to the linear rectification function (ReLU), effectively overcoming the vanishing-gradient problem, while a residual structure effectively relieves network degradation. Each GRU processing unit has only two gates and a single timing output, so the GRU has fewer parameters while still transmitting timing-related information effectively. The invention addresses gradient vanishing and network degradation by improving the GRU: since the final hidden-state output of a GRU is determined by the unit's output value and the hidden-state output at the previous moment, the improvement targets the GRU candidate-hidden-state formula. The candidate-hidden-state activation function is changed to the linear rectification function, a residual structure is added with reference to the residual networks used in CNNs and connected with the GRU activation function, and the Batch Normalization (BN) property is applied, thereby eliminating the gradient vanishing and network degradation of the traditional GRU.
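A single Re-GRU step of the kind described can be sketched as below. The update and reset gates are the standard GRU ones; replacing tanh with ReLU in the candidate state and adding the previous hidden state back as a shortcut follows our reading of the description, so the exact placement of the residual term is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def re_gru_step(x, h_prev, P):
    """One step of an illustrative residual GRU cell.

    z and r are the standard update/reset gates; the candidate state uses
    ReLU (unsaturated) instead of tanh and adds h_prev as a residual
    shortcut (our assumption about the residual placement)."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h_prev + P["bz"])   # update gate
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h_prev + P["br"])   # reset gate
    cand = np.maximum(0.0, P["Wh"] @ x + P["Uh"] @ (r * h_prev) + P["bh"]) + h_prev
    return (1.0 - z) * h_prev + z * cand                    # blended hidden state

rng = np.random.default_rng(1)
nx, nh = 4, 8
shapes = {"Wz": (nh, nx), "Uz": (nh, nh), "bz": (nh,),
          "Wr": (nh, nx), "Ur": (nh, nh), "br": (nh,),
          "Wh": (nh, nx), "Uh": (nh, nh), "bh": (nh,)}
P = {k: rng.normal(scale=0.1, size=s) for k, s in shapes.items()}
h = np.zeros(nh)
for t in range(10):                 # unroll over a short random sequence
    h = re_gru_step(rng.normal(size=nx), h, P)
```

Because the ReLU branch is unsaturated and the shortcut passes h_prev through unchanged, the gradient path back through time never has to cross a saturating nonlinearity, which is the mechanism the text relies on.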
Specifically, the data generation step adopts the form of a generative adversarial network: two dynamic integrated Extreme Learning Machines (ELMs) confront each other to generate minority-class sample fragments and balance the distribution of samples of different classes; the overall fitting degree is quantified with information entropy, and the generated minority-class fragments are screened according to the Principal Component Analysis (PCA) result of the sample data after the feature extraction step, solving the problems of skewed label balance and inaccurate reconstructed data. The classification step adopts the Adaboost ensemble learning method to combine individual prediction models into a strong classifier; the individual prediction models are ELM structures, also called base classifiers, i.e. the classification step combines the base classifiers into a strong classifier.
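The base classifier named above is an Extreme Learning Machine: a single hidden layer whose input weights are random and fixed, with the output weights solved in closed form by least squares. A minimal sketch follows; the adversarial pairing and the Adaboost combination of the patent are not reproduced here.

```python
import numpy as np

class ELM:
    """Basic Extreme Learning Machine classifier (illustrative)."""
    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Random sigmoid hidden-layer features; W and b are never trained.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        T = np.eye(int(y.max()) + 1)[y]                  # one-hot targets
        self.beta = np.linalg.pinv(self._hidden(X)) @ T  # least-squares output weights
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(4, 1, (100, 3))])
y = np.repeat([0, 1], 100)
acc = float((ELM(n_hidden=50, seed=2).fit(X, y).predict(X) == y).mean())
```

Because training reduces to one pseudoinverse, an ELM is cheap enough to retrain repeatedly, which is what makes it usable both as a dynamic player in the generation game and as an Adaboost base classifier.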
Specifically, the invention uses an unsaturated activation function to overcome the vanishing-gradient problem while borrowing the residual structure of the Convolutional Neural Network (CNN) to relieve network degradation, and on the basis of the gated recurrent neural network (GRU) designs an improved network abnormal traffic detection method optimized by a residual Gated Recurrent Unit (Re-GRU) and an integrated dynamic Extreme Learning Machine (ELM).
A second aspect of the invention provides a network abnormal traffic detection model, comprising a preprocessing model, a feature extraction model, a data generation model and a classification model;
the preprocessing model selects target network traffic features based on an improved Fisher Score and the maximum information coefficient, predicts the anomaly result of the target network traffic with a Support Vector Machine (SVM) model, and judges whether the target network traffic feature set is optimal according to that prediction result;
the feature extraction model extracts the target network traffic features based on a bidirectional residual GRU model;
the data generation model adopts the form of a generative adversarial network (GAN), generates minority-class sample fragments through the mutual confrontation of two dynamic ELMs, quantifies the overall fitting degree with information entropy, and screens the generated fragments according to the principal component analysis result of the sample data after the feature extraction step;
the classification model adopts an ensemble learning method to predict the target network traffic anomaly result using several base classifiers, each of ELM structure.
Specifically, the bidirectional residual GRU model changes the original GRU candidate-hidden-state activation function into an unsaturated activation function, preferably the linear rectification function, effectively overcoming the vanishing-gradient problem, and introduces a residual structure into the GRU candidate hidden state, making the model more robust to long-sequence features and effectively relieving network degradation.
Specifically, after the residual structure is combined with the GRU, the invention proposes a different GRU network for each training sequence direction, namely a bidirectional residual GRU structure. The optimized bidirectional residual GRU model can provide the output layer with the history of the input sequence and provide each time node of the input sequence with future information. It contains six unique weights, each reused at every time step: input layer to forward hidden layer (w1), forward hidden layer to forward hidden layer (w2), input layer to backward hidden layer (w3), forward hidden layer to output layer (w4), backward hidden layer to backward hidden layer (w5) and backward hidden layer to output layer (w6), as shown in fig. 4. In a specific embodiment, the invention preferably uses a 128-kernel bidirectional residual GRU model to extract network data set features, as follows: feature selection is performed on the network data set by the Fisher Score and maximum-information-coefficient preprocessing method to obtain a redundancy-free, optimal feature subset; the feature-set data then enter the bidirectional residual GRU network structure for feature extraction, with a dropout layer added to prevent overfitting and accelerate training; the multidimensional input is then flattened to one dimension by one Flatten layer; and finally all network layers are integrated by two fully connected layers. The first layer has 128 kernels with the ReLU activation function, and the second layer has two kernels, matching the output dimension, with the sigmoid activation function.
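The bidirectional wiring, a forward pass and a backward pass over the same input whose hidden states are concatenated per time step before a dense head, can be sketched structurally. A plain tanh cell stands in for the residual GRU, and the weight names w1, w2, w3, w5 mirror the text; this illustrates the layout only, not the patented model.

```python
import numpy as np

def run_rnn(seq, Wx, Wh):
    """Run a simple tanh recurrent cell (stand-in for the residual GRU)
    over a sequence, returning the hidden state at every step."""
    h = np.zeros(Wh.shape[0])
    out = []
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
        out.append(h)
    return np.array(out)

def bidirectional_features(seq, params):
    """Concatenate forward-pass and backward-pass hidden states per step,
    mirroring the six-weight layout: w1/w2 drive the forward chain and
    w3/w5 the backward chain."""
    fwd = run_rnn(seq, params["w1"], params["w2"])
    bwd = run_rnn(seq[::-1], params["w3"], params["w5"])[::-1]
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(3)
T, nx, nh = 6, 4, 8
params = {"w1": rng.normal(scale=0.3, size=(nh, nx)),
          "w2": rng.normal(scale=0.3, size=(nh, nh)),
          "w3": rng.normal(scale=0.3, size=(nh, nx)),
          "w5": rng.normal(scale=0.3, size=(nh, nh))}
feats = bidirectional_features(rng.normal(size=(T, nx)), params)
flat = feats.reshape(-1)                             # Flatten layer
W_out = rng.normal(scale=0.1, size=(2, flat.size))   # dense head (w4/w6 analogue)
logits = W_out @ flat
```

At each step the concatenated vector carries both the history (forward chain) and the future (backward chain) of the sequence, which is what lets the dense head use past and future context simultaneously.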
Specifically, the data generation model adopts the form of a generative adversarial network (GAN), generates minority-class sample fragments through the mutual confrontation of two dynamic ELMs, quantifies the overall fitting degree with information entropy, and screens the generated fragments according to the principal component analysis result of the sample data after the feature extraction step.
Specifically, the data generation model obtains sample data indistinguishable from the real data by letting the generation model and the discrimination model play a game against each other. The generator G obtains a data sequence similar to the original data fragment through its feature representation; the discriminator D distinguishes real data from generated data by binary classification. The two confront each other to improve the model's mapping ability until neither the generator G nor the discriminator D can further improve itself, at which point sufficiently realistic generated samples are obtained, achieving data generation and augmentation.
The third aspect of the present invention provides a method for training a network abnormal traffic detection model, including:
acquiring target network traffic data and preprocessing it to acquire target traffic feature information, wherein the preprocessing selects target network traffic features based on an improved Fisher Score and the maximum information coefficient, predicts the anomaly result of the target network traffic with a Support Vector Machine (SVM) model, judges whether the target network traffic feature set is optimal according to that prediction result, and screens the target feature set;
acquiring traffic feature optimization information, wherein a bidirectional residual GRU model performs feature extraction on the target network traffic features; the bidirectional residual GRU model changes the original GRU candidate-hidden-state activation function into an unsaturated activation function, preferably the linear rectification function, effectively overcoming the vanishing-gradient problem, and introduces a residual structure into the GRU candidate hidden state, effectively relieving network degradation;
generating minority-class sample fragments according to the traffic feature optimization information through the mutual confrontation of two dynamic ELMs, thereby solving the problems of skewed label balance and inaccurate reconstructed data;
and training on the traffic feature optimization information and the minority-class sample fragments to obtain the integrated dynamic ELM model parameters and generate a target network traffic anomaly detection model, which is used for detecting traffic anomaly information.
In a specific embodiment, the invention preferably uses a 128-kernel bidirectional residual GRU model to extract network data set features, as follows: feature selection is performed on the network data set by the Fisher Score and maximum-information-coefficient preprocessing method to obtain a redundancy-free, optimal feature subset; the feature-set data then enter the bidirectional residual GRU network structure for feature extraction, with a dropout layer added to prevent overfitting and accelerate training; the multidimensional input is then flattened to one dimension by one Flatten layer; and finally all network layers are integrated by two fully connected layers. The first layer has 128 kernels with the ReLU activation function, and the second layer has two kernels, matching the output dimension, with the sigmoid activation function.
A fourth aspect of the present invention provides a system for detecting network abnormal traffic, including:
an acquisition module that acquires target network traffic data;
a preprocessing module that preprocesses the target network traffic data to acquire target traffic feature information;
a feature extraction module that acquires traffic feature optimization information by performing feature extraction on the target network traffic features with a bidirectional residual GRU model;
a data generation module that generates minority-class sample fragments according to the traffic feature optimization information, the fragments being generated through the mutual confrontation of two dynamic ELMs based on the extraction result of the bidirectional residual GRU model;
and a detection module that trains on the traffic feature optimization information and the minority-class sample fragments to obtain the integrated dynamic ELM model parameters and generate a target network traffic anomaly detection model for detecting traffic anomaly information.
Specifically, the preprocessing module selects target network traffic features based on the improved Fisher Score and the maximum information coefficient, predicts the anomaly result of the target network traffic with an SVM model, judges whether the target network traffic feature set is optimal according to that prediction result, and screens the target feature set.
A fifth aspect of the present invention provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the methods of the above aspects.
According to the target network traffic anomaly detection method, the original features are screened in the preprocessing step to obtain a target feature set; the GRU model is improved into a bidirectional residual GRU model for feature extraction, making it more robust to long-sequence features and relieving network degradation; meanwhile, a dynamic ELM game is introduced to generate minority-class sample fragments, realizing sample data generation and augmentation and alleviating the problems caused by sample imbalance, thereby improving the precision and accuracy of target network traffic anomaly detection.
The network abnormal traffic detection system improves detection precision through the combination of the preprocessing model, the feature extraction model, the data generation model and the classification model. Specifically, the preprocessing model effectively screens traffic features; the feature extraction model further processes the screening result; the data generation model reduces the problems caused by sample imbalance by generating minority-class sample fragments; and the classification model combines several base classifiers into a strong classifier by ensemble learning, improving classification accuracy. These models are combined into an organic whole and jointly achieve the technical effect of more accurate network abnormal traffic detection.
In a specific embodiment, the invention uses an unsaturated activation function to overcome the vanishing-gradient problem while relieving network degradation by means of the residual structure of the Convolutional Neural Network (CNN), and on the basis of the gated recurrent neural network (GRU) designs an improved network abnormal traffic detection method optimized by a residual gated recurrent unit (Re-GRU) and an integrated dynamic Extreme Learning Machine (ELM).
First, the original GRU candidate-hidden-state activation function is changed into an unsaturated activation function and residual information is introduced into the GRU candidate hidden state, avoiding the vanishing-gradient problem caused by a saturated activation function while making the network more sensitive to gradient changes, thereby relieving network degradation and making the network more robust to long-sequence features. On this basis the model is further optimized into a bidirectional residual GRU structure, so that its performance in extracting network traffic features is superior.
Second, aiming at the low classification precision over the many sample labels in complex network traffic, a two-step-game integrated dynamic ELM method is designed: two dynamic ELMs confront each other, solving the problems of skewed label balance and inaccurate reconstructed data. The strategy uses a dynamic ELM game model to generate minority-class sample fragments, balances the distribution of different sample classes and ensures the authenticity of each fragment; meanwhile it quantifies the overall fitting degree with information entropy, establishes the relation between weight and loss, calculates the combined weights with a game-theoretic set model, forms a stable network architecture and improves the model's fit to rapidly changing data;
thirdly, the full connection layer and the Dropout layer are used for relieving the overfitting problem, so that the detection precision is further improved, and a final detection result is output;
fourthly, on the basis of the NSL-KDD data set, firstly, an optimal feature subset is obtained according to the improved Fisher Score and the maximum information coefficient feature selection method, an accuracy rate comparison experiment under different training sets and test set combinations is utilized to obtain an optimal training combination, and meanwhile, the training combination is compared with different feature selection methods, and the effectiveness and superiority of the provided feature selection preprocessing method are verified. Then, selecting a part of sample data for training, respectively obtaining ROC curves under two classes and multiple classes, and simultaneously obtaining index parameter comparison analysis of different classification methods for network flow characteristic detection classes;
and finally, the effectiveness of the method is further verified by analysing the PPL value, the time consumption and the stability of various deep learning methods during detection.
The final parameter comparison shows that the detection method based on the bidirectional residual GRU and integrated dynamic ELM optimization has good performance and appropriate time complexity, and can be used effectively for detecting network traffic features.
In a specific embodiment, the method for detecting abnormal network traffic of the present invention includes the following steps.
1. Network traffic feature preprocessing
By screening and preprocessing the original network traffic features, the invention reduces the redundant features in the obtained network traffic data, so that the feature classification results achieve higher precision in subsequent data training. The specific preprocessing steps are as follows.
The Fisher Score and the Maximum Information Coefficient (MIC) are combined to construct a new network traffic feature selection method. First, considering the uneven distribution and overlap of network traffic features, a feature-importance ranking rule is constructed with the Fisher Score calculation method; then, taking into account the influence of redundant features on effective feature representation, the relevance between features is evaluated with the maximum-information-coefficient method and the redundant features are updated and re-sorted; on this basis, with classification accuracy as the criterion, a network traffic feature selection method based on the Fisher Score and the maximum information coefficient is established.
1.1 Fisher Score related theory
Suppose there is a sample set x_k ∈ R^m, k = 1, 2, 3, ..., m, where the numbers of samples in class s and class q are n_s and n_q respectively. The Fisher Score of the ith feature is defined as:

F_i = \frac{n_s (u_i^s - u_i)^2 + n_q (u_i^q - u_i)^2}{\sum_{k=1}^{n_s} (x_{k,i}^s - u_i^s)^2 + \sum_{k=1}^{n_q} (x_{k,i}^q - u_i^q)^2}   (1)

wherein: u_i is the mean of the ith feature over the whole data set; u_i^s and u_i^q are its means over the class-s and class-q data sets respectively; and x_{k,i}^s and x_{k,i}^q are the values of the ith feature of the kth class-s and kth class-q sample respectively. The larger the Fisher Score, the stronger the discriminative power of the feature; however, when the model is used for the two-class problem, the consistency of the two classes of features is not considered, so the idea of cross coefficients is proposed as follows:
M_k = m_{sk} + m_{qk} - m_{sqk}   (2)

wherein: M_k represents the number of samples of feature x_k in the two classes s and q, m_{sk} is the number of class-s samples of x_k, m_{qk} is the number of class-q samples of x_k, and m_{sqk} is the number of samples whose values of the feature are the same in class s and class q. The cross-coefficient calculation can be used for the uniformly distributed case; for the non-uniformly distributed case, a method for calculating the between-class divergence among multiple classes is also provided:
$$D_k=\sum_{s<q}\frac{n_s\,n_q}{N^2}\left(\bar{u}_k^s-\bar{u}_k^q\right)^2 \tag{3}$$

where $\sum_{s<q}$ denotes selecting all possible combinations of two classes s and q and summing; n_s and n_q represent the numbers of samples in the sth and qth classes, respectively; N is the total number of samples; and $\bar{u}_k^s$ and $\bar{u}_k^q$ represent the means of the kth feature over the class-s and class-q samples, respectively. Combining the treatment of overlapping and unevenly distributed features, the calculation of the multi-class Fisher Score value is optimized as follows:
$$F_k=\frac{\sum_{j}\frac{n_j}{N}\big(\bar{u}_k^j-\bar{u}_k\big)^2}{\sum_{j}\sum_{i=1}^{n_j}\big(x_{i,k}^j-\bar{u}_k^j\big)^2} \tag{4}$$

where N is the total number of samples after removing duplicate features; n_j is the number of samples in the jth class; $x_{i,k}^j$ is the value of the kth feature of the ith sample of class j; $\bar{u}_k^j$ is the mean of the kth feature over the class-j samples; and $\bar{u}_k$ is its mean over all samples. Through this process, a feature-index importance ranking rule can be constructed, and the features are then optimally ranked to obtain their importance.
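As a concrete illustration, the multi-class Fisher Score ranking described above can be sketched in plain Python (function name and toy data are illustrative, not part of the invention; constant factors that do not change the ranking are dropped):

```python
def fisher_score(samples, labels, feat):
    """Multi-class Fisher Score of one feature: between-class scatter
    of the class means divided by the within-class scatter around
    each class mean."""
    classes = sorted(set(labels))
    values = [s[feat] for s in samples]
    overall_mean = sum(values) / len(values)
    between, within = 0.0, 0.0
    for c in classes:
        v = [s[feat] for s, l in zip(samples, labels) if l == c]
        mean_c = sum(v) / len(v)
        between += len(v) * (mean_c - overall_mean) ** 2
        within += sum((x - mean_c) ** 2 for x in v)
    return between / within if within else float("inf")

# toy data: feature 0 separates the classes well, feature 1 does not
X = [(0.1, 5.0), (0.2, 4.9), (2.1, 5.1), (2.2, 5.0)]
y = [0, 0, 1, 1]
scores = [fisher_score(X, y, f) for f in range(2)]
```

Ranking the features by these scores in descending order gives the first-stage importance ordering.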
1.2 Updating the ranking with the maximum information coefficient
Although the Fisher Score model can evaluate the importance of features, it cannot determine the correlation between features or the redundant features in a feature set. Therefore, the invention uses the maximum information coefficient, a metric from information theory, to mine the non-functional dependence relationships among features, specifically as follows:
Suppose there is a finite ordered-pair set D = {(x_i, y_i), i = 1, 2, 3, …, n}. The pairs (x_i, y_i) of D are plotted as scatter points and partitioned into an X×Y grid; the mutual information I(X;Y) is calculated in each grid, and the maximum of I(X;Y) over the different partition schemes is selected as the mutual information value of the X×Y partition, recorded as Max(I(X;Y)). After the maximum mutual information value is obtained, it is normalized according to equation (5) to obtain the maximum information coefficient, denoted Mic(I(X;Y)):

$$Mic(x,y)=\max_{x\,y<B(n)}\frac{\mathrm{Max}\big(I(x,y)\big)}{\log_2\min(x,y)} \tag{5}$$

where Max(I(x,y)) represents the maximum mutual information value, and B is a growth function bounding the grid partition x×y, which varies with the number of data samples n. The correlation between features is determined according to Mic(x,y). Given an n-sample feature set F = {f_1, f_2, …, f_k}, the correlation of any two features f_i, f_j of the set is denoted Mic(f_i, f_j). The larger the value of Mic(f_i, f_j), the stronger the redundancy between features f_i and f_j; when Mic(f_i, f_j) = 0, features f_i and f_j are mutually independent. A redundant feature is therefore defined as follows: for the feature set F, if the Fisher Score values of features f_i and f_j satisfy F_i > F_j and Mic(f_i, f_j) > 0.8, then f_j is regarded as a redundant feature of f_i.
1.3 Improved feature selection method based on Fisher Score and the maximum information coefficient
The flow of the Fisher Score and maximum-information-coefficient feature selection model designed by the invention is shown in fig. 1 and mainly comprises two stages.
As shown in fig. 1, the feature subset selected in two stages is used as the final feature subset, and the specific steps are as follows:
inputting: characteristic data set F ═ F1,f2,...,fk}
And (3) outputting: optimal feature subset Fout
The first stage: feature importance analysis
Step 1: calculate the Fisher Score value F_k corresponding to each feature in the feature data set F by equation (4).
Step 2: sort the feature set F in descending order of F_k.
The second stage: feature redundancy analysis
Step 1: traverse F in the sorted order; successively take the feature f_i with the larger F_k value, compare it with each feature f_j with a smaller F_k value, and calculate Mic(f_i, f_j);
Step 2: judge whether Mic(f_i, f_j) > 0.8; if so, adjust feature f_j to the tail of the sequence, then update F and its ordering;
Step 3: when the traversal is complete, output F_out.
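Under the assumption that per-feature Fisher Scores and a pairwise MIC estimate are available, the two-stage procedure above can be sketched as follows (names are illustrative, and the `mic` lookup here is a stub standing in for a real MIC estimator):

```python
def select_features(scores, mic, threshold=0.8):
    """Two-stage selection: rank by Fisher Score (stage 1), then
    demote to the tail any feature whose MIC with a higher-ranked,
    kept feature exceeds the threshold (stage 2).
    scores: {feature: fisher_score}; mic(i, j): pairwise MIC."""
    ranked = sorted(scores, key=scores.get, reverse=True)   # stage 1
    kept, redundant = [], []
    for f in ranked:                                        # stage 2
        if any(mic(g, f) > threshold for g in kept):
            redundant.append(f)   # adjusted to the tail of the sequence
        else:
            kept.append(f)
    return kept + redundant

# illustrative scores and a toy MIC table: 'b' is redundant with 'a'
scores = {"a": 3.0, "b": 2.0, "c": 1.0}
table = {("a", "b"): 0.9, ("a", "c"): 0.1, ("b", "c"): 0.2}
mic = lambda i, j: table.get((i, j), table.get((j, i), 0.0))
order = select_features(scores, mic)
```

The head of the returned ordering (before the demoted tail) is the candidate optimal subset F_out, whose size would then be validated by the classification-accuracy check.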
2 network flow characteristic extraction optimization based on bidirectional residual GRU
A non-saturating activation function can effectively overcome gradient vanishing, and the residual structure in a Convolutional Neural Network (CNN) can effectively alleviate network degradation. Drawing on these characteristics, the invention proposes a residual Gated Recurrent Unit (Re-GRU) on the basis of the Gated Recurrent Unit (GRU) to alleviate the problems of gradient vanishing and network degradation, and continues to optimize the model by introducing the residual structure, so that the optimized bidirectional residual GRU structure extracts network traffic features with superior performance.
2.1 Residual gated recurrent unit
2.1.1 Gated recurrent unit
The GRU is a simplified improvement of the LSTM: each GRU processing unit has only two gates and a single timing output, so the GRU has fewer parameters while still transmitting timing-related information effectively. The unit structure is shown in fig. 2, and the defining formulas are as follows:
$$z_t=\sigma\big(E_z x_t+R_z h_{t-1}+Q_z\big) \tag{6}$$
$$s_t=\sigma\big(E_s x_t+R_s h_{t-1}+Q_s\big) \tag{7}$$
$$a_t=\mathrm{Tanh}\big(E_a x_t+R_a(h_{t-1}*s_t)+Q_a\big) \tag{8}$$
$$h_t=(1-z_t)*h_{t-1}+z_t*a_t \tag{9}$$
where x_t represents the input value at time t of the current layer; h_{t-1} is the state output value at time t−1; z_t and s_t are the update gate and reset gate, whose activation function σ is the Sigmoid function; a_t is the candidate hidden state at time t, with Tanh the hyperbolic tangent activation function; h_t represents the state vector at the current time; E_z, E_s, E_a and R_z, R_s, R_a are model weight parameters; and Q_z, Q_s, Q_a are bias vectors.
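A single GRU step following equations (6)–(9) can be sketched with scalar states for clarity (the weights E, R and biases Q below are illustrative scalars, not trained values):

```python
import math

def gru_step(x_t, h_prev, E, R, Q):
    """One scalar GRU step: update gate z, reset gate s, candidate a,
    and the convex combination producing the new hidden state."""
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    z = sig(E["z"] * x_t + R["z"] * h_prev + Q["z"])               # eq. (6)
    s = sig(E["s"] * x_t + R["s"] * h_prev + Q["s"])               # eq. (7)
    a = math.tanh(E["a"] * x_t + R["a"] * (h_prev * s) + Q["a"])   # eq. (8)
    return (1.0 - z) * h_prev + z * a                              # eq. (9)

E = {"z": 0.5, "s": 0.5, "a": 1.0}
R = {"z": 0.3, "s": 0.3, "a": 0.8}
Q = {"z": 0.0, "s": 0.0, "a": 0.0}
h = 0.0
for x in [1.0, -0.5, 0.25]:   # a short input sequence
    h = gru_step(x, h, E, R, Q)
```

Because h_t is a convex combination of h_{t-1} and a Tanh output, the state stays bounded in (−1, 1) for this initialization.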
2.1.2 residual optimization
Gradient vanishing and network degradation are particularly severe in recurrent neural networks, and the invention addresses them through improvements to the GRU. In the GRU algorithm, the candidate hidden state and the hidden-state output at the previous time jointly determine the final hidden-state output of the GRU. Therefore, the improvement of the invention mainly targets the GRU candidate-hidden-state formula and consists of the following 3 points:
(1) Non-saturating activation function
The GRU candidate-hidden-state activation function is changed to the Rectified Linear Unit (ReLU), so the improved network avoids the gradient vanishing caused by a saturating function and can support deeper network training. The ReLU activation also makes information transfer more direct; compared with a saturating activation function it has no saturation-induced gradient-vanishing problem and better matches the transfer of residual information. Therefore, a_t is redefined as:

$$a_t=\mathrm{ReLU}\big(E_a x_t+R_a(h_{t-1}*s_t)+Q_a\big) \tag{10}$$

where a_t is the candidate hidden state at time t; E_a, R_a are model weight parameters; x_t represents the input value at time t of the current layer; h_{t-1} is the state output value at time t−1; s_t is the reset gate at time t; and Q_a is a bias vector.
(2) Adding residual connections
The GRU is improved by borrowing the residual-network structure from the CNN, thereby alleviating the problems of gradient vanishing and network degradation in the GRU. Specifically, a residual connection is introduced into the optimized a_t expression. The residual information is the not-yet-activated candidate hidden state of the previous layer, $\hat{a}_t^{k-1}$. Each Re-GRU layer has a residual connection, and the improved candidate hidden state is:

$$\hat{a}_t^k=E_a^k x_t^k+R_a^k\big(h_{t-1}^k*\tilde{s}_t^k\big)+U^k\hat{a}_t^{k-1}+\tilde{Q}_a^k \tag{11}$$
$$a_t^k=\mathrm{ReLU}\big(\hat{a}_t^k\big) \tag{12}$$

where $a_t^k$ represents the candidate-hidden-state output of layer k at time t; $\hat{a}_t^{k-1}$ is the not-yet-activated candidate hidden state of layer k−1; $x_t^k$ is the input value of layer k at time t; $E_a^k$, $R_a^k$ are the weight parameters of layer k; $h_{t-1}^k$ represents the layer-k state vector at time t−1; $\hat{a}_t^k$ is the not-yet-activated candidate hidden state of layer k; $U^k$ is the dimension-matching matrix of layer k; $\tilde{s}_t^k$ is the updated layer-k reset vector; and $\tilde{Q}_a^k$ is the updated layer-k bias vector. When the dimensions of adjacent layers of the network are the same, the dimension-matching matrix is not needed.
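The residual candidate state above differs from eq. (10) only by the pre-activation term carried over from the previous layer. A scalar sketch (the dimension-matching matrix U is taken as a scalar here, and all weights are illustrative):

```python
def re_gru_candidate(x_t, h_prev, s_t, a_hat_prev_layer, E_a, R_a, U, Q_a):
    """Pre-activation candidate with the residual term U * a_hat from
    the previous layer, then ReLU activation; the pre-activation value
    is returned as well, since it is the residual passed to layer k+1."""
    a_hat = E_a * x_t + R_a * (h_prev * s_t) + U * a_hat_prev_layer + Q_a
    a = max(0.0, a_hat)          # ReLU
    return a_hat, a

# layer k-1 produced pre-activation 0.4; feed it into layer k
a_hat_k, a_k = re_gru_candidate(x_t=1.0, h_prev=0.2, s_t=0.5,
                                a_hat_prev_layer=0.4,
                                E_a=0.6, R_a=0.3, U=1.0, Q_a=0.0)
```

Returning the pre-activation value is the design point: stacking layers this way gives each layer a direct, un-activated path back to earlier layers, which is what keeps the gradient from vanishing.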
(3) Batch normalization
The gradient explosion problem is alleviated by normalizing the pre-activation mean and variance of each training mini-batch. By changing the GRU activation function, adding the residual connection, and applying Batch Normalization (BN), the gradient vanishing and network degradation of the traditional GRU can be eliminated. The cell formulas of the Re-GRU layer k are shown in formulas (13) to (17), and the unit structure is shown in fig. 3. Since the batch normalization property eliminates bias, the bias vectors above are omitted.

$$z_t^k=\sigma\Big(\mathrm{BN}\big(E_z^k x_t^k\big)+\mathrm{BN}\big(R_z^k h_{t-1}^k\big)\Big) \tag{13}$$
$$\tilde{s}_t^k=\sigma\Big(\mathrm{BN}\big(E_s^k x_t^k\big)+\mathrm{BN}\big(R_s^k h_{t-1}^k\big)\Big) \tag{14}$$
$$\hat{a}_t^k=\mathrm{BN}\big(E_a^k x_t^k\big)+R_a^k\big(h_{t-1}^k*\tilde{s}_t^k\big)+U^k\hat{a}_t^{k-1} \tag{15}$$
$$a_t^k=\mathrm{ReLU}\big(\hat{a}_t^k\big) \tag{16}$$
$$h_t^k=\big(1-z_t^k\big)*h_{t-1}^k+z_t^k*a_t^k \tag{17}$$

where $z_t^k$ is the update vector of layer k at time t; $a_t^k$ represents the candidate-hidden-state output of layer k at time t; $x_t^k$ is the input value of layer k at time t; $h_t^k$ is the layer-k state vector at time t; $\tilde{s}_t^k$ is the layer-k reset vector; σ is the Sigmoid function; $E_z^k$, $E_s^k$, $E_a^k$ and $R_z^k$, $R_s^k$, $R_a^k$ are the model weight parameters of layer k; $U^k$ is the dimension-matching matrix of layer k; $\hat{a}_t^k$ is the not-yet-activated candidate hidden state of layer k; and $\hat{a}_t^{k-1}$ is the not-yet-activated candidate hidden state of layer k−1.
2.2 Bidirectional GRU structure optimization
After the GRUs are optimized through the residual structure, a separate GRU network is provided for each training-sequence direction, forming a bidirectional residual GRU structure. The optimized model can not only provide the output layer with the historical information of the input sequence but also provide future information for each time node of the input sequence. It comprises six unique weights, each reused at every time step, respectively: input layer to forward hidden layer (w_1), forward hidden layer to forward hidden layer (w_2), input layer to backward hidden layer (w_3), forward hidden layer to output layer (w_4), backward hidden layer to backward hidden layer (w_5), and backward hidden layer to output layer (w_6), as shown in fig. 4.
Therefore, a 128-kernel bidirectional residual GRU is finally used to extract the network data set features. The specific process is as follows: feature selection is first performed on the network data set by the Fisher Score and maximum-information-coefficient preprocessing method to obtain a redundancy-free, optimal feature subset; the feature-set data then enter the bidirectional residual GRU network structure for feature extraction, with a dropout layer added to prevent over-fitting and accelerate the training process; the multidimensional output is then flattened to one dimension through 1 Flatten layer; and finally all network layers are integrated through 2 fully connected layers. The first layer has 128 kernels with the ReLU activation function; the second layer has 2 kernels, the same size as the output dimension, with the Sigmoid activation function. The number of training iterations is 350, the batch size d is 2500, the learning rate is set to 0.005, and Adam is selected as the optimizer; the specific optimization process structure diagram is shown in fig. 5.
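The bidirectional pass can be illustrated independently of the cell internals: the same sequence is processed forward and backward, and the per-step hidden states of the two directions are paired. A minimal sketch with a stand-in step function (not the 128-kernel production configuration):

```python
def bidirectional(seq, step, h0=0.0):
    """Run `step` over the sequence forward and backward and pair the
    hidden states per time step, as in a bidirectional (residual) GRU
    layer: each step sees both past and future context."""
    def run(xs):
        h, out = h0, []
        for x in xs:
            h = step(x, h)
            out.append(h)
        return out
    fwd = run(seq)
    bwd = run(seq[::-1])[::-1]   # re-align backward outputs in time
    return list(zip(fwd, bwd))   # concatenated feature per time step

# stand-in cell: a leaky accumulator in place of a full GRU step
step = lambda x, h: 0.5 * h + 0.5 * x
feats = bidirectional([1.0, 0.0, -1.0], step)
```

Swapping the stand-in `step` for a full Re-GRU cell yields the bidirectional residual structure; the paired outputs would then feed the Flatten and fully connected layers.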
3 Integrated dynamic ELM optimization
After the optimization of the network traffic feature model is completed, aiming at the low classification accuracy caused by the imbalanced multi-sample label sequence of the network traffic data set, the invention designs a two-step-game integrated dynamic ELM method that solves the problems of data-label balance deviation and inaccurate data reconstruction, thereby improving the detection accuracy.
3.1 Game-based ELM data generation
Minority-class segments are generated using the game-theory idea of two dynamic ELMs playing against each other. After data generation, redundant data are removed in conjunction with Principal Component Analysis (PCA), reducing the probability of over-fitting; the framework is shown in fig. 6.
The model adopts a Generative Adversarial Network (GAN): "fake but realistic" sample data are obtained through the mutual game of the generative model and the discriminative model. The generator G obtains, through feature representation, a data sequence similar to the original data fragment; the discriminator D distinguishes the real data from the generated data by binary classification. The two play against each other to improve the model mapping ability until neither the generator G nor the discriminator D can further improve its own performance, at which point the generated samples are sufficiently realistic. Consider the current sample fragment X_k; the overall characteristics of the minority-class samples are expressed as $\bar{X}_k$, and Z_i denotes a noise segment of the same size as X_k. N_m is the cumulative number of samples up to the current time period. Taking the expected mean value of each label as the judgment standard to distinguish the majority classes from the minority classes, specifically:

$$T_k=\frac{N_m}{L_m} \tag{18}$$

where L_m is the number of label types after the kth sample fragment arrives. At this time the imbalance ratio IR_k, given the minimum class size $N_{\min}^k$ and the maximum class size $N_{\max}^k$, can be expressed as

$$IR_k=\frac{N_{\min}^k}{N_{\max}^k} \tag{19}$$

The generation number $G_c^k$ for each minority class c^+ is determined by the sample distribution as

$$G_c^k=\left\lfloor N_{\max}^k-N_c^k\right\rfloor \tag{20}$$

where $N_{\max}^k$ denotes the number of samples of the largest class within the kth sample fragment, $N_c^k$ denotes the number of current minority-class samples, and ⌊·⌋ is rounding down. Two extreme learning machine structures form the generator G and the discriminator D, respectively. The discriminator D realizes the classic ELM structure: the input vector is a generated sample and the output vector is the original label. The number of hidden nodes is set equal to the number of input nodes, so the main objective of the discriminator is finally to minimize the output difference:

$$\min_{\beta_D}\left\|Y_D-H_D\beta_D\right\| \tag{21}$$

where Y_D is the actual value of the corresponding data segment, $Y_k=H_D\beta_D$ is the output label of the kth sample fragment, and H_D is the random matrix output by the hidden layer. The solution process calculates the model parameters through the M-P generalized inverse, i.e.

$$\beta_D=H_D^{\dagger}Y_D \tag{22}$$

where β_D is the output-layer weight, H_D is the random matrix output by the hidden layer of the extreme learning machine, and $H_D^{\dagger}$ is the M-P generalized inverse of H_D. The number of hidden-layer nodes of the generator G increases step by step from the initial number L to N_k; while the generator executes N_k times, the discriminator D executes once. When the generator G receives a random noise sequence Z as input, the output is the generated feature X_G; through the neural-network mapping, the incidence relation between the two is
$$X_G=H_G\beta_G \tag{23}$$
where H_G is the hidden-layer output matrix of G at the initial time. In the (L+1)th generation process, a hidden-layer node is added, whose output is denoted d, an N_k-dimensional random variable; the new generator G expression can be obtained as

$$H_{G,L+1}=\left[H_{G,L}\ \ d\right] \tag{24}$$
Throughout the data-generation process, the generator G operates by matrix transformation, which avoids the complex solution process of the dynamic ELM model and guarantees the convergence of the model. G and D are jointly optimized; the objective function is the minimum cross-entropy expectation, specifically expressed as
$$\min_G\max_D V(D,G)=\mathbb{E}_{X_k\sim P_{data}(X_k)}\big[\log D(X_k)\big]+\mathbb{E}_{Z\sim P_Z(Z)}\big[\log\big(1-D(G(Z))\big)\big] \tag{25}$$
where P_data(X_k) is the probability distribution of the original minority-class data fragments and P_Z(Z) is the probability distribution obeying Gaussian white noise. The generator G and the discriminator D are optimized alternately: under the initial condition the generator G randomly generates a group of data, and then the optimal discriminator D is calculated:
$$D^{*}(X)=\frac{P_{data}(X)}{P_{data}(X)+P_g(X)} \tag{26}$$
where P_g(Z) is the generator data probability distribution for which the minimum-maximum strategy attains the global optimal solution; the objective function is further described as

$$V\big(D^{*},G\big)=\mathbb{E}_{X\sim P_{data}}\left[\log\frac{P_{data}(X)}{P_{data}(X)+P_g(X)}\right]+\mathbb{E}_{X\sim P_g}\left[\log\frac{P_g(X)}{P_{data}(X)+P_g(X)}\right] \tag{27}$$
Any hidden-layer output matrix H_l satisfies the basic ELM mapping relationship, and G restores the data-fragment distribution; the optimal generator G is determined by the approximation error, whose analytical expression can be expressed as:

$$G^{*}=\arg\min_{G}\frac{1}{N_k}\sum_{i=1}^{N_k}f\big(D(G(Z_i))\big) \tag{28}$$

where l represents the number of hidden nodes of H_l and f(·) is an indicator function: in the discriminator D, the function returns 0 if the predicted value matches the category label, otherwise the function returns 1. If the recognition rate of D is close to 0.5 (within the interval [0.45, 0.55]), the discriminator is considered unable to distinguish the real data from the generated data, the condition that the generated data are sufficiently realistic is met, and the transient feature fragment data $X_G^k$ are obtained.
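The label bookkeeping that drives generation — the expected-mean threshold, the imbalance ratio, and the per-class generation count described above — can be sketched as follows (function and variable names are illustrative):

```python
import math
from collections import Counter

def generation_plan(labels):
    """Return (threshold, imbalance ratio, samples to generate per
    minority class): classes below the expected mean per label are
    minority, and each is topped up toward the largest class."""
    counts = Counter(labels)
    n_m, l_m = len(labels), len(counts)
    threshold = n_m / l_m                    # expected mean per label
    n_max, n_min = max(counts.values()), min(counts.values())
    ir = n_min / n_max                       # imbalance ratio
    plan = {c: math.floor(n_max - n)         # generation count per class
            for c, n in counts.items() if n < threshold}
    return threshold, ir, plan

threshold, ir, plan = generation_plan([0] * 8 + [1] * 2 + [2] * 2)
```

For the toy fragment above, classes 1 and 2 fall below the threshold and are each scheduled for generation up to the size of class 0.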
For the original data fragment, the principal-component boundary determines the maximum spatial distance. The kth sample-fragment features can be written in matrix form as follows:

$$X_k=\begin{bmatrix}x_{11}&\cdots&x_{1m}\\\vdots&\ddots&\vdots\\x_{n1}&\cdots&x_{nm}\end{bmatrix} \tag{29}$$
The variables in the matrix are standardized as follows:

$$\tilde{x}_{ij}=\frac{x_{ij}-\mu_j}{\delta_j} \tag{30}$$

where x_{ij} is an element of the matrix, and μ_j, δ_j are respectively the mean and variance of the jth feature vector. After processing, the standardized feature matrix $\tilde{X}_k$ is obtained, with corresponding covariance matrix C_k; decomposing its eigenvalues yields the eigenvectors θ_j. The principal-component scores are then calculated respectively, namely:

$$p_{ij}=\tilde{x}_i\,\theta_j \tag{31}$$

where $\tilde{x}_i$ is the ith sample of the standardized feature matrix and θ_j is the jth eigenvector. The two principal components p_1 and p_2 corresponding to the largest eigenvalues are selected, and their rectangular coordinate axes describe the feature-space distribution on a two-dimensional plane; the maxima p_{1,max} and p_{2,max} of the two groups of principal components serve as the boundary of the spatial distribution. The principal-component scores $\hat{p}_1$, $\hat{p}_2$ of the generated fragments follow the boundary constraints:

$$\hat{p}_{1,\max}\le p_{1,\max}\quad\text{and}\quad\hat{p}_{2,\max}\le p_{2,\max} \tag{32}$$

The above formula shows that the maximum principal-component score of each sample of the generated fragment must be smaller than the corresponding maximum principal-component score of the target fragment. Sample sets that do not meet the boundary constraint condition are reconstructed, and the generated samples are replaced until the generated sample space reaches N_k; the generation process is then complete and the new feature fragments $X_G^k$ are obtained.
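Assuming NumPy is available, the principal-component boundary screening described above can be sketched as follows (a minimal illustration of the boundary test only, not the full replace-and-regenerate loop):

```python
import numpy as np

def pc_boundary(X):
    """Standardize X, take the two eigenvectors of the covariance
    matrix with the largest eigenvalues, and return the maxima of the
    two principal-component score vectors (the spatial boundary)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    cov = np.cov(Z, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)            # ascending eigenvalues
    top2 = vecs[:, np.argsort(vals)[::-1][:2]]  # two leading components
    scores = Z @ top2                           # principal-component scores
    return scores.max(axis=0)                   # (p1_max, p2_max)

def within_boundary(generated, original):
    """Generated scores must not exceed the original fragment's boundary."""
    return bool(np.all(pc_boundary(generated) <= pc_boundary(original) + 1e-9))

rng = np.random.default_rng(0)
orig = rng.normal(size=(50, 3))
ok = within_boundary(orig, orig)   # a fragment trivially bounds itself
```

Generated fragments failing this check would be discarded and regenerated, as described above.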
Therefore, through the above analysis, the generator G and the discriminator D of the invention are structurally designed as follows: both G and D remove deconvolution and retain only ordinary convolution layers; up- and down-sampling are realized by UpSampling2D and AvgPooling2D; the convolution kernel size is 3×3 with a stride of 1. G uses the Tanh activation function for its last layer and ReLU for the rest; D uses ReLU as the activation function. The Batch Normalization (BN) layer in the generator G model normalizes the hidden-layer input; the specific model diagrams are shown in fig. 7 and fig. 8, respectively.
3.2 Integrated dynamic ELM model classification prediction
After the data generation with the two dynamic ELMs above, the invention continues with the Adaboost ensemble-learning idea to combine the individual prediction models into a strong classifier. The individual prediction model is an ELM structure, also known as a base learner. Assume the initial sample fragment is (X_0, Y_0); the mapping relationship between feature x and label y is as follows:

$$y_i=\sum_{p=1}^{L}\beta_p\,g\big(a_p x_i+b_p\big),\quad i=1,2,\ldots,N_0 \tag{33}$$

where N_0 represents the number of samples in the initial stage; L represents the number of hidden nodes; y_i is the ith label; β_p is the output weight; and b_p is the bias of the base learner. Selecting different weights a and biases b yields the expression of the individual prediction model, which is as follows:

$$Y_0=H_0\beta_0 \tag{34}$$

where H_0 satisfies

$$H_0=\begin{bmatrix}g(a_1x_1+b_1)&\cdots&g(a_Lx_1+b_L)\\\vdots&\ddots&\vdots\\g(a_1x_{N_0}+b_1)&\cdots&g(a_Lx_{N_0}+b_L)\end{bmatrix} \tag{35}$$

In the formula: H_0 is a random matrix and β_0 is a weight parameter.
According to the approximation-error theory, the output-vector expression is obtained as

$$\beta_0=H_0^{\dagger}Y_0 \tag{36}$$

where $H_0^{\dagger}$ is the M-P generalized inverse, which meets the necessary condition of the minimum-norm least-squares solution and has the expression

$$H_0^{\dagger}=\big(H_0^{T}H_0\big)^{-1}H_0^{T} \tag{37}$$
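The base-learner solution above is a single least-squares solve. Assuming NumPy, a minimal ELM sketch with random input weights, a sigmoid hidden layer, and pseudo-inverse output weights (all names and the toy target are illustrative):

```python
import numpy as np

def elm_fit(X, Y, L, seed=0):
    """Random hidden-layer matrix H, then output weights via the
    M-P generalized inverse: beta = pinv(H) @ Y."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(X.shape[1], L))      # random input weights a
    b = rng.normal(size=L)                    # random biases b
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))    # sigmoid hidden layer
    beta = np.linalg.pinv(H) @ Y              # M-P generalized inverse
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
Y = X @ np.array([1.0, -2.0, 0.5])            # a toy regression target
A, b, beta = elm_fit(X, Y, L=30)
err = float(np.mean((elm_predict(X, A, b, beta) - Y) ** 2))
```

Because the pseudo-inverse minimizes the residual, the training error can never exceed the error of the trivial zero solution; no iterative training is needed, which is what makes the dynamic ELM cheap to refit.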
The evaluation of the current model is described by the error rate χ_0, specifically

$$\chi_0=\frac{1}{N_0}\sum_{i=1}^{N_0}f\big(\hat{y}_i\ne y_i\big) \tag{38}$$

where f(·) is a typical indicator function with return value 0 or 1, N_0 represents the number of samples in the initial stage, and Y_0 is the original output. For the R base learners, the prediction error of each base learner is calculated respectively; the error χ_{0,r} of the rth (r = 1, 2, 3, …, R) base learner is:

$$\chi_{0,r}=\frac{1}{N_0}\sum_{i=1}^{N_0}f\big(\hat{y}_{i,r}\ne y_i\big) \tag{39}$$
In the Adaboost strategy, each base learner has a weight ω_{0,r}. Considering both the correctly and the wrongly predicted cases, the overall loss function Loss is as follows:

$$Loss=\sum_{r=1}^{R}\left[\big(1-\chi_{0,r}\big)e^{-\omega_{0,r}}+\chi_{0,r}\,e^{\omega_{0,r}}\right] \tag{40}$$

where χ_{0,r} denotes the error rate of the rth base learner, H_{0,r} is the random matrix of the rth base learner, and β_{0,r} is its weight parameter. When the overall loss function Loss is minimized, setting the partial derivative ∂Loss/∂ω_{0,r} of the above formula to zero yields the weight expression corresponding to the rth base learner as

$$\omega_{0,r}=\frac{1}{2}\ln\frac{1-\chi_{0,r}}{\chi_{0,r}} \tag{41}$$
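Minimizing the loss above gives the classic Adaboost closed form for the learner weight; as a sketch:

```python
import math

def base_learner_weight(error_rate):
    """omega = 0.5 * ln((1 - chi) / chi): a learner better than chance
    (chi < 0.5) receives positive weight, a chance-level learner zero."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

weights = [base_learner_weight(chi) for chi in (0.1, 0.5, 0.4)]
```

The more accurate the base learner, the larger its vote in the weighted combination.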
After the weights of all R base classifiers are calculated, the final output expression is obtained by weighted learning:

$$Y^{*}=\sum_{r=1}^{R}\omega_{0,r}H_{0,r}\beta_{0,r} \tag{42}$$

In the formula: ω_{0,r} is the base-learner weight, H_{0,r} is the random matrix of the rth base learner, and β_{0,r} is the weight parameter of the rth base learner.
When the data fragment (X_k, Y_k) arrives, the model weights are adjusted according to the prediction results of the ensemble model. For all R base learners, the current kth-fragment prediction error can be expressed as follows:

$$e_{k,r}^{i}=\hat{Y}_{k,r}^{i}-Y_{k,r}^{i},\quad i=1,2,3,\ldots,N_k \tag{43}$$

where χ_{k,r} denotes the prediction error of the kth fragment at the rth base learner, composed of the per-sample errors $e_{k,r}^{i}$; $\hat{Y}_{k,r}^{i}$ is the predicted value of the ith sample of the kth fragment at the rth base classifier; $Y_{k,r}^{i}$ is the corresponding true value; and i = 1, 2, 3, …, N_k is the sample index. Each element of the error matrix obtained by normalization is

$$e_{i,r}=\frac{\left|e_{k,r}^{i}\right|}{\sum_{i=1}^{N_k}\left|e_{k,r}^{i}\right|}$$

At this time the information entropy IE_r is used to quantify the overall degree of fit and analyze the influence of the errors on the model; the concrete expression is

$$IE_r=-\frac{1}{\ln N_k}\sum_{i=1}^{N_k}e_{i,r}\ln e_{i,r} \tag{44}$$

In the formula: e_{i,r} are the elements of the error matrix obtained by normalization.
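The entropy quantification over normalized per-sample errors can be sketched as follows; a uniform error spread yields maximum entropy, i.e. the least informative learner (names are illustrative):

```python
import math

def information_entropy(errors):
    """Entropy of the normalized error distribution, scaled by
    1/ln(N) so the result lies in [0, 1]."""
    total = sum(errors)
    probs = [e / total for e in errors]
    n = len(errors)
    return -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)

uniform = information_entropy([1.0, 1.0, 1.0, 1.0])    # evenly spread errors
peaked = information_entropy([1.0, 1e-9, 1e-9, 1e-9])  # one dominant error
```

A learner whose errors are concentrated on a few samples scores low entropy and, via the weighting below, retains more influence than one whose errors are spread uniformly.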
Synthesizing the information entropies of all individual prediction models, the weight-vector expression is determined and can be obtained as

$$\omega_{\chi,r}=\frac{1-IE_r}{\sum_{r=1}^{R}\big(1-IE_r\big)} \tag{45}$$

Meanwhile, based on the current base-learner weight vector ω_{k,r}, the weight vectors ω_{χ,r} and ω_{k,r} serve as the two decision parameters of the overall model weight, and the basic weight is obtained as

$$\omega_r=\sum_{p=1}^{P}s_p\,\omega_{p,r} \tag{46}$$

where s_p are optimization coefficients and P denotes the number of decision parameters. Using game theory, the playing process of the two parameters can be converted into an optimization process whose target is the minimum of the weight difference, as follows:

$$\min\left\|\sum_{p=1}^{P}s_p\,\omega_{p,r}-\omega_{q,r}\right\|,\quad q=1,2,\ldots,P \tag{47}$$

Further, s_p is standardized to obtain

$$s_p^{*}=\frac{s_p}{\sum_{p=1}^{P}s_p} \tag{48}$$

The final overall model weight is

$$\omega_r^{*}=\sum_{p=1}^{P}s_p^{*}\,\omega_{p,r} \tag{49}$$
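For the two decision parameters ω_χ and ω_k, the game above reduces to a small linear system in the coefficients s_p. A sketch with a closed-form 2×2 solve (this particular normal-equation form is one common way to realize the combination step, shown here as an illustration):

```python
def combine_weights(w_chi, w_k):
    """Game-theoretic combination of two weight vectors: solve
    A s = diag(A) with A[p][q] = w_p . w_q, standardize s, and
    return the combined weight vector."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    a11, a12 = dot(w_chi, w_chi), dot(w_chi, w_k)
    a21, a22 = dot(w_k, w_chi), dot(w_k, w_k)
    det = a11 * a22 - a12 * a21
    s1 = (a11 * a22 - a12 * a22) / det     # Cramer's rule, 2x2 system
    s2 = (a11 * a22 - a21 * a11) / det
    s1, s2 = s1 / (s1 + s2), s2 / (s2 + s1)  # standardize the coefficients
    return [s1 * a + s2 * b for a, b in zip(w_chi, w_k)]

w = combine_weights([0.6, 0.3, 0.1], [0.2, 0.5, 0.3])
```

Since both input vectors sum to one and the standardized coefficients sum to one, the combined weights again sum to one.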
On the basis of the current parameters, for the (k+1)th fragment feature set X_{k+1} and label set Y_{k+1}, the output-layer vector expression is

$$Y_{k+1}=\sum_{r=1}^{R}\omega_r^{*}H_{k+1,r}\beta_{k,r} \tag{50}$$

where r = 1, 2, …, R indexes the R base classifiers; the updated classification prediction model is obtained as

$$Y^{*}=\sum_{r=1}^{R}\omega_r^{*}H_{k+1,r}\beta_{k+1,r} \tag{51}$$
In summary, the invention realizes the detection and classification of multi-class network traffic feature data through a two-step game strategy; considering the characteristics of the data-generation process and the model-updating process, the flow chart of the specific implementation steps of the proposed model is shown in fig. 9.
3.3 bidirectional residual GRU and integrated dynamic ELM network abnormal flow detection
The invention finally designs the network abnormal-traffic detection model based on the bidirectional residual GRU and the integrated dynamic ELM, as shown in fig. 10. After the features of the network data set are selected by the improved Fisher Score and maximum-information-coefficient method, the output is used as the input of an SVM to judge the accuracy rate and thereby determine whether it is the optimal subset. The output Y of the optimal subset is input into the bidirectional residual GRU framework for feature detection and extraction, and residual information is introduced to make the network more sensitive to gradient changes, thereby achieving the purpose of alleviating network degradation. Finally, according to the two-step-game integrated dynamic ELM method, the problems of data-label balance deviation and inaccurate data reconstruction are solved and the defect of poor model adaptability is avoided; a suitable generator and discriminator are designed to improve the fitting effect of the model on rapidly changing data, while Batch Normalization and the ReLU activation function are used for further optimization to reduce the defects of the deep network, and finally the classification result Y* is obtained. In the training process, a batch training method is adopted to train G and D alternately. When training G, the parameters of D are fixed; random noise Z and random condition vectors of size batch-size are acquired and input into G to generate samples of size batch-size (64); the generated samples are first passed into the dynamically integrated ELM, then enter D for training, and the parameters of G are updated by back-propagation.
When training D, the parameters of G are fixed; data of size batch-size are acquired from the training samples and input into D to obtain the corresponding loss function; then noise Z and random condition vectors of size batch-size are acquired, the samples of size batch-size generated by G are passed into D to obtain the corresponding loss function, and the parameters of D are updated by back-propagation. The alternate training of G and D is repeated until the network training is finished; one complete pass over the training samples completes one round of training, and the relevant parameters of G and D are saved.
The invention has the beneficial effects that:
(1) The invention provides a feature screening method for preprocessing network traffic features. In the first stage, the importance of all features in the feature set is evaluated by the Fisher Score, and the features are sorted accordingly; in the second stage, the maximum information coefficient is used to evaluate the correlation between features, thereby determining the redundant features and adjusting the ranking result again. Finally, the feature subset is selected according to the classification accuracy of the SVM learning algorithm. Compared with the traditional practice of applying only normalization to network traffic features, the method provided by the invention overcomes problems such as excessive time complexity and local-optimum defects; the obtained network traffic data contain fewer redundant features, and using them for data training yields higher classification accuracy.
(2) The invention improves the GRU structure. By modifying the candidate-hidden-state activation function and adding residual connections, the gradient vanishing and network degradation of the original GRU are effectively resolved, and the structure becomes more robust to long-sequence features, giving superior detection performance on network traffic features. Meanwhile, for the potential gradient-explosion hazard possibly introduced by the non-saturating activation function, the invention adopts a batch-normalization method; on this basis, the model is further optimized and designed as a bidirectional residual GRU structure for extracting network traffic features. The gradient vanishing brought by saturating activation functions degrades the detection and extraction performance of a neural network; although the LSTM and GRU can alleviate the gradient-vanishing problem compared with the traditional RNN, the relief is very limited, so the invention extracts network traffic features with better performance than the traditional techniques.
(3) Aiming at the low classification accuracy caused by imbalanced network data in classification prediction, the invention provides a two-step-game integrated dynamic ELM method for data generation and augmentation. The method adopts a combined strategy of data processing and model updating so as to automatically match structural changes of the samples. In the data-processing stage, minority-class sample fragments are generated by the generative-adversarial dynamic ELM model, balancing the distribution of different classes of samples and solving the problems caused by uneven samples; the data-generation method of the invention integrates the game-adversarial strategy and principal-component-analysis threshold judgment to ensure the authenticity of each sample fragment. In the model-updating stage, the relationship between the new weights and the individual models is established according to the degree of loss and the initial weights, and the combined weights of the ensemble model are calculated by game theory, forming a stable network architecture, improving the fitting effect of the model on rapidly changing data, and further improving the final classification accuracy.
(4) For network abnormal-traffic detection, detection accuracy is improved by combining a preprocessing model, a feature-extraction model, a data-generation model, and a classification model. Specifically, the preprocessing model screens traffic features effectively; the feature-extraction model further extracts features from the screening result; the data-generation model reduces the problems caused by sample imbalance by generating minority-class sample fragments; and the classification model combines multiple base classifiers into a strong classifier through ensemble learning, improving classification accuracy. The several models used for detecting abnormal network traffic are combined into an organic whole and jointly achieve the technical effect of improving detection accuracy.
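As a hedged illustration of the modification described in item (2), the sketch below shows one forward step of a residual GRU cell in numpy. The stacked parameter layout, dimensions, and the choice of ReLU as the non-saturating candidate activation are assumptions for this sketch; the patent's actual unit (see FIG. 3) may differ in detail.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual_gru_step(x, h_prev, W, U, b):
    """One forward step of a residual GRU cell (illustrative sketch).

    Two departures from the standard GRU, following the text:
      * the candidate hidden state uses non-saturating ReLU instead of tanh;
      * the input x is added back onto the new state (residual connection),
        which assumes the input and hidden dimensions match.
    W, U are (3*d, d) matrices and b a (3*d,) vector holding the update-gate,
    reset-gate, and candidate parameters stacked row-wise.
    """
    d = h_prev.shape[0]
    z = sigmoid(W[:d] @ x + U[:d] @ h_prev + b[:d])                   # update gate
    r = sigmoid(W[d:2*d] @ x + U[d:2*d] @ h_prev + b[d:2*d])          # reset gate
    h_cand = np.maximum(0.0, W[2*d:] @ x + U[2*d:] @ (r * h_prev) + b[2*d:])  # ReLU candidate
    h_new = (1.0 - z) * h_prev + z * h_cand                           # standard GRU blend
    return h_new + x  # residual shortcut; batch normalization would follow in a full model
```

In a full model, batch normalization would be applied after this step to counter the gradient-explosion risk noted above.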
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the embodiments are briefly described below. It is obvious that the drawings described below show only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow diagram of network traffic feature preprocessing and selection;
FIG. 2 is a schematic diagram of a GRU unit structure;
FIG. 3 is a schematic diagram of a residual GRU unit structure;
FIG. 4 is a schematic diagram of a bidirectional GRU variant structure;
FIG. 5 is a schematic representation of a bi-directional residual GRU feature detection flow;
FIG. 6 is a schematic diagram of an integrated dynamic data generation framework;
FIG. 7 is a schematic diagram of a generator architecture;
FIG. 8 is a schematic diagram of a discriminator architecture;
FIG. 9 is a schematic diagram of an integrated dynamic ELM sample classification flow;
FIG. 10 is a schematic diagram of network anomaly traffic detection and classification for bidirectional residual GRU and integrated dynamic ELM;
FIGS. 11a-d are graphs comparing accuracy for different combinations of training set and test set in accordance with an embodiment of the present invention, wherein FIG. 11a is a graph comparing accuracy for 50% training set + 50% test set, FIG. 11b is a graph comparing accuracy for 60% training set + 40% test set, FIG. 11c is a graph comparing accuracy for 70% training set + 30% test set, and FIG. 11d is a graph comparing accuracy for 75% training set + 25% test set;
FIGS. 12a-d are graphs showing the relationship between the learning rate, the base-learner weight, and the model loss rate for different numbers of training iteration rounds, wherein FIG. 12a corresponds to 10 rounds, FIG. 12b to 20 rounds, FIG. 12c to 25 rounds, and FIG. 12d to 30 rounds;
FIGS. 13a-b show ROC curves of two traffic types (Normal traffic Normal and Attack traffic Attack) under different detection methods, respectively, where FIG. 13a shows ROC curves of the Normal traffic Normal under different detection methods, and FIG. 13b shows ROC curves of the Attack traffic Attack under different detection methods;
fig. 14a-d respectively show ROC curve comparison diagrams of Probe, DoS, R2L and U2R attack traffic data in NSL-KDD data sets under different detection methods, wherein fig. 14a shows ROC curve comparison diagrams of Probe attack traffic data under different detection methods, fig. 14b shows ROC curve comparison diagrams of DoS attack traffic data under different detection methods, fig. 14c shows ROC curve comparison diagrams of R2L attack traffic data under different detection methods, and fig. 14d shows ROC curve comparison diagrams of U2R attack traffic data under different detection methods.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to examples, but it will be understood by those skilled in the art that the following examples are only illustrative of the present invention and should not be construed as limiting the scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
With reference to fig. 10, a method for detecting abnormal network traffic according to an embodiment of the present invention is described below. One embodiment of the method includes:
101. Experimental dataset selection
The simulation experiment data are drawn from the NSL-KDD data set. This data set effectively resolves the inherent data-redundancy problem of the KDDCup99 data set; the numbers of records in the training and test sets are reasonable; and a new difficulty-level attribute is added, under which the number of records kept for each connection is inversely proportional to its proportion in the original KDD data set. This highlights the differences in classification rate between machine learning methods and makes it easier to evaluate the efficiency of different learning techniques. The training set contains more than 5 million network connection records from the first 7 weeks, stored as binary TCPdump compressed data exceeding 4 GB; the test set contains more than 2 million network connection records from the last 2 weeks, with each network data flow recorded from src to dst. The records are classified as follows: a normal connection record is labeled normal, otherwise it is labeled as an intrusion (attack) of a definite type, as shown in table 1.
TABLE 1NSL-KDD dataset
(Table 1 is reproduced as an image in the original document.)
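Table 1 groups the raw NSL-KDD connection labels into Normal plus the four broad attack categories (Probe, DoS, R2L, U2R). The sketch below shows a small, illustrative subset of that commonly used mapping; the full data set contains further attack names not listed here.

```python
# Partial mapping of NSL-KDD connection labels to the four broad attack
# categories plus Normal (illustrative subset of the full label list).
ATTACK_CATEGORY = {
    "normal": "Normal",
    "neptune": "DoS", "smurf": "DoS", "back": "DoS", "teardrop": "DoS",
    "ipsweep": "Probe", "portsweep": "Probe", "nmap": "Probe", "satan": "Probe",
    "guess_passwd": "R2L", "ftp_write": "R2L", "warezclient": "R2L",
    "buffer_overflow": "U2R", "rootkit": "U2R", "loadmodule": "U2R",
}

def categorize(label):
    """Map a raw connection label to Normal / Probe / DoS / R2L / U2R."""
    return ATTACK_CATEGORY.get(label, "unknown")
```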
102. Feature optimization selection and dataset preprocessing
In the first stage, feature importance is analyzed: the Fisher Score of each feature is computed with the improved algorithm and the features are ranked by this score. In the second stage, redundancy among features is evaluated with the maximal information coefficient and the features are reordered; the result is fed to a support vector machine, the feature subset is expanded one feature at a time with a forward-addition strategy up to the last feature, and classification accuracy is used as the criterion for selecting the subset. The subset passing both stages is taken as the final feature subset.
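The first-stage ranking builds on the classical per-feature Fisher score, which can be sketched as follows (this is the basic form; the improved algorithm mentioned above, the maximal-information-coefficient reordering, and the SVM-based forward addition are not reproduced here):

```python
import numpy as np

def fisher_score(X, y):
    """Classical per-feature Fisher score: between-class scatter of the
    class-wise feature means divided by the within-class variance.
    Larger scores indicate more discriminative features."""
    mu = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)  # small epsilon avoids division by zero
```

Ranking features by this score (descending) gives the first-stage ordering that the second stage then refines.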
For the NSL-KDD data set, verification analysis is carried out under different combinations of training and test sets. Specifically: each feature selection algorithm selects the corresponding sample features on the training set; only the selected features are then kept in the test set; a support vector machine (SVM) predicts the test samples; and the prediction accuracy gives the classification result of the experiment. Five comparison algorithms are selected: the efficient robust feature selection algorithm (RFS), the Laplacian score method (Laplacian Score), locality-sensitive semi-supervised feature selection (LSDF), semi-supervised feature selection based on relevance and redundancy criteria (RRPC), and the rescaled-linear-regression semi-supervised feature selection algorithm (RLSR). Because samples are selected randomly, which may make classification accuracy unstable, each experiment is repeated 20 times to obtain reliable results, and the average is used as the comparison result. FIGS. 11a-d show the accuracy comparisons for 50% training set + 50% test set, 60% training set + 40% test set, 70% training set + 30% test set, and 75% training set + 25% test set.
FIGS. 11a-d show the accuracy comparison for the different combinations of training and test sets. As seen in FIGS. 11a-d, the feature selection algorithm proposed by the present invention outperforms the compared algorithms under all four combinations. Comparing FIGS. 11a-d further shows that the combination of 75% training set and 25% test set gives the best accuracy for every method; moreover, as the number of selected features increases, the classification accuracy of the proposed method also improves. This indicates that the improved Fisher Score and maximal-information-coefficient feature selection method can use optimal feature-subset selection to raise classification accuracy, verifying the effectiveness of the method of the invention.
103. Bidirectional residual GRU extraction network flow characteristic
Feature selection is performed on the network data set with the Fisher Score and maximal-information-coefficient preprocessing method to obtain a redundancy-free subset of the best features. The feature-set data then enter the bidirectional residual GRU network structure for feature extraction and detection; a dropout layer is added to prevent overfitting and accelerate training; the multidimensional input is then flattened by one Flatten layer for the transition from the convolutional layer to the fully connected layers; finally, two fully connected layers integrate all the network layers.
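The bidirectional pass and the Flatten step above can be sketched as follows. The recurrent cell is a deliberately simple stand-in (where the residual GRU unit would go); the function names and shapes are assumptions for illustration.

```python
import numpy as np

def bidirectional_scan(seq, step, h0):
    """Run a recurrent step over the sequence forwards and backwards and
    concatenate the two hidden-state streams per time step, as a
    bidirectional recurrent layer does: (T, d) in, (T, 2*d) out."""
    fwd, h = [], h0
    for x in seq:
        h = step(x, h)
        fwd.append(h)
    bwd, h = [], h0
    for x in seq[::-1]:          # same cell run over the reversed sequence
        h = step(x, h)
        bwd.append(h)
    return np.concatenate([np.stack(fwd), np.stack(bwd[::-1])], axis=1)

def toy_step(x, h):
    """Stand-in recurrent cell; the residual GRU unit would be used here."""
    return np.tanh(0.5 * h + x)

def flatten_features(hidden_states):
    """Flatten-layer analogue: collapse (T, 2*d) features to one vector
    for the transition to the fully connected layers."""
    return hidden_states.ravel()
```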
104. Integrated dynamic ELM optimization
FIGS. 12a-d compare the effect of the learning rate and the base-learner weights on the model training loss rate for 10, 20, 25, and 30 training iteration rounds, respectively. As seen in FIGS. 12a-d, the number of training iterations and the learning rate strongly influence model training: when the number of rounds is small (FIG. 12a) the model underfits, giving a large loss rate, and when it is large (FIG. 12d) the model overfits, also giving a large loss rate. The four panels also show that when the learning rate is too small, convergence is guaranteed but optimization slows, and more iterations are needed to reach a satisfactory optimum; when the learning rate is too high, the parameters oscillate around the optimum and cannot find it, so training is poor and the loss rate is high. FIGS. 12a-d further show that in the proposed integrated dynamic ELM the base-learner weight parameters strongly influence training: when the base-learner weight is small, the training loss is large, so choosing the weights properly is essential. Based on this analysis, the method uses 25 training rounds, at which the training loss is minimal and the training effect is best.
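The integrated dynamic ELM builds on the basic extreme learning machine, whose core — random, fixed hidden weights plus a closed-form least-squares output layer — can be sketched as follows. The game-theoretic weight updating and adversarial data generation described earlier are omitted, and all names and dimensions here are illustrative assumptions.

```python
import numpy as np

def train_elm(X, Y, n_hidden, seed=0):
    """Basic ELM fit: random fixed input weights and biases, hidden-layer
    output matrix H, and output weights beta solved in closed form via
    the Moore-Penrose pseudo-inverse (no iterative training)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)        # hidden-layer activations
    beta = np.linalg.pinv(H) @ Y  # least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```

In the integrated dynamic version, several such base learners would be combined with game-theoretically derived weights rather than used singly.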
105. Network abnormal traffic classification prediction
(1) Two-class verification
In the experiment, four kinds of attacks of the NSL-KDD data set are combined into Attack, Normal flow is recorded as Normal, a two-classification comparison experiment is carried out, and ROC curves of two flow types under different detection methods are shown in FIGS. 13 a-b.
As seen in FIGS. 13a-b, whether normal or attack data are being detected, the AUC values of the method of the invention are the best under every algorithm comparison. The AUC values of the 4 traditional machine learning methods are all low, because traditional machine learning handles network traffic feature data only moderately well, can detect and extract fewer features, and classifies traffic types poorly. In most cases the AUC values of the KNN, DT, and RF methods are lower than that of the SVM, because the SVM is better suited to binary classification. Among the 9 deep learning methods, CNN gives the worst AUC, while RNN, DBN, and LSTM give similar values, all below 0.85; these three methods are prone to gradient vanishing during classification, so their classification effect is poor. GRU, ELM, FARF-OSKBIELM, and Hessian-ELM are classification methods based on semi-supervised learning, so their AUC values are better than those of the preceding methods, but still slightly below the AUC of the method of the invention. This fully illustrates the impact of ELM's latent parameter problems on classification performance and highlights the necessity of the method of the invention. It verifies that the bidirectional-residual-GRU-based integrated dynamic ELM abnormal-traffic detection method has better detection performance for binary intrusion detection.
Table 2 compares the accuracy, true-positive rate, false-alarm rate, F value, and AUC value of each algorithm for the binary classification. As seen in table 2, the accuracy of the four traditional machine learning methods is low: the best is the SVM, with an average accuracy of 81.679%, still 9.61% below the method of the invention, and the worst is KNN, with an average accuracy of only 65.609%, 25.68% below. Although the DT and RF methods detect Attack traffic with higher accuracy, their accuracy on Normal traffic is low, so their stability is mediocre and the classification effect poor. Among the 9 deep learning methods, CNN and RNN fail to reach 80% detection accuracy, with average detection rates of 70.315% and 77.274%; both suffer from training losses of their own neural networks. Although the remaining 7 methods all exceed 80% accuracy, the highest is the method of the invention, with an average detection accuracy of 91.289%, above the other deep learning methods. For recall, among the four machine learning methods only the SVM performs well, with an average recall of 82.161%, higher than the CNN and RNN methods but lower than the remaining 7 deep learning methods. Among those 7, the recall rates of DBN, LSTM, and GRU are similar, averaging 84.752%, 84.982%, and 88.719%, respectively 8.456%, 8.266%, and 4.489% below the method of the invention.
The ELM, FARF-OSKBIELM, and Hessian-ELM methods are all ELM-related; of these, Hessian-ELM has the highest average recall, 91.076%, still 2.132% below the method of the invention, so a comprehensive comparison of recall rates shows the method's clear superiority. For the false-alarm rate, every method yields a lower false-alarm rate on Attack traffic than on Normal traffic, because the test set is randomly drawn from the data set and Attack traffic makes up a larger proportion than Normal traffic. Compared with the other detection methods, the method of the invention has the lowest false-alarm rate, averaging 1.846%, far below the 4 traditional machine learning methods and the CNN and RNN deep learning methods; it is also 1.918% below the GRU method, the best of DBN, LSTM, and GRU, and 1.607% below the Hessian-ELM method, the best of ELM, FARF-OSKBIELM, and Hessian-ELM. Because the invention selects network traffic features with the improved Fisher Score and maximal-information-coefficient method and updates and optimizes data generation with the game-integrated dynamic ELM method, its overall false-alarm rate is the lowest. For the harmonic mean F-measure, the values obtained by the traditional machine learning methods are mediocre, the worst being KNN at an average of only 69.557%; the F-measure of the proposed model is the best, with the largest average harmonic value among the 9 deep learning methods, 92.239%, above the other detection methods. For the AUC values, FIGS. 13a-b and table 2 show that the AUC of the invention is better because a proper number of training rounds is selected during training, so the data loss rate is smaller. In summary, the comparison of these parameters verifies the superiority of the method of the invention on the binary classification task, and thus its overall performance in network abnormal-traffic detection.
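The evaluation quantities used throughout these tables — accuracy, recall (detection rate), false-alarm rate, and F-measure — follow from the binary confusion matrix and can be sketched as below; this is a generic formulation, not the patent's evaluation code.

```python
import numpy as np

def detection_metrics(y_true, y_pred):
    """Binary detection metrics as used in the comparison tables:
    accuracy, recall (true-positive rate), false-alarm rate
    (false-positive rate), and F-measure (harmonic mean of
    precision and recall). Labels: 1 = attack, 0 = normal."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / y_true.size,
        "recall": recall,
        "false_alarm": fp / (fp + tn),
        "f_measure": 2 * precision * recall / (precision + recall),
    }
```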
TABLE 2 Performance evaluation index for each algorithm
(Table 2 is reproduced as an image in the original document.)
(2) Multi-class verification
To verify the classification performance of the invention on the multi-class task, Normal, Probe, DoS, R2L, and U2R in the NSL-KDD data set are each treated as a separate class, and ROC curve comparisons of the four attack traffic types under different detection methods are given (for reasons of space, the ROC curves for Normal are omitted), as shown in FIGS. 14a-d.
From FIGS. 14a-d it can be seen that, among the four Attack traffic types, Probe traffic gives the best AUC and R2L traffic the worst, because R2L traffic is unauthorized remote-access attack traffic whose characteristics are hard to detect and yield few useful features, so its classification effect is poor. The ROC curves of the four Attack types also show that the AUC values of the traditional machine learning methods are all low, for the reasons discussed for the binary classification task. Among the 9 deep learning methods, the AUC values of the method of the invention are the best; except for R2L traffic, the AUC of the other three Attack types is above 0.85, highlighting the effectiveness of the method and verifying its effectiveness and superiority on multi-class tasks.
Tables 3-7 compare the accuracy, true-positive rate, false-alarm rate, F value, and AUC value of the 5 traffic types under the different algorithms. As seen in table 3, for Normal traffic the parameter values of the 4 machine learning methods are all low. The worst performer is the KNN method, whose accuracy and recall are only 53.483% and 69.277%, with a false-alarm rate as high as 28.121%; the remaining three machine learning methods average 68.845% accuracy and 77.264% recall, but also average a 16.240% false-alarm rate, a 72.764% F-measure, and an AUC of 0.625, so their classification effect is mediocre. The classification performance of the method of the invention is the best among the 9 deep learning methods: all its parameter values are optimal and substantially improved, so it classifies Normal traffic better.
TABLE 3 Performance evaluation index of each algorithm on Normal
(Table 3 is reproduced as an image in the original document.)
As seen in table 4, for Probe traffic detection the SVM performs best among the four machine learning methods, with detection accuracy and recall of 82.336% and 82.139% and F-measure and AUC of 82.237% and 0.837; its classification performance is good, but its false-alarm rate reaches 9.667%, so its classification stability is poor. The classification index parameters of the 9 deep learning methods are all higher than those of the 4 machine learning methods, because neural network training can extract the characteristics of Probe traffic to the greatest extent. In comparison, however, the detection accuracies of the CNN, RNN, DBN, and LSTM methods are all below 90%: their average detection accuracy is 87.723%, 5.401% below the method of the invention; their average recall is 88.199%, 4.17% below; their average F-measure is 87.960%, 4.785% below; and their average AUC is 0.888, 0.059 below. This is because these four neural network methods overfit during training, giving poor training results and poor classification performance.
The detection accuracies of the GRU, ELM, FARF-OSKBIELM, and Hessian-ELM methods are all above 90%. The best detection accuracy and AUC belong to the Hessian-ELM method, at 92.363% and 0.934, which are 0.761% and 0.13 below the method of the invention; the best recall and F-measure belong to the FARF-OSKBIELM method, at 92.362% and 92.173%, which are 0.07% and 0.572% below the method of the invention. Meanwhile, the false-alarm rate of the method of the invention is the lowest, 0.696%, verifying its effectiveness for Probe traffic detection.
TABLE 4 Performance evaluation index of each algorithm on Probe
(Table 4 is reproduced as an image in the original document.)
As seen in table 5, for DoS traffic detection the four machine learning methods perform poorly: their detection accuracy, recall, and F-measure are all below 80% and their AUC below 0.8, so their classification performance is generally below that of the 9 deep learning methods and far below that of the method of the invention. The four machine learning methods average 70.477% detection accuracy, 73.904% recall, a 72.178% F-measure, and an AUC of 0.745, respectively 22.509%, 19.454%, 20.994%, and 0.186 below the method of the invention, while their false-alarm rate averages as high as 16.907%, 15.824% above the method of the invention. Among the 9 deep learning methods, the detection performances of CNN, RNN, DBN, LSTM, and GRU are similar; consistent with the reasons above, these methods overfit and suffer network degradation during training, so their classification performance is worse than that of the other four methods, averaging 84.377% detection accuracy, 86.334% recall, an 85.344% F-measure, and an AUC of 0.867, all below the method of the invention. The accuracy, recall, and F-measure of the ELM, FARF-OSKBIELM, and Hessian-ELM methods all exceed 90% but remain below the method of the invention, and their false-alarm rate averages 3.078%, 1.995% above the method of the invention, so their stability is weaker.
The method of the invention optimizes the detection model with the residual structure, relieving its overfitting and network degradation, and therefore has the best classification performance on DoS traffic.
TABLE 5 Performance evaluation index of each algorithm on DoS
(Table 5 is reproduced as an image in the original document.)
As seen in table 6, for U2R traffic detection the KNN method performs worst among the 4 traditional machine learning methods, with accuracy and recall of only 59.332% and 65.071%, an F-measure of 62.069%, an AUC of 0.667, and a false-alarm rate as high as 22.692%; the classification parameters of the remaining three machine learning methods are also below those of the 9 deep learning methods, so their stability is poor and their classification performance unsatisfactory. In contrast, the detection accuracy, recall, and F-measure of the 9 deep learning methods all exceed 80%. The relatively worst is the CNN method, with 83.618% accuracy, 8.401% below the method of the invention, and its other parameters far below as well, because as the training network deepens CNN falls into a local optimum and deviates from the global optimum, giving a poor training effect and poor classification performance. The performances of RNN, DBN, LSTM, GRU, and ELM are similar, averaging 87.681% accuracy, 87.125% recall, an 87.401% F-measure, and an AUC of 0.884, all below the method of the invention. Although the classification parameters of the FARF-OSKBIELM and Hessian-ELM methods are good, detection becomes unstable and the false-alarm rate too high when the number of detection samples is small; this also demonstrates the necessity of introducing the integrated dynamic ELM for data generation and verifies the effectiveness of U2R detection.
Table 6 performance evaluation index of each algorithm on U2R
(Table 6 is reproduced as an image in the original document.)
As seen in table 7, for R2L traffic detection the performance of both the 4 machine learning methods and the 9 deep learning methods is mediocre. The total number of R2L attacks is small, and many R2L intrusions masquerade as legitimate users, making their characteristics similar to normal packets and R2L attacks difficult to detect. The four machine learning methods average 56.599% detection accuracy, 60.187% recall, a 58.328% F-measure, and an AUC of 0.657, all far below the method of the invention but close to the CNN method, for the reasons given above. The accuracy of the 9 deep learning methods is below 90%, but among them the method of the invention performs best, with a satisfactory classification effect.
Table 7 performance evaluation index of each algorithm on R2L
(Table 7 is reproduced as an image in the original document.)
In summary, the comparisons of performance parameters in tables 3-7 show that the detection model of the invention obtains better performance index values on Normal data and the four attack types and can effectively classify the NSL-KDD data set, verifying the effectiveness and superiority of the method on multi-class tasks.
(3) Time complexity analysis
To further verify the superiority of the method of the invention in time complexity, the experiment compares the perplexity (PPL) values (smaller is better) and the time consumption (shorter is better) of the 9 deep learning methods at different numbers of network layers, as shown in table 8.
TABLE 8 Time complexity comparison of the 9 deep learning methods
(Table 8 is reproduced as an image in the original document.)
Table 8 shows that the number of neural network layers affects both the performance and the time consumption of the trained model, and that the CNN, RNN, and DBN methods perform poorly in this experimental task. Compared with the GRU and LSTM results, the CNN performs badly: its PPL reaches 144.36, and although increasing the network to 5 layers reduces the PPL somewhat, it remains large at 135.19, with the highest time consumption of the 9 methods. The PPL values of the remaining methods are relatively small, but their time grows as the number of layers increases while the PPL changes little, so the effect is not ideal. The lowest time consumption among these methods belongs to the GRU method, but its PPL is too large. The time consumption of the method of the invention is higher than that of the GRU and ELM methods, because it optimizes the network with the residual GRU and generates and updates data with the two-step game ELM, raising the time complexity; among the 9 methods, however, its PPL is the lowest and its performance the best. As the number of network layers keeps increasing, the PPL decreases correspondingly but the time consumption also rises, so the overall comparison is best at 5 network layers, where the time consumption of the method of the invention is 43.13 s, within an acceptable range and with suitable complexity.
(4) Generalization capability and robustness verification
To further verify the robustness and applicability of the method in complex environments, the experiments set the data-attribute feature-destruction rate to 0.1, 0.2, and 0.3 under different degrees of noise interference on the measured samples, and compare the traffic detection accuracy and root mean square error (RMSE) of the 9 deep learning models in the multi-class scenario. Each experiment is repeated independently 30 times and averaged; the results are shown in table 9.
TABLE 9 accuracy of noisy flow detection for multiple classes of scenes using different models
[Table 9 is presented as an image in the original publication.]
As can be seen from Table 9, when the measured data features are damaged, the CNN-based anomaly detection model shows the worst traffic detection accuracy and the largest RMSE value, so it is affected the most; as the destruction rate increases, its accuracy continuously decreases and its RMSE value continuously increases. The detection accuracy of the RNN-, DBN-, LSTM- and GRU-based models decreases by a smaller amount than that of CNN, but their RMSE values remain large, so these detection models are unstable. Compared with these methods, ELM, FARF-OSKBIELM and Hessian-ELM are more stable, although their accuracy still declines further as the feature destruction rate increases.
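The feature-destruction experiment summarized in Table 9 can be reproduced in outline: a random fraction of attribute values is invalidated and the resulting prediction error is scored with RMSE. A minimal sketch, under the assumption that "destruction" means zeroing attribute values (the patent does not specify the mechanism):

```python
import numpy as np

def destroy_features(X, rate, rng):
    """Invalidate a random fraction `rate` of attribute values by zeroing
    them, simulating the 0.1 / 0.2 / 0.3 destruction rates of the test."""
    damaged = X.copy()
    damaged[rng.random(X.shape) < rate] = 0.0
    return damaged

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))          # 100 samples, 8 traffic attributes
X_damaged = destroy_features(X, rate=0.3, rng=rng)
```

Repeating the detection run on `X_damaged` at each rate, 30 times per setting, and averaging gives the accuracy/RMSE pairs of the kind Table 9 reports.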
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features equivalently replaced, and such modifications or substitutions do not depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for detecting abnormal network traffic is characterized in that,
acquiring target network traffic characteristic information according to target network traffic data, wherein the target network traffic characteristic information is obtained by preprocessing target original transceiving data, and the target original transceiving data belongs to the target network traffic data; the preprocessing comprises a feature screening step, the feature screening step comprising feature importance sorting, redundant information deletion and information maximization processing, wherein the information maximization processing uses an SVM model to predict the target network traffic abnormality result and judges, according to that prediction result, whether the target network traffic feature set is optimal;
determining target network traffic abnormal information corresponding to the target network traffic characteristic information through a target network abnormal traffic detection model, wherein the target network abnormal traffic detection model is generated by carrying out detection accuracy training on traffic information;
and generating a network abnormal traffic detection result according to the target network traffic abnormal information.
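The feature-importance sorting named in claim 1 can be illustrated with the Fisher Score mentioned in claim 3. The sketch below assumes the classical Fisher Score (between-class scatter over within-class scatter) stands in for the patent's improved variant, whose exact form is not disclosed here:

```python
import numpy as np

def fisher_scores(X, y):
    """Classical Fisher Score per feature: between-class variance over
    within-class variance. Higher scores mark more discriminative features."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)   # small epsilon avoids division by zero

# Rank features by importance, highest score first (the sorting step of claim 1).
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = rng.standard_normal((100, 3))
X[:, 0] += 3.0 * y                      # only feature 0 separates the classes
ranking = np.argsort(fisher_scores(X, y))[::-1]
```

In the claimed flow, redundancy deletion would then prune correlated features from the top of this ranking, and the SVM prediction decides when the retained feature set is optimal.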
2. The network abnormal traffic detection method of claim 1, wherein the detection accuracy training includes a feature extraction step, a data generation step, and a classification step,
the characteristic extraction step adopts a bidirectional residual GRU model, and changes a candidate hidden state activation function of the GRU into a linear rectification function;
the data generation step adopts the form of a generative adversarial network: minority-class sample fragments are generated through the mutual confrontation of two dynamic ELMs, the overall fitting degree is quantified using information entropy, and the generated minority-class sample fragments are screened according to the principal component analysis result of the sample data after the characteristic extraction step;
and the classification step adopts an ensemble learning method to combine base classifiers into a strong classifier, wherein each base classifier has an ELM structure.
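The modified GRU of claims 2 and 4, whose candidate hidden state uses a linear rectification function (ReLU) instead of tanh and carries a residual connection, can be sketched as a single forward step. The weight layout and the exact placement of the skip term are assumptions for illustration; the claims only state the two modifications:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ResidualGRUCell:
    """GRU step with a ReLU candidate activation and an input skip connection."""

    def __init__(self, input_size, hidden_size, rng):
        scale = 1.0 / np.sqrt(hidden_size)
        shape = (hidden_size, input_size + hidden_size)
        self.Wz = rng.uniform(-scale, scale, shape)   # update gate weights
        self.Wr = rng.uniform(-scale, scale, shape)   # reset gate weights
        self.Wh = rng.uniform(-scale, scale, shape)   # candidate weights
        # Assumed residual projection of the input into the candidate state:
        self.Wx = rng.uniform(-scale, scale, (hidden_size, input_size))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                     # update gate
        r = sigmoid(self.Wr @ xh)                     # reset gate
        candidate = np.maximum(                        # ReLU instead of tanh
            self.Wh @ np.concatenate([x, r * h]) + self.Wx @ x,  # residual skip
            0.0,
        )
        return (1.0 - z) * h + z * candidate

rng = np.random.default_rng(2)
cell = ResidualGRUCell(input_size=6, hidden_size=4, rng=rng)
h = np.zeros(4)
for x in rng.standard_normal((5, 6)):                  # a 5-step sequence
    h = cell.step(x, h)
```

A bidirectional layer, as the claims require, would run a second cell over the reversed sequence and concatenate the two final hidden states.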
3. A network abnormal flow detection model is characterized by comprising a preprocessing model, a characteristic extraction model, a data generation model and a classification model;
the preprocessing model selects target network traffic characteristics based on an improved Fisher Score and the maximal information coefficient, predicts the target network traffic abnormality result using a support vector machine (SVM) model, and judges according to that prediction result whether the target network traffic characteristic set is optimal;
the feature extraction model extracts the target network traffic features based on a bidirectional residual GRU model;
the data generation model adopts the form of a generative adversarial network and generates minority-class sample fragments through the mutual confrontation of two dynamic ELMs, while quantifying the overall fitting degree using information entropy and screening the generated minority-class sample fragments according to the principal component analysis result of the sample data after the characteristic extraction step;
the classification model adopts an ensemble learning method to predict the target network traffic abnormal result by using a plurality of base classifiers, wherein the base classifiers are ELM structures.
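The ELM base classifier named in claim 3 admits a compact sketch: a fixed random hidden layer followed by a closed-form least-squares output layer. This is the generic ELM; the patent's integrated dynamic variant and its ensemble weighting are not public, so they are not reproduced here:

```python
import numpy as np

class ELM:
    """Basic Extreme Learning Machine: the random input weights are never
    trained; only the output weights `beta` are solved in closed form."""

    def __init__(self, n_inputs, n_hidden, rng):
        self.W = rng.standard_normal((n_hidden, n_inputs))
        self.b = rng.standard_normal(n_hidden)

    def _hidden(self, X):
        # Sigmoid hidden-layer activations H = g(X W^T + b)
        return 1.0 / (1.0 + np.exp(-(X @ self.W.T + self.b)))

    def fit(self, X, Y):
        # beta = H^+ Y via the Moore-Penrose pseudoinverse
        self.beta = np.linalg.pinv(self._hidden(X)) @ Y
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)        # toy binary labels
model = ELM(n_inputs=5, n_hidden=60, rng=rng).fit(X, y)
train_error = float(np.mean(np.abs(model.predict(X) - y)))
```

Claim 3 then combines many such base classifiers into a strong classifier, for example by majority vote or weighted averaging of their outputs.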
4. The network abnormal traffic detection model of claim 3, wherein the bidirectional residual GRU model changes the original GRU candidate hidden state activation function to a non-saturating activation function, preferably a linear rectification function, and introduces a residual structure into the GRU candidate hidden state.
5. The network abnormal traffic detection model of claim 3, wherein the data generation model adopts the form of a generative adversarial network, generates minority-class sample fragments through the mutual confrontation of two dynamic ELMs, quantifies the overall fitting degree using information entropy, and screens the generated minority-class sample fragments according to the principal component analysis result of the sample data after the feature extraction step.
6. A method for training a network abnormal flow detection model is characterized by comprising the following steps:
acquiring target network traffic data and preprocessing the target network traffic data to acquire target traffic characteristic information, wherein the preprocessing selects target network traffic characteristics based on an improved Fisher Score and the maximal information coefficient, predicts the target network traffic abnormality result using a support vector machine (SVM) model, judges according to that prediction result whether the target network traffic characteristic set is optimal, and screens the target characteristic set;
acquiring flow characteristic optimization information, wherein a bidirectional residual GRU model is adopted to perform characteristic extraction on the target network traffic characteristics, the bidirectional residual GRU model changing the original GRU candidate hidden state activation function to a non-saturating activation function, preferably a linear rectification function, and introducing a residual structure into the GRU candidate hidden state;
generating minority-class sample fragments, wherein the minority-class sample fragments are generated according to the flow characteristic optimization information through the mutual confrontation of two dynamic ELMs;
and training on the flow characteristic optimization information and the minority-class sample fragments to obtain integrated dynamic ELM model parameters and generate a target network flow abnormality detection model, wherein the target network flow abnormality detection model is used for detecting traffic abnormality information.
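The information-entropy quantification of the "overall fitting degree" used when generating minority-class fragments can be sketched with a histogram estimate of Shannon entropy. The histogram estimator is an assumption; the claims do not specify how the entropy is computed:

```python
import numpy as np

def histogram_entropy(samples, bins=10):
    """Shannon entropy (in bits) of a histogram over generated samples:
    a widely spread distribution scores high, a collapsed one scores near 0."""
    counts, _ = np.histogram(samples, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                        # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

# Generated fragments that cover the feature range score higher entropy
# than fragments collapsed onto a single mode.
spread = histogram_entropy(np.linspace(0.0, 1.0, 1000), bins=10)
collapsed = histogram_entropy(np.full(1000, 0.5), bins=10)
```

In the claimed training loop, this scalar would serve as a check that the two confronting ELMs are producing diverse minority-class fragments rather than mode-collapsed ones, before PCA-based screening.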
7. A system for detecting abnormal traffic in a network, comprising:
an acquisition module that acquires target network traffic data;
the preprocessing module preprocesses the target network traffic data to acquire target traffic characteristic information; preferably, the preprocessing module selects target network traffic characteristics based on an improved Fisher Score and the maximal information coefficient, predicts the target network traffic abnormality result using a support vector machine (SVM) model, judges according to that prediction result whether the target network traffic characteristic set is optimal, and screens the target characteristic set;
the characteristic extraction module acquires flow characteristic optimization information, wherein a bidirectional residual GRU model is adopted to perform characteristic extraction on the target network traffic characteristics;
the data generation module generates minority-class sample fragments according to the flow characteristic optimization information, wherein the minority-class sample fragments are generated, based on the extraction result of the bidirectional residual GRU model, through the mutual confrontation of two dynamic ELMs;
and the detection module trains on the flow characteristic optimization information and the minority-class sample fragments to obtain integrated dynamic ELM model parameters and generate a target network flow abnormality detection model, wherein the target network flow abnormality detection model is used for detecting traffic abnormality information.
8. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 3, or perform the method of claim 6.
CN202110013425.6A 2021-01-06 2021-01-06 Network abnormal flow detection method, model and system Active CN112784881B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110013425.6A CN112784881B (en) 2021-01-06 2021-01-06 Network abnormal flow detection method, model and system


Publications (2)

Publication Number Publication Date
CN112784881A true CN112784881A (en) 2021-05-11
CN112784881B CN112784881B (en) 2021-08-27

Family

ID=75755824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110013425.6A Active CN112784881B (en) 2021-01-06 2021-01-06 Network abnormal flow detection method, model and system

Country Status (1)

Country Link
CN (1) CN112784881B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101610516A (en) * 2009-08-04 2009-12-23 华为技术有限公司 Intrusion detection method in the self-organizing network and equipment
CN109412900A (en) * 2018-12-04 2019-03-01 腾讯科技(深圳)有限公司 A kind of network state knows the method and device of method for distinguishing, model training
CN109818961A (en) * 2019-01-30 2019-05-28 广东工业大学 A kind of network inbreak detection method, device and equipment


Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113269356A (en) * 2021-05-18 2021-08-17 中国人民解放军火箭军工程大学 Missing data-oriented equipment residual life prediction method and system
CN113269356B (en) * 2021-05-18 2024-03-15 中国人民解放军火箭军工程大学 Missing data-oriented equipment residual life prediction method and system
CN113364702A (en) * 2021-06-04 2021-09-07 上海明略人工智能(集团)有限公司 Advertisement flow abnormity detection method, system, storage medium and electronic equipment
CN113364702B (en) * 2021-06-04 2024-04-12 上海明略人工智能(集团)有限公司 Advertisement traffic abnormality detection method, system, storage medium and electronic equipment
CN113505611A (en) * 2021-07-09 2021-10-15 中国人民解放军战略支援部队信息工程大学 Training method and system for obtaining better speech translation model in generation of confrontation
CN113409092A (en) * 2021-07-12 2021-09-17 上海明略人工智能(集团)有限公司 Abnormal feature information extraction method, system, electronic device and medium
CN113409092B (en) * 2021-07-12 2024-03-26 上海明略人工智能(集团)有限公司 Abnormal feature information extraction method, system, electronic equipment and medium
CN113609096A (en) * 2021-07-19 2021-11-05 北京智思迪科技有限公司 Data processing method and device
CN113485244A (en) * 2021-07-27 2021-10-08 同济大学 Numerical control machine tool control system and method based on cutter wear prediction
CN113554474A (en) * 2021-08-11 2021-10-26 上海明略人工智能(集团)有限公司 Model verification method and device, electronic equipment and computer-readable storage medium
WO2023016159A1 (en) * 2021-08-12 2023-02-16 北京邮电大学 Method and system for predicting network traffic of smart city
CN115021973B (en) * 2022-05-11 2024-04-05 桂林电子科技大学 Novel intrusion detection method based on SGRU
CN115021973A (en) * 2022-05-11 2022-09-06 桂林电子科技大学 Novel intrusion detection method based on SGRU
CN115277098B (en) * 2022-06-27 2023-07-18 深圳铸泰科技有限公司 Network flow abnormality detection device and method based on intelligent learning
CN115277098A (en) * 2022-06-27 2022-11-01 深圳铸泰科技有限公司 Intelligent learning-based network flow anomaly detection device and method
CN115174178B (en) * 2022-06-28 2023-07-04 南京邮电大学 Semi-supervised network traffic anomaly detection method based on generation of countermeasure network
WO2024000944A1 (en) * 2022-06-28 2024-01-04 南京邮电大学 Elm- and deep-forest-based hybrid model traffic anomaly detection system and method
CN115174178A (en) * 2022-06-28 2022-10-11 南京邮电大学 Semi-supervised network flow abnormity detection method based on generation countermeasure network
CN115811440A (en) * 2023-01-12 2023-03-17 南京众智维信息科技有限公司 Real-time flow detection method based on network situation awareness
CN117354056A (en) * 2023-12-04 2024-01-05 中国西安卫星测控中心 Network intrusion detection method based on convolutional neural network and integrated learning algorithm
CN117354056B (en) * 2023-12-04 2024-02-13 中国西安卫星测控中心 Network intrusion detection method based on convolutional neural network and integrated learning algorithm

Also Published As

Publication number Publication date
CN112784881B (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN112784881B (en) Network abnormal flow detection method, model and system
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN111126482B (en) Remote sensing image automatic classification method based on multi-classifier cascade model
CN113705526B (en) Hyperspectral remote sensing image classification method
CN110084610B (en) Network transaction fraud detection system based on twin neural network
CN109034194B (en) Transaction fraud behavior deep detection method based on feature differentiation
CN110213244A (en) A kind of network inbreak detection method based on space-time characteristic fusion
Yu et al. Auto-fas: Searching lightweight networks for face anti-spoofing
CN110147321A (en) A kind of recognition methods of the defect high risk module based on software network
CN107292097B (en) Chinese medicine principal symptom selection method based on feature group
CN113159264B (en) Intrusion detection method, system, equipment and readable storage medium
CN113269647B (en) Graph-based transaction abnormity associated user detection method
CN111914728A (en) Hyperspectral remote sensing image semi-supervised classification method and device and storage medium
Li et al. Mining static code metrics for a robust prediction of software defect-proneness
Yu et al. PCWGAN-GP: A new method for imbalanced fault diagnosis of machines
CN112560596A (en) Radar interference category identification method and system
CN114897085A (en) Clustering method based on closed subgraph link prediction and computer equipment
CN114037001A (en) Mechanical pump small sample fault diagnosis method based on WGAN-GP-C and metric learning
Xiao et al. Group-wise feature selection for supervised learning
CN116248392A (en) Network malicious traffic detection system and method based on multi-head attention mechanism
CN115643153A (en) Alarm correlation analysis method based on graph neural network
CN114330650A (en) Small sample characteristic analysis method and device based on evolutionary element learning model training
CN115249513A (en) Neural network copy number variation detection method and system based on Adaboost integration idea
Binu et al. Support vector neural network and principal component analysis for fault diagnosis of analog circuits
Huang et al. Prediction of Heart Disease based on Enhanced Random Forest

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant