CN116055209A - Network attack detection method based on deep reinforcement learning - Google Patents
- Publication number
- CN116055209A (application number CN202310109721.5A)
- Authority
- CN
- China
- Prior art keywords
- agent
- feature
- feedback
- action
- environment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computer Security & Cryptography (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Computer Hardware Design (AREA)
- Evolutionary Computation (AREA)
- Pure & Applied Mathematics (AREA)
- Molecular Biology (AREA)
- Mathematical Optimization (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A network attack detection method based on deep reinforcement learning. The method preprocesses the original data set and constructs an Agent, which includes initializing the Agent's environment, prescribing the interaction mode between the Agent and the environment, and setting the training strategy and cost function. Features are selected according to the state, and the selected features are input into a detection model for prediction. The detection result is returned as feedback to the Agent training module, Q(s, a) of the action is calculated, and the Q table is refreshed. This is repeated until the number of features contained in the optimal feature subset reaches the maximum, i.e., the model converges, or until the training steps are completed, and the optimal feature subset is generated. The processing method designed for novel features can reflect the importance of a novel feature to intrusion attack detection; if the feature is important, a proprietary optimal feature subset for it can be deployed. This embodies the flexibility of the optimal feature subset, so that corresponding measures can be taken spontaneously for different attack situations.
Description
Technical Field
The invention relates to a network attack detection method based on deep reinforcement learning, and belongs to the technical field of information security.
Background
Research on network attack detection methods is abundant, but most existing methods emphasize improving the detection algorithm rather than the feature processing of the original data. Following the principle that data and features determine the upper limit of machine learning, while models and algorithms only approach that upper limit, the invention focuses on the feature processing of the original data. Feature selection (Feature Selection) operates on the original data and features: by eliminating uncorrelated, redundant, and abnormal features, as well as features of little significance, it extracts an optimal feature subset, thereby improving model training precision and reducing running time and resource consumption; it belongs to the class of search optimization problems. At present, traditional feature selection methods fall into three major categories, namely the Filter method, the Wrapper method, and the Embedded method, and include concrete techniques such as the Pearson correlation coefficient, chi-square tests, distance measures, and variance selection; they are mainly realized by combining search techniques, statistics, and other disciplines on the basis of mathematical features. Although considerable research results have been achieved, corresponding disadvantages exist: the calculation process is relatively complex, and the number of data features tends to increase exponentially with dimensionality; moreover, these methods cannot adapt to the evolution of data and are therefore static. In an era in which data features change dynamically, the optimal feature set should be selected flexibly; a determined optimal feature set should not be treated as fixed once and for all, and different methods should be proposed to optimize and update it in view of the actual situation.
In the current big data era, traditional feature selection methods struggle to meet practical demands in the face of massive, high-dimensional data. Despite the development of data mining, machine learning, and related technologies, the criteria for selecting features have remained at the level of mathematical calculation of correlations among data features. Meanwhile, in dynamically changing network environments, more and more intrusion attacks carrying novel features are appearing; an attacker can start from such novel features and design attack means that bypass existing defense and detection measures, causing serious information leakage. For example, in networks whose topology changes dynamically, such as vehicle-mounted networks and satellite communication networks, large-scale training data makes the search for the optimal feature set extremely slow, so it cannot keep up with the speed of environmental change. In addition, if novel features are not considered, then even when the optimal feature set is obtained from the existing features, an attacker can still design an intrusion attack exploiting the novel features and thereby obtain sensitive data.
In summary, traditional feature selection methods cannot select the optimal feature set well and therefore cannot provide good resistance. Later scholars proposed feature selection algorithms based on simple machine learning, feature selection based on traditional deep learning models, and related improved algorithms to address the security problems caused by massive high-dimensional features and novel features, but certain defects remain. For example, Nugroho et al., in "A Review of intrusion detection system in IoT with machine learning approach: Current and future research", analyzed and organized the performance of various machine learning algorithms for intrusion detection in Internet of Things equipment over nearly five years, finding that the Support Vector Machine (SVM) and the Artificial Neural Network (ANN) received the most investment in the intrusion classification process and yielded good results. Kilinger et al., in "Machine learning methods for cyber security intrusion detection: Datasets and comparative study", studied a variety of open-source intrusion detection data sets and classified them using the K-nearest neighbor (KNN) and Decision Tree (DT) algorithms, achieving successful results considered helpful for studying intrusion detection mechanisms with machine learning on the basis of artificial intelligence. In "New hybrid method for attack detection using combination of evolutionary algorithms, SVM, and ANN", Hosseini et al., in order to improve training effect and performance, combined SVM with ANN and ANN with Decision Tree (DT) successively, finally succeeding in reducing feature dimensionality and optimizing training time. Although these methods are fast, they cannot extract deep network data information, cannot identify new network attacks, and cannot be applied to networks whose environments change strongly. In addition, Akhtar et al., in "Deep learning-based framework for the detection of Cyberattack using feature engineering", adopted a Convolutional Neural Network (CNN) classification model to detect DoS attacks; that research fully extracted the data characteristics and obtained higher accuracy. Mehedi et al., in "Deep transfer learning based intrusion detection system for electric vehicular networks", proposed a LeNet model based on deep transfer learning that greatly improves the accuracy of intrusion detection and offers better security performance than mainstream machine learning, deep learning, and reference deep transfer learning models. Deep learning can extract deep features of the original data with a multi-layer neural network and identify network attacks through continuous iterative training. Although it can effectively process massive data features to a certain extent, it remains in essence a static model, and its handling of unknown novel features is still deficient.
Disclosure of Invention
In order to effectively solve the selection problem for massive features and novel unknown features, the invention provides a network attack detection method based on deep reinforcement learning, mainly applied in the field of intrusion attack detection; its flexibility is embodied in the way novel features are processed. The scheme first provides a method combining feature selection and anomaly detection: the anomaly detection result serves as feedback to the Agent, a reward mechanism is designed around this feedback, and the Agent assigns a corresponding reward to each feature, so that after model convergence the optimal feature set can be selected directly from training experience. This set is designated the general optimal feature subset (its length is fixed at max). The invention then processes novel features so as to prevent an attacker from leaking sensitive information through them. When a novel feature appears, it is first assumed to be a member of the optimal feature subset (placed as the first feature, in view of the correlations among features), and step one is repeated to select the remaining max-1 features that make up the subset. If the detection indexes improve noticeably compared with the general optimal feature subset, the new subset is designated the proprietary optimal feature subset of that novel feature, or of the novel intrusion attack built around it; otherwise the novel feature is considered to have no great research significance and can be temporarily ignored, with subsequent detection still using the general feature subset. Finally, because the algorithm combines the perception capability of deep learning with the decision-making capability of reinforcement learning, the optimal feature subset selected by the method is flexible and can determine whether designing a proprietary feature subset for a novel feature is meaningful; the method is therefore applicable to dynamically changing network environments and to the detection of novel intrusion attacks.
The technical scheme of the invention is as follows:
a network attack detection method based on deep reinforcement learning comprises the following steps:
(1) Feature selection agent environment state model construction;
the environment state model, namely the environment required by the agent, comprises definition of a reward function and design of an interactive feedback rule, and comprises the following specific contents:
(1.1) First use U_t to represent the discounted future cumulative reward earned by the agent at time t, then consider the specific application context.

Discounted future cumulative reward U_t: the agent perceives the state of the environment and the feedback signal r_t the environment provides, and maximizes the discounted future cumulative reward by learning to select actions. Because the randomness of the environment makes the agent's state, and the actions it takes, increasingly uncertain as the number of steps grows, a discount factor γ is introduced to weaken the strong correlation between steps and reduce this uncertainty and randomness. The discounted future cumulative reward U_t is expressed as:

U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + … = Σ_{k=0}^{∞} γ^k·R_{t+k}

where R_t is the feedback accepted by the agent at time t, and γ ∈ [0,1] is a discount coefficient that favors immediate rewards over delayed rewards.

When γ approaches 0, the current return is emphasized; when γ approaches 1, future returns are emphasized. Because the application background here is intrusion detection, where network traffic samples are discrete and mutually independent (a discrete, categorical-data problem), γ should be as close to 0 as possible so as to weaken the assumed continuity between traffic samples.
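As a minimal illustration of this definition, the following Python sketch computes U_t for a finite feedback sequence. The feedback values and the choice γ = 0.1 are illustrative assumptions, not values fixed by the invention.

```python
def discounted_return(rewards, gamma):
    """Compute U_t = sum_k gamma**k * R_{t+k} over a finite feedback sequence."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# A gamma close to 0, as recommended above for intrusion detection, makes
# U_t depend almost entirely on the immediate feedback R_t.
feedback = [0.9, 0.85, 0.95]  # hypothetical per-step feedback values
print(discounted_return(feedback, gamma=0.1))  # 0.9 + 0.085 + 0.0095 = 0.9945
```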
(1.2) After defining the reward function, design the interactive feedback rule between the Agent and the environment, i.e., the reward mechanism.

A dual reward mechanism is designed, with the evaluation indexes comprising the accuracy, precision, and recall of the detection result and the running time of the model. The reward formula is designed as:

reward = ω·R = ω_a·r_a + ω_p·r_p + ω_r·r_r + ω_t·r_t

where ω denotes a weight matrix measuring the corresponding evaluation indexes, used to express the importance, preference degree, priority, and so on of each index; R denotes the reward matrix, with one reward component per evaluation index; r_a denotes the feedback for accuracy, r_p the feedback for precision, r_r the feedback for recall, and r_t the feedback for running time. Note that the false-alarm rate and missed-alarm rate are not considered in the formula because they are linearly related to the precision and recall; if a separate study of them is desired, they can be added to the formula with corresponding weights.
At each iteration, the newly selected feature is added to the selected feature set. If the indexes obtained when the agent trains with the new feature set decrease, the reward of the new feature is set to -100 (ensuring that subsequent training fully avoids this feature). If the indexes improve, the feedback corresponding to each improved detection index is first recorded, namely accuracy r_a, precision r_p, recall r_r, and running time r_t, and the new reward is then calculated from the weights corresponding to the indexes; for example, if accuracy rises to 90%, then r_a = 0.9. The weights can be handled flexibly according to the actual situation: if only the accuracy of the detection result matters, the weights of the other indexes can be set to values tending to 0, or even to 0; if several detection indexes matter simultaneously, reasonable weights are set according to the requirements.
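A minimal sketch of this reward calculation, assuming a four-component weight vector and the -100 penalty rule described above; the concrete weight and feedback values are illustrative only:

```python
import numpy as np

def compute_reward(r_a, r_p, r_r, r_t, omega, indexes_degraded):
    """reward = omega . R over the four feedback components; -100 when the
    enlarged feature set lowered the detection indexes."""
    if indexes_degraded:
        return -100.0  # ensures subsequent training fully avoids this feature
    return float(np.dot(omega, np.array([r_a, r_p, r_r, r_t])))

# Caring only about accuracy: the weights of the other indexes tend to 0.
omega = np.array([1.0, 0.0, 0.0, 0.0])
print(compute_reward(0.9, 0.88, 0.91, 0.7, omega, indexes_degraded=False))  # 0.9
```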
Advantages: different scenes attach different importance and priority to the evaluation indexes, so the method can be applied to a wider range of scenes, with the evaluation criteria selected according to actual requirements, which improves generalization. A separate study can also be performed for a single index, such as finding the feature subset with the highest accuracy.
(2) Feature selection agent cost function construction;
the cost function is a hope of rewarding, is mainly used for evaluating the quality of different states and guiding the selection of the actions of the intelligent agent, and is also used for evaluating the quality of the intelligent agent in a state s at a certain time t, and the specific contents are as follows:
(2.1) First calculate the cost function Q(s, a), which evaluates the expected return of the agent starting from state s, performing action a, and thereafter following policy π:

Q_π(s, a) = E_π[U_t | S_t = s, A_t = a]

where S_t denotes the state of the agent at time t, A_t denotes the action performed by the agent at time t, and E_π is the expectation under the agent's training strategy π.

After obtaining the values of all possible actions a in the current state s, the agent must, in combination with the training strategy, select the optimal action: under policy π it takes the action with the maximum value among all Q(s, a), which is a_t:

a_t = argmax_a Q(s_t, a),  so that  Q*(s_t, a_t) = max_a Q(s_t, a)

where Q*(s_t, a_t) is the maximum value over all actions at time step t.
(2.2) Through the construction of the cost function, the agent regularly evaluates all possible actions in the current state according to a given strategy. The strategy is defined as follows:

Reinforcement learning involves two very important concepts, exploitation and exploration. Exploitation means the agent follows the principle of maximizing the action value and selects the optimal action among the known actions; exploration means the agent tries other, unknown actions. In any given state the agent can execute only one action, so the two cannot be performed simultaneously; the strategy is used to balance exploitation and exploration.
An ε-greedy strategy is selected: when the agent makes a decision, with probability ε (0 < ε < 1) it randomly selects an unknown action, and with the remaining probability 1-ε it selects the action with the largest value among the existing actions. When the agent selects a feature and finally adds it to the optimal feature set, the feature's Q value must be removed from the action space or reset, that is, the Q value corresponding to the action of selecting that feature is reduced as far as possible, ensuring that the feature is not selected again in subsequent training.
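The following sketch shows one way this ε-greedy selection with Q-value resetting could look. The table size, the value of ε, and the use of -inf as the "reset" value are assumptions for illustration, not details fixed by the invention.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_row, selected, epsilon=0.1):
    """Pick a feature index: explore with probability epsilon, otherwise
    exploit; features already in the optimal subset are masked out."""
    q = q_row.astype(float).copy()
    q[list(selected)] = -np.inf  # "remove or reset" already-chosen actions
    remaining = [a for a in range(len(q)) if a not in selected]
    if rng.random() < epsilon:   # explore: try an unknown action at random
        return int(rng.choice(remaining))
    return int(np.argmax(q))     # exploit: the max-value known action

q_row = rng.random(8)            # hypothetical Q values for 8 candidate features
print(epsilon_greedy(q_row, selected={2, 5}))
```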
Beneficial effects of the invention: although traditional feature selection methods have a certain effect, their heavy use of mathematical calculation consumes a large amount of computing power, and they cannot effectively process novel features; in particular, they cannot prevent intrusion attacks constructed from novel features, which brings a series of security threats. The invention therefore provides a network attack detection method based on deep reinforcement learning.
Deep reinforcement learning combines the massive-data processing capability of deep learning with the flexible decision-making capability of reinforcement learning. For the reinforcement learning construction, the environment of the agent is first defined and the interaction rule and reward mechanism are designed; next, the cost function and training strategy are defined, giving the agent a unique standard for selecting actions; finally, through repeated iterative training, the agent learns to select features with high rewards while avoiding features with low rewards, eventually generating the optimal feature subset.
When a novel feature appears, it can be added directly to the optimal feature set, the new optimal feature set is used as training data to retrain the model, and the detection indexes then determine whether a proprietary optimal feature subset must be set for the novel feature or novel intrusion attack.
Drawings
FIG. 1 is a schematic diagram of an agent detection model based on deep reinforcement learning according to the present invention.
FIG. 2 is a flow chart of feature selection based on deep reinforcement learning according to the present invention.
Fig. 3 is a flow chart of the processing of novel features according to the present invention.
Detailed Description
To make the purpose, technical solutions, and advantages of the present invention clearer, the invention is further described in detail below through its implementation steps and the drawings.
A network attack detection method based on deep reinforcement learning covers how the original data is preprocessed, how the interaction between the agent and the environment proceeds, and how novel features are processed.
Referring to fig. 2, the specific operating procedure of the agent-environment interaction and feature selection is as follows:

Step 1, initialize the environment of the agent: define the state space to represent the selected features, let the action space represent the available actions of selecting a feature from the original features, and set the training strategy and the maximum size max of the optimal feature subset.
Step 2, select an action according to the ε-greedy strategy: with probability ε, randomly explore a possible action in the environment; with probability 1-ε, select the action with the maximum Q value at the current time.
Step 3, after the agent performs the action, obtain the state S_{t+1} at the next time.
Step 4, the detection model, according to the state features currently selected by the agent and in combination with a corresponding machine learning algorithm, predicts whether abnormal behavior exists.
Step 5, calculate the detection indexes from the predicted values and record them as the feedback of the agent, using the formula:

reward = ω·R = ω_a·r_a + ω_p·r_p + ω_r·r_r + ω_t·r_t

where ω denotes the weight matrix measuring the importance, preference degree, priority, and so on of each index, and r_a, r_p, r_r, and r_t correspond respectively to the accuracy, precision, recall, and running time of the detection result.
Step 6, add the feature corresponding to the state to the optimal feature subset.
Step 7, calculate the Q value of the action at the previous time, using the formula:

Q_π(s, a) = E_π[U_t | S_t = s, A_t = a]

where π denotes the training strategy and the Q value denotes the expected discounted future cumulative reward for the state and action at the current time.
Step 8, scan the optimal feature subset; when the number of features it contains reaches the maximum value max, the model has converged and the procedure ends; otherwise, return to step 2 and repeat. An end-to-end sketch of steps 1 through 8 is given below.
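The following Python sketch wires steps 1 through 8 together on synthetic data, collapsing the states into a single-state (bandit-style) Q table for brevity. The dataset, the decision-tree detector, ε, the learning rate, the step cap, and the simple run-time feedback r_t = 1/(1 + t) are all illustrative assumptions; the invention does not prescribe them.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=12, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

n_feat, max_len = X.shape[1], 5                 # step 1: spaces and max size
epsilon, alpha = 0.1, 0.5                       # exploration and learning rates
omega = np.array([0.4, 0.2, 0.2, 0.2])          # weights for r_a, r_p, r_r, r_t
Q = np.zeros(n_feat)                            # Q table over feature actions
subset, best_reward = [], -np.inf
rng = np.random.default_rng(0)

for _ in range(200):                            # cap, per "training completed"
    if len(subset) >= max_len:                  # step 8: subset reached max
        break
    masked = Q.copy()
    masked[subset] = -np.inf                    # never re-pick chosen features
    if rng.random() < epsilon:                  # step 2: epsilon-greedy choice
        a = int(rng.choice([i for i in range(n_feat) if i not in subset]))
    else:
        a = int(np.argmax(masked))
    trial = subset + [a]                        # step 3: next state
    clf = DecisionTreeClassifier(random_state=0)
    t0 = time.time()
    clf.fit(X_tr[:, trial], y_tr)               # step 4: detection model
    pred = clf.predict(X_te[:, trial])
    r_a = accuracy_score(y_te, pred)            # step 5: feedback indexes
    r_p = precision_score(y_te, pred, zero_division=0)
    r_r = recall_score(y_te, pred, zero_division=0)
    r_t = 1.0 / (1.0 + time.time() - t0)        # shorter run time, higher r_t
    reward = float(omega @ np.array([r_a, r_p, r_r, r_t]))
    if reward < best_reward:
        reward = -100.0                         # indexes fell (proxied by the
                                                # weighted reward): punish it
    else:
        best_reward = reward
        subset = trial                          # step 6: keep the feature
    Q[a] += alpha * (reward - Q[a])             # step 7: refresh the Q table

print("optimal feature subset:", sorted(subset))
```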
Referring to fig. 3, the specific operating procedure for processing novel features is as follows:
and 9, converting the novel characteristics into state variables through normalization, single thermal coding and other modes.
Step 10, assume that the novel feature belongs to the optimal feature subset, add it to form new training data, and input the new training data into the detection model.
Step 11, observe whether the detection indexes (accuracy, precision, recall, and running time) improve relative to those obtained with the existing optimal feature set.
Step 12, if the detection indexes improve, give the novel feature a label and mark the resulting subset as the proprietary optimal feature subset of the novel feature or novel intrusion attack.
Step 13, if the detection indexes do not improve, the novel feature is considered to be of little significance, and subsequent detection uses the general optimal feature subset. A compact sketch of steps 9 through 13 follows.
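In this sketch the `evaluate` callback stands in for re-running the selection and detection pipeline above and returning a single weighted detection index; it, the improvement margin, and all names are placeholders for illustration, not part of the claimed method.

```python
import numpy as np

def handle_novel_feature(novel_values, general_subset, general_score,
                         evaluate, margin=0.01):
    # Step 9: normalize the novel feature into a state variable.
    span = novel_values.max() - novel_values.min()
    state = (novel_values - novel_values.min()) / (span + 1e-12)
    # Step 10: assume membership, placing the novel feature first, and
    # re-select the remaining max-1 features around it.
    candidate = ["novel"] + list(general_subset[:-1])
    score = evaluate(candidate, state)
    # Steps 11-12: indexes improved, so label a proprietary optimal subset.
    if score > general_score + margin:
        return "proprietary", candidate
    # Step 13: no improvement; keep using the general optimal subset.
    return "general", list(general_subset)

# Usage with a stub evaluator that reports a slight improvement:
stub = lambda subset, state: 0.93
print(handle_novel_feature(np.array([1.0, 3.0, 2.0]),
                           ["f1", "f2", "f3"], 0.90, stub))
```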
Claims (1)
1. A network attack detection method based on deep reinforcement learning is characterized by comprising the following steps:
(1) Feature selection agent environment state model construction;
the environment state model, i.e., the environment required by the agent, comprises the definition of a reward function and the design of an interactive feedback rule, with the following specific contents:
(1.1) first use U_t to represent the discounted future cumulative reward earned by the agent at time t, then consider the specific application context;

discounted future cumulative reward U_t: the agent perceives the state of the environment and the feedback signal r_t the environment provides, and maximizes the discounted future cumulative reward by learning to select actions; a discount factor γ is introduced to weaken the strong correlation between steps, and U_t is expressed as:

U_t = R_t + γ·R_{t+1} + γ²·R_{t+2} + … = Σ_{k=0}^{∞} γ^k·R_{t+k}

where R_t is the feedback accepted by the agent at time t, and γ ∈ [0,1] is a discount coefficient favoring immediate rewards over delayed rewards;

when γ approaches 0, the current return is emphasized; when γ approaches 1, future returns are emphasized;
(1.2) after defining the reward function, design the interactive feedback rule between the Agent and the environment, i.e., the reward mechanism;

a dual reward mechanism is designed, with the evaluation indexes comprising the accuracy, precision, and recall of the detection result and the running time of the model, and the reward formula designed as:

reward = ω·R = ω_a·r_a + ω_p·r_p + ω_r·r_r + ω_t·r_t

where ω denotes a weight matrix measuring the corresponding evaluation indexes, used to express the importance, preference degree, priority, and so on of each index; R denotes the reward matrix, with one reward component per evaluation index; r_a denotes the feedback for accuracy, r_p the feedback for precision, r_r the feedback for recall, and r_t the feedback for running time;
at each iteration, the newly selected feature is added to the selected feature set; if the indexes obtained when the agent trains with the new feature set decrease, the reward of the new feature is set to -100; if the indexes improve, the feedback corresponding to each improved detection index is first recorded, namely accuracy r_a, precision r_p, recall r_r, and running time r_t, and the new reward is then calculated from the weights corresponding to the indexes;
(2) Feature selection agent cost function construction;
the cost function is the expectation of the reward, used mainly to evaluate the quality of different states and to guide the agent's action selection; equivalently, it evaluates how good it is for the agent to be in state s at time t, with the following specific contents:
(2.1) first calculate the cost function Q(s, a), which evaluates the expected return of the agent starting from state s, performing action a, and thereafter following policy π:

Q_π(s, a) = E_π[U_t | S_t = s, A_t = a]

where S_t denotes the state of the agent at time t, A_t denotes the action performed by the agent at time t, and E_π is the expectation under the agent's training strategy π;

after obtaining the values of all possible actions a in the current state s, the agent, in combination with the training strategy, selects the optimal action: under policy π it takes the action with the maximum value among all Q(s, a), which is a_t:

a_t = argmax_a Q(s_t, a),  so that  Q*(s_t, a_t) = max_a Q(s_t, a)

where Q*(s_t, a_t) is the maximum value over all actions at time step t;
(2.2) through the construction of the cost function, the agent regularly evaluates all possible actions in the current state according to a given strategy; the strategy is defined as follows:

an ε-greedy strategy is selected: when the agent makes a decision, with probability ε (0 < ε < 1) it randomly selects an unknown action, and with the remaining probability 1-ε it selects the action with the largest value among the existing actions; when the agent selects a feature and finally adds it to the optimal feature set, the feature's Q value must be removed from the action space or reset, that is, the Q value corresponding to the action of selecting that feature is reduced as far as possible, ensuring that the feature is not selected again in subsequent training.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310109721.5A | 2023-02-14 | 2023-02-14 | Network attack detection method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202310109721.5A | 2023-02-14 | 2023-02-14 | Network attack detection method based on deep reinforcement learning
Publications (1)

Publication Number | Publication Date
---|---
CN116055209A | 2023-05-02
Family
ID=86127344
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202310109721.5A (Pending) | Network attack detection method based on deep reinforcement learning | 2023-02-14 | 2023-02-14
Country Status (1)

Country | Link
---|---
CN | CN116055209A (en)
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116860838A (en) * | 2023-09-04 | 2023-10-10 | 徐州医科大学 | Data mining method and early warning system based on optimal feature subset strategy |
CN116860838B (en) * | 2023-09-04 | 2023-11-21 | 徐州医科大学 | Data mining method and early warning system based on optimal feature subset strategy |
CN118551262A (en) * | 2024-07-30 | 2024-08-27 | 苏州元脑智能科技有限公司 | Network training method and program product based on cross-site scripting attack detection |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |