CN110958135A - Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning

Info

Publication number
CN110958135A
Authority
CN
China
Prior art keywords
feature
reinforcement learning
ddqn
action
ddos attack
Prior art date
Legal status
Granted
Application number
CN201911071642.XA
Other languages
Chinese (zh)
Other versions
CN110958135B (en)
Inventor
李重
孔玉波
邵浩
吴梅梅
庄慧敏
Current Assignee
Donghua University
Original Assignee
Donghua University
Priority date
Filing date
Publication date
Application filed by Donghua University
Priority to CN201911071642.XA
Publication of CN110958135A
Application granted
Publication of CN110958135B
Legal status: Active
Anticipated expiration

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 - Network architectures or network communication protocols for network security
    • H04L63/14 - Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 - Countermeasures against malicious traffic
    • H04L63/1458 - Denial of Service
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 - Network analysis or design
    • H04L41/145 - Network analysis or design involving simulating, designing, planning or modelling of a network
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/12 - Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiment of the invention discloses a method and a system for feature self-adaptive reinforcement learning DDoS attack elimination. A reduced feature subset is first extracted from collected historical data; a reinforcement learning model is then established according to the latent, predictable spatio-temporal pattern of traffic flow in the Internet of Vehicles; a Q-learning agent trained with the reinforcement learning model selects features suited to the current DDoS attack type, while a DDQN agent is trained asynchronously to obtain the policy π_DDQN(s_t), which guides the Q-learning agent's action selection. By learning attack features adaptively, unknown DDoS attacks in the Internet of Vehicles can be detected with only a small amount of prior knowledge and without dependence on labelled data, yielding a DDoS attack elimination method that meets the low-latency and high-accuracy requirements of the Internet of Vehicles.

Description

Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning
Technical Field
The embodiment of the invention relates to the technical field of vehicle networking, in particular to a method and a system for eliminating DDoS (distributed denial of service) attack in feature adaptive reinforcement learning.
Background
With the development of 5G technology, Mobile Edge Computing (MEC) has been introduced into the Internet of Vehicles to meet the requirement of real-time data processing: each base station acts as an MEC service station, which reduces information-forwarding time and additional network operations. However, base stations are also vulnerable to DDoS (Distributed Denial of Service) attacks launched by vehicles. Such attackers can send data far exceeding the processing capacity of the MEC service station, consuming its resources and causing denial of service, at which point normal vehicles can no longer establish connections with the base station. In Internet of Vehicles applications, the high mobility of vehicles makes DDoS attacks difficult to capture in real time. Moreover, the types of DDoS attack have diversified in recent years, and it is difficult to effectively distinguish attack connections from normal connections using a fixed feature set and historical experience. An intelligent method that can effectively detect DDoS attacks in the Internet of Vehicles environment is therefore needed.
Much existing research on DDoS detection is based on supervised or unsupervised learning. Supervised approaches require the attack feature library to be continuously updated, so they are inflexible and cannot respond to novel attacks in time. Unsupervised approaches can handle many attack types but sometimes produce false alarms, and their performance depends heavily on manually selected features, i.e. on the designer's prior knowledge. Existing reinforcement-learning-based methods are likewise still constrained by prior knowledge and can only detect a single flow at a time, which introduces considerable delay. To date, no research has been dedicated to DDoS attack detection in the Internet of Vehicles. An intelligent DDoS attack detection strategy is therefore needed in the Internet of Vehicles environment that learns attack features adaptively and handles abnormal connections in a distributed, batch fashion.
Disclosure of Invention
Therefore, the embodiment of the invention provides a method and a system for eliminating a DDoS attack by feature adaptive reinforcement learning, so as to solve the problems of prior knowledge constraint, large time delay and low detection accuracy of a DDoS attack detection method in the existing vehicle networking environment.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
according to a first aspect of the embodiments of the present invention, a method for eliminating a feature adaptive reinforcement learning DDoS attack is provided, where the method includes:
in the current k-th time period, dynamically acquiring interaction information data between the base station and the vehicles at a preset time interval and adding the data to the historical data set D'_k and the current data set D_k respectively, and then pre-selecting features;
constructing a reinforcement learning model, including the construction of a model state space, an action space and a reward function;
and according to the pre-selected feature set, using the reinforcement learning model to train a reinforcement learning agent through multiple iterations within a limited time so as to adaptively select features suited to the current DDoS attack type, detect DDoS attack connections and disconnect them.
Further, in the current k-th time period, dynamically acquiring interaction information data between the base station and the vehicles at a preset time interval, adding the data to the historical data set D'_k and the current data set D_k respectively, and then pre-selecting features specifically includes:
obtaining an initial reduced feature subset F', a candidate feature subset CS and a redundant feature pair set PS from the historical data set D'_k;
the reduced feature subset F' stores the features F_i satisfying δ_1 < ref(F_i, L) < δ_2, the candidate feature subset CS stores the redundant features F_i satisfying ref(F_i, L) > δ_2, and the redundant feature pair set PS stores the redundant feature pairs (F_i, F_j) satisfying ref(F_i, F_j) > δ_3 together with their ref(F_i, F_j) values;
ref(F_i, F_j) denotes the correlation between features F_i and F_j, and ref(F_i, L) denotes the correlation between feature F_i and the data item label L (the corresponding formulas are provided as images and are not reproduced here); E(F_i) is the information entropy of F_i, I(F_i; F_j) is the mutual information between F_i and F_j, and |F| denotes the number of all features in the feature set.
Further, the state space in the reinforcement learning model is defined as:
s_(t,k) = {s_a, s_bit, s_occ, s_err}_(t,k)
where the subscript (t, k) denotes the t-th iteration within time period k, s_a denotes the number of connections between the vehicles and the base station in time period k-1, s_bit denotes the average number of data bytes sent by the vehicles to the base station in time period k-1, s_occ denotes the proportion of connections exceeding a time threshold e among all connections, and s_err denotes the proportion of connections with "SYN" errors among all connections.
Further, the action space in the reinforcement learning model is defined over the sub-actions a_add, a_del, a_keep, c_del and c_keep (the action-space formula is provided as an image and not reproduced here), which are defined as follows:
action a_add indicates that the reinforcement learning agent selects a feature F_add from the candidate feature subset CS according to the Upper Confidence Bound (UCB) algorithm and adds it to the reduced feature subset F'; F' ← F' ∪ {F_add} and CS ← CS \ {F_add} are then performed to update the reduced feature subset F' and the candidate feature subset CS;
action a_del indicates that the reinforcement learning agent removes a feature F_del from the reduced feature subset F' according to the probability P_d(F_i) and adds it to the candidate feature subset CS; F' ← F' \ {F_del} and CS ← CS ∪ {F_del} are then performed to update the reduced feature subset F' and the candidate feature subset CS, where the calculation rule for the probability P_d(F_i) is provided as an image and not reproduced here;
for computational convenience, if a feature pair (F_i, F_j) is not recorded in the redundant feature pair set PS, ref(F_i, F_j) is taken as 0;
action a_keep indicates that the reinforcement learning agent keeps the current reduced feature subset F' and candidate feature subset CS unchanged;
action c_del indicates that the reinforcement learning agent randomly selects a cluster c_i and disconnects all connections in the cluster c_i;
action c_keep indicates that the reinforcement learning agent keeps the current clustering and the existing connections unchanged.
Further, the reward function in the reinforcement learning model is established as follows:
the following Internet-of-Vehicles traffic flow state-space model is established as a linear time-invariant discrete system:
X_k = Γ X_{k-1} + w_k
Y_k = H X_k + v_k
where X_k is the system state vector, Y_k is the system measurement vector, Γ is the state transition matrix, H is the measurement matrix, w_k denotes the process noise associated with the randomness of traffic flow fluctuations in the Internet of Vehicles and with prediction-model inaccuracy, and v_k denotes the measurement noise during data collection; w_k and v_k are assumed to be uncorrelated zero-mean Gaussian white noise processes whose covariance matrices are W_k and V_k respectively (the covariance formulas are provided as images and not reproduced here);
the state vector prediction X̂_{k|k-1} obtained by Kalman filtering from time period k-1 and the state vector estimate X̂_{(t,k)} of the t-th iteration of an episode within time period k are obtained;
the reinforcement learning agent executes the corresponding action according to the current state and, after transferring to a new state, obtains the corresponding reward value r_{t+1} (the reward formula and its accompanying definition are provided as images and not reproduced here).
further, the predicted value
Figure BDA0002261132150000046
And the estimated value
Figure BDA0002261132150000047
The calculation method comprises the following steps:
the kalman filter prediction and update formula is as follows:
Figure BDA0002261132150000048
Pk|k-1=ΓPk-1ΓT+Wk-1
Gk=Pk|k-1HT(HPk|k-1HT+Vk)-1
Figure BDA0002261132150000049
Pk=Pk|k-1-GkHPk|k-1
wherein the content of the first and second substances,
Figure BDA0002261132150000051
and
Figure BDA0002261132150000052
predicted and estimated values representing the state of the system, estimated values
Figure BDA0002261132150000053
Including predictive value
Figure BDA0002261132150000054
And prediction error
Figure BDA0002261132150000055
GkAs a weighting factor, representing a Kalman gain matrix, Pk|k-1Representing the prior estimation error covariance matrix, PkRepresenting a covariance matrix of a posterior estimation error;
state vector estimation for the t-th iteration in a curtain over a k time period
Figure BDA0002261132150000056
Comprises the following steps:
Figure BDA0002261132150000057
wherein, Y(t+1,k)Is the measurement of the t-th iteration in one screen in the k period.
Further, according to the historical data set D'_k and the current data set D_k, using the reinforcement learning model to train the reinforcement learning agent through multiple iterations within a limited time so as to adaptively select features suited to the current DDoS attack type, detect DDoS attack connections and disconnect them specifically includes:
selecting an action a_t according to the policy π_DDQN(s_t) and the policy π_Q(s_t);
executing the sub-action a_add, a_del or a_keep of action a_t to update the feature sets F' and CS; the Q-learning agent obtains the clustering result C_t = {c_1, c_2, ..., c_n, ...}_t of the data set D_t according to the reduced feature subset F', then executes the sub-action c_del or c_keep of action a_t, and thereby obtains the updated data set D_{t+1}, the state s_{t+1} and the state estimate;
according to the obtained state s_t, the action a_t and the post-transition state s_{t+1}, obtaining the reward value r_{t+1}, updating the Q-table according to the Q-learning algorithm and updating the cumulative reward R_e, where the Q-table update formulas are:
z_{t+1} = r_{t+1} + γ max_a Q(s_{t+1}, a)
Q(s_t, a_t) ← Q(s_t, a_t) + α (z_{t+1} - Q(s_t, a_t))
where α ∈ (0, 1) denotes the learning rate and γ ∈ [0, 1] denotes the discount factor;
the cumulative reward is updated as:
R_e ← R_e + r_{t+1}
for each c_m ∈ C_{t+1}, if the termination condition holds (the condition is provided as an image and not reproduced here), the current episode has reached its end condition, where the two quantities appearing in the condition denote, respectively, the set of remaining clusters and the system state estimate obtained after deleting the cluster c_m;
after the current episode reaches the end condition, the cumulative reward R_e is compared with the maximum cumulative reward of the first e-1 episodes in order to obtain a better DDoS attack elimination result: if the cumulative reward exceeds that maximum, the Q-learning agent updates the data set D_{k+1}, the maximum cumulative reward value and the system state estimate; otherwise they remain unchanged;
when the action-state value function Q(s, a) converges or the end of the current time period is reached, the Q-learning agent ends the whole iteration process; at this point the connections in the data set D_{k+1} are treated as normal connections and the connections in D_k - D_{k+1} are treated as abnormal connections and are disconnected; otherwise the feature set F' and the candidate feature set CS are reset and the next episode is started, and the process is restarted after the next time period arrives.
Further, the action a_t is selected according to the following rule:
the Q-learning agent selects the action a_t according to the policy π_DDQN(s_t), where the policy π_DDQN(s_t) is obtained by asynchronously training a DDQN agent; if, over several consecutive episodes of the current time period, the maximum cumulative reward no longer changes, the Q-learning agent stops using the policy π_DDQN(s_t) to select the action a_t, and during the remaining time the Q-learning agent selects the action a_t according to the policy π_Q(s_t) and the ε-greedy method.
Further, obtaining the policy π_DDQN(s_t) by asynchronously training the DDQN agent specifically includes:
asynchronously training the DDQN agent to obtain the policy π_DDQN(s_t), where "asynchronously" means that the DDQN agent is trained in a separate process to obtain the policy π_DDQN(s_t), so that an asynchronous effect is achieved without disturbing the normal operation of the Q-learning algorithm;
the DDQN agent selects an action using the ε-greedy method and the policy π_DDQN(s_t) and executes that action; the DDQN agent transfers to a new state and obtains an immediate reward value; the DDQN agent then stores the current transition (state, action, reward and next state) in a replay buffer M, and updates and optimizes the neural network according to the DDQN algorithm using mini-batch gradient descent, where the loss function is defined with respect to the target value of the DDQN agent (the loss-function and target-value formulas are provided as images and not reproduced here); M_b denotes the number of data items in a mini-batch and γ ∈ [0, 1] is the discount factor; in addition, every τ steps the DDQN agent copies the value of the parameter θ to the target-network parameter;
when the termination condition is reached, the DDQN agent ends the current episode; iterative updating continues until the current time period ends, and the DDQN iterative updating process is restarted after time period k+1 arrives.
According to a second aspect of the embodiments of the present invention, a system for eliminating a feature adaptive reinforcement learning DDoS attack is provided, where the system includes:
a data preprocessing module, configured to dynamically acquire, in the current k-th time period, interaction information data between the base station and the vehicles at a preset time interval, add the data to the historical data set D'_k and the current data set D_k respectively, and then pre-select features;
the model building module is used for building a reinforcement learning model and comprises a model state space, an action space and a reward function;
and a DDoS attack elimination module, configured to use the reinforcement learning model, according to the pre-selected feature set, to train the reinforcement learning agent through multiple iterations within a limited time so as to adaptively select features suited to the current DDoS attack type, detect DDoS attack connections and disconnect them.
The embodiment of the invention has the following advantages:
According to the method and system for feature self-adaptive reinforcement learning DDoS attack elimination provided by the embodiments of the invention, in the current k-th time period, interaction information data between the base station and the vehicles is acquired and added to the historical data set D'_k and the current data set D_k respectively, and features are then pre-selected; a reinforcement learning model is constructed, including a model state space, an action space and a reward function; and, according to the pre-selected feature set, a reinforcement learning agent is trained over multiple iterations within a limited time to adaptively select features suited to the current DDoS attack type, detect DDoS attack connections and disconnect them. By learning attack features adaptively, unknown types of DDoS attack in the Internet of Vehicles can be detected with only a small amount of prior knowledge and without dependence on labelled data, yielding a DDoS attack elimination method that meets the low-latency and high-accuracy requirements of the Internet of Vehicles and provides a new idea and a new detection method for the DDoS attack detection problem in the Internet of Vehicles.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.
Fig. 1 is a schematic flowchart of a method for eliminating DDoS attack through feature adaptive reinforcement learning according to embodiment 1 of the present invention;
fig. 2 is a schematic view of a DDoS attack elimination flow of a vehicle networking according to a feature adaptive reinforcement learning DDoS attack elimination method provided in embodiment 1 of the present invention;
fig. 3 is a schematic flowchart of the asynchronous training phase of the policy π_DDQN(s_t) in the Internet of Vehicles, for the feature adaptive reinforcement learning DDoS attack elimination method provided in embodiment 1 of the present invention;
fig. 4 is a schematic structural diagram of a feature adaptive reinforcement learning DDoS attack elimination system provided in embodiment 2 of the present invention.
Detailed Description
The present invention is described below by way of particular embodiments; other advantages and effects of the invention will become readily apparent to those skilled in the art from the disclosure of this specification. It is to be understood that the described embodiments are merely some, rather than all, of the embodiments of the invention and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
The embodiment 1 of the invention provides a method for eliminating DDoS attack by feature self-adaptive reinforcement learning, which mainly comprises three stages, namely a data preprocessing stage, a reinforcement learning model establishing stage and a DDoS attack eliminating stage by using reinforcement learning.
Specifically, the time axis is divided into equal time periods. The division must be appropriate: a time period that is too long cannot meet the dynamic requirements of the Internet of Vehicles, while one that is too short does not provide enough data to obtain a high-accuracy machine learning model. Within each time period the reinforcement learning agent is trained through multiple iterations, and after each divided time period ends the following three stages are restarted.
As shown in fig. 1 in detail, the method includes the following steps:
Step 110: in the current k-th time period, dynamically acquire interaction information data between the base station and the vehicles at a preset time interval, add the data to the historical data set D'_k and the current data set D_k respectively, and then pre-select features.
This is the data preprocessing stage, which is responsible for collecting the current traffic flow data, analysing the historical data and extracting a good reduced feature subset from the collected historical data.
In the divided current k-th time period, the base station dynamically collects interaction information data between itself and the vehicles at regular intervals and adds the data to the historical data set D'_k and the current data set D_k respectively. The historical data set D'_k is the accumulation of the data sets stored so far, D'_k = D_1 ∪ D_2 ∪ … ∪ D_k; it changes dynamically, with the information of each time period being stored into D'_k, so it records the data collected in past time periods. Not all data are kept, however: the longer the history, the more storage space is needed, so generally only the history of a limited number of time periods is saved rather than the data of all time periods. A predetermined amount of storage is reserved for the data (here the historical data set D'_k may be given a size of 100 MB), and new data overwrite the original data from the beginning.
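As an illustration of the rolling storage just described, the Python sketch below keeps the current data set D_k and a size-capped historical data set D'_k; the 100 MB cap, the record format and all identifiers are assumptions made purely for illustration.
```python
import sys
from collections import deque

class TrafficDataStore:
    """Sketch of the per-period data sets: D_k holds the current period's records,
    and D'_k is a size-capped rolling history whose oldest records are dropped
    once the reserved space is exceeded."""

    def __init__(self, history_cap_bytes=100 * 1024 * 1024):  # assumed 100 MB cap
        self.history_cap_bytes = history_cap_bytes
        self.history = deque()       # D'_k: records accumulated over past periods
        self.history_bytes = 0
        self.current = []            # D_k: records of the current period only

    def add_record(self, record):
        """Append one base-station/vehicle interaction record to D_k and D'_k."""
        self.current.append(record)
        self.history.append(record)
        self.history_bytes += sys.getsizeof(record)
        # Overwrite from the beginning: drop the oldest records when full.
        while self.history_bytes > self.history_cap_bytes and self.history:
            dropped = self.history.popleft()
            self.history_bytes -= sys.getsizeof(dropped)

    def start_new_period(self):
        """At the start of time period k+1, reset D_k while keeping D'_k."""
        self.current = []
```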
Interaction information data between the base station and the vehicles is acquired, a number of features are extracted to represent this information, and the features are added to the data set; there are 41 data features in total. (The table listing the 41 features is provided as images in the original publication and is not reproduced here.)
In this embodiment, the correlations between features are calculated from the historical data set D'_k; redundant and irrelevant features are then removed, the remaining features are placed in the reduced feature set F', the removed features are placed in the candidate feature set CS, and redundant feature pairs (F_i, F_j) are placed in the redundant feature pair set PS.
Further, an initial reduced feature subset F', a candidate feature subset CS and a redundant feature pair set PS are obtained from the historical data set D'_k;
the reduced feature subset F' stores the features F_i satisfying δ_1 < ref(F_i, L) < δ_2, the candidate feature subset CS stores the redundant features F_i satisfying ref(F_i, L) > δ_2, and the redundant feature pair set PS stores the redundant feature pairs (F_i, F_j) satisfying ref(F_i, F_j) > δ_3 together with their ref(F_i, F_j) values;
ref(F_i, F_j) denotes the correlation between features F_i and F_j, and ref(F_i, L) denotes the correlation between feature F_i and the data item label L (the corresponding formulas are provided as images and not reproduced here); E(F_i) is the information entropy of F_i, I(F_i; F_j) is the mutual information between F_i and F_j, and |F| denotes the number of all features in the feature set.
Here δ_1, δ_2 and δ_3 are constants; in this example δ_1 = 0.6, δ_2 = 3.5 and δ_3 = 0.3. The features F_i and F_j refer to any two of the 41 features mentioned above, such as duration, protocol_type or service.
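The following sketch illustrates this pre-selection step. Because the patent's correlation formulas ref(F_i, L) and ref(F_i, F_j) are provided only as images, the sketch takes the correlation values as precomputed inputs; all function names and data structures are illustrative assumptions rather than the patent's implementation.
```python
def preselect_features(features, ref_label, ref_pair,
                       delta1=0.6, delta2=3.5, delta3=0.3):
    """Split the features into the reduced subset F', the candidate subset CS and
    the redundant feature pair set PS, using the thresholds from the embodiment.

    features:  list of feature names (e.g. the 41 features of the data set)
    ref_label: dict {F_i: ref(F_i, L)} of feature-label correlations
    ref_pair:  dict {(F_i, F_j): ref(F_i, F_j)} of pairwise correlations
    """
    F_prime, CS, PS = set(), set(), {}

    for f in features:
        r = ref_label.get(f, 0.0)
        if delta1 < r < delta2:      # moderately label-correlated: keep in F'
            F_prime.add(f)
        elif r > delta2:             # treated as redundant: goes to CS
            CS.add(f)
        # features with r <= delta1 are considered irrelevant and are dropped

    for pair, r in ref_pair.items():
        if r > delta3:               # strongly inter-correlated pair: record in PS
            PS[pair] = r

    return F_prime, CS, PS

# Example call with hypothetical correlation values:
# F_prime, CS, PS = preselect_features(
#     ["duration", "protocol_type", "service"],
#     {"duration": 1.2, "protocol_type": 4.0, "service": 0.4},
#     {("duration", "protocol_type"): 0.5})
```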
Step 120: construct the reinforcement learning model, including the model state space, the action space and the reward function.
This is the reinforcement learning model establishment stage: a reinforcement learning model is built according to the latent, predictable spatio-temporal pattern of traffic flow in the Internet of Vehicles, and a corresponding state space, action space and reward function are constructed for DDoS attacks in the Internet of Vehicles environment.
In the Internet of Vehicles, under normal (non-attack) conditions, some statistical characteristics of the traffic flow, such as the number of connections between the vehicles and the base station x_a, the average number of data bytes sent by the vehicles to the base station x_bit, the proportion of long connections among all connections x_occ, and the proportion of connections with "SYN" errors among all connections x_err, change smoothly, and these statistical characteristics change abruptly when the base station suffers a DDoS attack. In this application, in the reinforcement learning model establishment stage, Kalman filtering is introduced, based on the latent, predictable spatio-temporal pattern of traffic flow in the Internet of Vehicles, to predict the change of the state value at the next moment.
Specifically, the method comprises the following steps:
the state space in the reinforcement learning model is defined as:
s_(t,k) = {s_a, s_bit, s_occ, s_err}_(t,k)
where the subscript (t, k) denotes the t-th iteration within time period k, s_a denotes the number of connections between the vehicles and the base station in time period k-1, s_bit denotes the average number of data bytes sent by the vehicles to the base station in time period k-1, s_occ denotes the proportion of connections exceeding a time threshold e among all connections, and s_err denotes the proportion of connections with "SYN" errors among all connections.
During TCP connection establishment, the transport-layer TCP protocol is connection-oriented: normal data exchange can only take place after a connection has been established, and this connection establishment is the TCP three-way handshake. First, the requesting end (client) sends a TCP segment containing the "SYN" flag (SYN for synchronize); this synchronization segment indicates the port used by the client and the initial sequence number of the TCP connection. Second, after receiving the client's SYN segment, the server returns a SYN+ACK segment indicating that the client's request is accepted. Third, after receiving the server's segment, the client returns an acknowledgement segment ACK to the server, with the TCP sequence number incremented by one, and the TCP connection is completed. A "SYN" error occurs in this TCP three-way handshake process: in the connection establishment phase, after sending the SYN+ACK segment, the server never receives the client's acknowledgement (ACK) segment, so the connection cannot be established.
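To make the state concrete, the sketch below computes the four statistics s_a, s_bit, s_occ and s_err from a list of connection records of the previous period; the record field names, the long-connection threshold e and the SYN-error flag are illustrative assumptions.
```python
def compute_state(connections, time_threshold_e):
    """Build the state {s_a, s_bit, s_occ, s_err} from the previous period's
    connection records (field names are assumed for illustration)."""
    n = len(connections)
    if n == 0:
        return {"s_a": 0, "s_bit": 0.0, "s_occ": 0.0, "s_err": 0.0}

    s_a = n                                                   # number of connections
    s_bit = sum(c["src_bytes"] for c in connections) / n      # mean bytes vehicle -> base station
    s_occ = sum(c["duration"] > time_threshold_e for c in connections) / n
    s_err = sum(c["syn_error"] for c in connections) / n      # share of "SYN" errors
    return {"s_a": s_a, "s_bit": s_bit, "s_occ": s_occ, "s_err": s_err}
```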
The action space in the reinforcement learning model is defined over the sub-actions a_add, a_del, a_keep, c_del and c_keep (the action-space formula is provided as an image and not reproduced here), which are defined as follows (a code sketch of these sub-actions is given after this list):
(1) Action a_add indicates that the reinforcement learning agent selects a feature F_add from the candidate feature subset CS and adds it to the reduced feature subset F', where the feature F_add to be added is selected according to the UCB (Upper Confidence Bound) algorithm; F' ← F' ∪ {F_add} and CS ← CS \ {F_add} are then performed to update the reduced feature subset F' and the candidate feature subset CS. The selection rule for the feature F_add is as follows:
the UCB value of each feature is calculated (the UCB formula is provided as an image and not reproduced here), where N(F_i, t) denotes the total number of times feature F_i has been selected, and R(F_i, t) denotes the average reward obtained by selecting feature F_i over t iterations; R(F_i, t) is computed from the benefit obtained by adding feature F_i at each past iteration and from an indicator of whether feature F_i was selected at that iteration and added to the reduced feature set F' (the corresponding formulas are provided as images and not reproduced here). The benefit of adding a feature is proportional to the reward earned by executing an action in the action space, with a weight coefficient that measures the contribution of selecting feature F_i (action a_add) to the reward value r_{t+1}; this coefficient is usually set to 0.5. The feature with the largest UCB value is then taken as the feature F_add to be added.
(2) Action a_del indicates that the reinforcement learning agent removes a feature F_del from the reduced feature subset F' according to the probability P_d(F_i) and adds it to the candidate feature subset CS; F' ← F' \ {F_del} and CS ← CS ∪ {F_del} are then performed to update the reduced feature subset F' and the candidate feature subset CS, where the calculation rule for the probability P_d(F_i) is provided as an image and not reproduced here; for computational convenience, if a feature pair (F_i, F_j) is not recorded in the redundant feature pair set PS, ref(F_i, F_j) is taken as 0.
(3) Action a_keep indicates that the reinforcement learning agent keeps the current reduced feature subset F' and candidate feature subset CS unchanged.
(4) Action c_del indicates that the reinforcement learning agent randomly selects a cluster c_i and disconnects all connections in the cluster c_i.
(5) Action c_keep indicates that the reinforcement learning agent keeps the current clustering and the existing connections unchanged.
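The sub-actions above can be sketched as follows. Since the UCB score and the deletion probability P_d are given in the patent only as images, the standard UCB1-style score and a ref-sum-based deletion weight are used here purely as stand-ins; they should not be read as the patent's exact formulas.
```python
import math
import random

def ucb_score(avg_reward, times_selected, t, c=0.5):
    """Stand-in UCB value for a candidate feature (UCB1-like form, assumed)."""
    if times_selected == 0:
        return float("inf")
    return avg_reward + c * math.sqrt(2.0 * math.log(max(t, 2)) / times_selected)

def action_add(F_prime, CS, avg_reward, times_selected, t):
    """a_add: move the feature with the largest UCB value from CS into F'."""
    f_add = max(CS, key=lambda f: ucb_score(avg_reward.get(f, 0.0),
                                            times_selected.get(f, 0), t))
    F_prime.add(f_add)
    CS.discard(f_add)
    return f_add

def action_del(F_prime, CS, PS):
    """a_del: remove a feature from F' with probability proportional to its
    redundancy (sum of ref values over the pairs recorded in PS; 0 otherwise)."""
    feats = list(F_prime)
    weights = [sum(v for (fi, fj), v in PS.items() if f in (fi, fj)) for f in feats]
    if sum(weights) == 0:
        return None                  # nothing looks redundant, keep F' unchanged
    f_del = random.choices(feats, weights=weights, k=1)[0]
    F_prime.discard(f_del)
    CS.add(f_del)
    return f_del

def action_c_del(clusters, connection_ids):
    """c_del: randomly pick a cluster and disconnect all connections in it."""
    c_i = random.choice(list(clusters))
    dropped = set(clusters[c_i])
    return [cid for cid in connection_ids if cid not in dropped]
```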
The reward function r_{t+1} in the reinforcement learning model is established as follows:
the following Internet-of-Vehicles traffic flow state-space model is established as a linear time-invariant discrete system:
X_k = Γ X_{k-1} + w_k
Y_k = H X_k + v_k
where the system state vector X_k is composed of x_a, x_bit, x_occ and x_err (the formula is provided as an image and not reproduced here); here x_a denotes the number of connections between the vehicles and the base station, x_bit denotes the average number of data bytes sent by the vehicles to the base station, x_occ denotes the proportion of long connections among all connections, and x_err denotes the proportion of connections with "SYN" errors among all connections; the system measurement vector Y_k is defined correspondingly (the formula is provided as an image and not reproduced here); Γ is the state transition matrix, H is the measurement matrix, w_k denotes the process noise associated with the randomness of traffic flow fluctuations in the Internet of Vehicles and with prediction-model inaccuracy, and v_k denotes the measurement noise during data collection; w_k and v_k are assumed to be uncorrelated zero-mean Gaussian white noise processes whose covariance matrices are W_k and V_k respectively (the covariance formulas are provided as images and not reproduced here);
The predicted value X̂_{k|k-1} and the estimated value X̂_k are calculated as follows:
the Kalman filter prediction and update formulas are:
X̂_{k|k-1} = Γ X̂_{k-1}
P_{k|k-1} = Γ P_{k-1} Γ^T + W_{k-1}
G_k = P_{k|k-1} H^T (H P_{k|k-1} H^T + V_k)^(-1)
X̂_k = X̂_{k|k-1} + G_k (Y_k - H X̂_{k|k-1})
P_k = P_{k|k-1} - G_k H P_{k|k-1}
where X̂_{k|k-1} and X̂_k denote the predicted and estimated values of the system state; the estimate X̂_k combines the prediction X̂_{k|k-1} with the prediction error; G_k is the Kalman gain matrix, acting as a weighting factor; P_{k|k-1} denotes the a-priori estimation-error covariance matrix and P_k denotes the a-posteriori estimation-error covariance matrix;
the state vector estimate X̂_{(t+1,k)} of the t-th iteration of an episode within time period k is obtained from X̂_{(t,k)} and the measurement Y_{(t+1,k)} by the corresponding Kalman update step (the formula is provided as an image and not reproduced here), where Y_{(t+1,k)} is the measurement of the t-th iteration of the episode within time period k.
The state vector prediction X̂_{k|k-1} obtained by Kalman filtering from time period k-1 and the state vector estimate X̂_{(t,k)} of the t-th iteration of an episode within time period k are thus available. The reinforcement learning agent executes the corresponding action according to the current state and, after transferring to a new state, obtains the corresponding reward value r_{t+1} (the reward formula and its accompanying definition are provided as images and not reproduced here).
in the following steps, the subscript k is omitted and the subscript t is used instead of (t, k).
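A minimal NumPy sketch of the Kalman prediction and update steps used above is given below. The concrete Γ, H, W and V matrices are assumptions for illustration, and the reward function is only a plausible stand-in, since the patent's reward formula appears only as an image.
```python
import numpy as np

def kalman_predict(x_est, P, Gamma, W):
    """Prediction step: x_pred = Gamma @ x_est, P_pred = Gamma P Gamma^T + W."""
    x_pred = Gamma @ x_est
    P_pred = Gamma @ P @ Gamma.T + W
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, y, H, V):
    """Update step: fold the measurement y into the prediction."""
    S = H @ P_pred @ H.T + V
    G = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain G_k
    x_est = x_pred + G @ (y - H @ x_pred)      # estimate = prediction + weighted error
    P = P_pred - G @ H @ P_pred
    return x_est, P, G

def reward_stub(x_est, x_pred):
    """Illustrative stand-in only: reward the agent when the post-action state
    estimate is close to the Kalman prediction of normal traffic. The patent's
    actual reward formula is provided as an image and is not reproduced."""
    return -float(np.linalg.norm(x_est - x_pred))
```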
Step 130: according to the pre-selected feature set, use the reinforcement learning model to train the reinforcement learning agent through multiple iterations so as to adaptively select features suited to the current DDoS attack type, detect DDoS attack connections and disconnect them.
Finally, this is the stage of eliminating the DDoS attack using reinforcement learning. It is responsible for training the reinforcement learning agent according to the reinforcement learning model obtained in the previous stage, adaptively selecting attack features within a limited time and removing DDoS attack connections as far as possible, so as to eliminate the DDoS attack in the Internet of Vehicles. This stage mainly involves training a Q-learning agent to select features suited to the current DDoS attack type, which requires only a small amount of empirical knowledge to handle DDoS attacks of unknown type adaptively, and asynchronously training a DDQN (Double Deep Q-Network) agent to obtain the policy π_DDQN(s_t), which guides the Q-learning agent's action selection.
Training the Q-learning agent to select features suited to the current DDoS attack type comprises the following steps:
Step 131: select an action a_t according to the policy π_DDQN(s_t) and the policy π_Q(s_t).
The action a_t is selected as follows: the Q-learning agent selects an action a_t according to the policy π_DDQN(s_t), where the policy π_DDQN(s_t) is obtained by asynchronously training the DDQN agent; if, over several consecutive episodes of the current time period, the maximum cumulative reward no longer changes, the Q-learning agent stops using the policy π_DDQN(s_t) to select the action a_t, and during the remaining time the Q-learning agent selects the action a_t according to the policy π_Q(s_t) and the ε-greedy method.
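The switching rule of step 131 can be sketched as follows; the way stagnation of the maximum cumulative reward is detected, the patience of 3 episodes and the ε value are illustrative assumptions.
```python
import random

def select_action(state, q_table, ddqn_policy, actions,
                  max_reward_history, patience=3, epsilon=0.1):
    """Follow pi_DDQN until the maximum cumulative reward has not changed for
    `patience` consecutive episodes, then fall back to epsilon-greedy pi_Q."""
    stalled = (len(max_reward_history) > patience and
               len(set(max_reward_history[-patience:])) == 1)
    if not stalled:
        return ddqn_policy(state)               # action suggested by the DDQN agent
    if random.random() < epsilon:               # epsilon-greedy exploration
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```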
Step 132: execute the action a_t.
Specifically, the sub-action a_add, a_del or a_keep of action a_t is executed to update the feature sets F' and CS; the Q-learning agent then obtains the clustering result C_t = {c_1, c_2, ..., c_n, ...}_t of the data set D_t by applying the DBSCAN clustering algorithm on the reduced feature subset F', executes the sub-action c_del or c_keep of action a_t, and thereby obtains the updated data set D_{t+1}, the state s_{t+1} and the state estimate X̂_{t+1}.
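Step 132's clustering of the current connections over the reduced feature subset F' can be sketched with scikit-learn's DBSCAN; the eps and min_samples values are illustrative hyper-parameters, not values taken from the patent.
```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_connections(connections, reduced_features, eps=0.5, min_samples=5):
    """Cluster the data set D_t using only the features currently in F'.

    connections:      list of dicts mapping feature name -> numeric value
    reduced_features: iterable of feature names (the reduced subset F')
    Returns {cluster_label: [connection indices]}; label -1 is DBSCAN noise.
    """
    feats = sorted(reduced_features)
    X = np.array([[c[f] for f in feats] for c in connections], dtype=float)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(int(label), []).append(idx)
    return clusters
```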
Step 133: update the Q-table and the cumulative reward R_e.
Specifically, according to the obtained state s_t, the action a_t and the post-transition state s_{t+1}, the reward value r_{t+1} is obtained, the Q-table is updated according to the Q-learning algorithm and the cumulative reward R_e is updated. The Q-table update formulas are:
z_{t+1} = r_{t+1} + γ max_a Q(s_{t+1}, a)
Q(s_t, a_t) ← Q(s_t, a_t) + α (z_{t+1} - Q(s_t, a_t))
where α ∈ (0, 1) denotes the learning rate and γ ∈ [0, 1] denotes the discount factor;
the cumulative reward R_e is updated as:
R_e ← R_e + r_{t+1}
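The tabular update of step 133 can be written directly from the two formulas above; the dictionary-based Q-table and the default α and γ values are illustrative.
```python
def q_learning_update(q_table, s_t, a_t, r_next, s_next, actions,
                      alpha=0.1, gamma=0.9):
    """Apply z_{t+1} = r_{t+1} + gamma * max_a Q(s_{t+1}, a) and
    Q(s_t, a_t) <- Q(s_t, a_t) + alpha * (z_{t+1} - Q(s_t, a_t))."""
    best_next = max(q_table.get((s_next, a), 0.0) for a in actions)
    z = r_next + gamma * best_next
    q_old = q_table.get((s_t, a_t), 0.0)
    q_table[(s_t, a_t)] = q_old + alpha * (z - q_old)
    return q_table

def accumulate_reward(R_e, r_next):
    """R_e <- R_e + r_{t+1}."""
    return R_e + r_next
```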
and step 134, judging whether the current screen reaches the termination condition.
In particular, for each cm∈Ct+1If, if
Figure BDA0002261132150000171
Indicating that the current screen has reached an end condition, wherein,
Figure BDA0002261132150000172
and
Figure BDA0002261132150000173
respectively representing deletion clusters cmA set of clusters and system state estimates then remain.
Step 135: update the data set D_{k+1}, the maximum cumulative reward value and the system state estimate.
Specifically, after the current episode reaches the end condition, the cumulative reward R_e is compared with the maximum cumulative reward of the first e-1 episodes in order to obtain a better DDoS attack elimination result: if the cumulative reward exceeds that maximum, the Q-learning agent updates the data set D_{k+1}, the maximum cumulative reward value and the system state estimate; otherwise they remain unchanged.
In this embodiment, the symbol D_{k+1} denotes the final result of the current time period, i.e. the normal connections are stored in the data set D_{k+1}; D_{k+1} is initially an empty set and is updated at each iteration of the algorithm. In particular, D_{k+1} here does not denote the data set of the next, (k+1)-th, time period.
Step 136: determine whether Q(s, a) has converged or the end of the current time period has been reached.
When the action-state value function Q(s, a) converges (convergence here being that of the Q-learning algorithm) or the end of the current time period is reached, the Q-learning agent ends the whole iteration process. At this point, the connections in the data set D_{k+1} are treated as normal connections and the connections in D_k - D_{k+1} are treated as abnormal connections and are disconnected, thereby eliminating the DDoS attack; otherwise, the feature set F' and the candidate feature set CS are reset and the next episode is started. After the next time period arrives, the process is restarted.
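The end-of-period bookkeeping of steps 135 and 136 amounts to the simple set manipulation below; the convergence test on Q(s, a) is left abstract and all identifiers are illustrative.
```python
def update_best_episode(R_e, best_reward, D_episode, best_D, x_est, best_x):
    """Step 135: keep the data set and state estimate of the best episode so far."""
    if R_e > best_reward:
        return R_e, list(D_episode), x_est
    return best_reward, best_D, best_x

def finish_period(D_k, D_k_plus_1):
    """Step 136: split the period's connections into normal ones (kept) and
    suspected DDoS attack connections (to be disconnected)."""
    normal = set(D_k_plus_1)
    attack = set(D_k) - normal       # D_k - D_{k+1}
    return normal, attack
```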
The policy π_DDQN(s_t) is obtained by asynchronously training the DDQN agent, specifically as follows:
the DDQN agent is trained asynchronously to obtain the policy π_DDQN(s_t); "asynchronously" here means that the DDQN agent is trained in a separate process to obtain the policy π_DDQN(s_t), so that an asynchronous effect is achieved without disturbing the normal operation of the Q-learning algorithm. An over-bar is added to the original symbols (for example the state, the action and the reward) to distinguish them from the symbols used in the Q-learning algorithm; the symbols are otherwise defined as above.
The DDQN agent selects an action using the ε-greedy method and the policy π_DDQN(s_t) and executes that action in the same way as in step 132. The DDQN agent then transfers to a new state and obtains an immediate reward value. At this point, the DDQN agent stores the current transition (the over-barred state, action, reward and next state) in a replay buffer M, and then updates and optimizes the online neural network according to the DDQN algorithm using mini-batch gradient descent, where the loss function is defined with respect to the target value of the DDQN agent (the loss-function and target-value formulas are provided as images and not reproduced here); M_b denotes the number of data items in a mini-batch and γ ∈ [0, 1] is the discount factor. In addition, every τ steps the DDQN agent copies the value of the parameter θ to the target-network parameter.
When the termination condition is reached, the DDQN agent ends the current episode; iterative updating continues until the current time period ends, and the DDQN iterative updating process is restarted after time period k+1 arrives.
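A compact sketch of the asynchronous DDQN update is shown below. The replay-buffer capacity, batch size M_b, the treatment of the Q-networks as callables and the Double-DQN target form are written in the standard way as assumptions; the patent's own loss and target formulas are provided only as images.
```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Replay buffer M storing (state, action, reward, next_state) transitions."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def ddqn_targets(batch, q_online, q_target, gamma=0.9):
    """Double-DQN target (standard form, assumed): the online network chooses the
    next action, the target network evaluates it."""
    targets = []
    for s, a, r, s_next in batch:
        a_star = int(np.argmax(q_online(s_next)))               # argmax_a Q(s', a; theta)
        targets.append(r + gamma * q_target(s_next)[a_star])    # evaluated with theta^-
    return np.array(targets)

def ddqn_loss(batch, q_online, q_target, gamma=0.9):
    """Mean-squared mini-batch loss between Q(s, a; theta) and the DDQN target;
    minimizing it by mini-batch gradient descent updates the online network."""
    y = ddqn_targets(batch, q_online, q_target, gamma)
    q_sa = np.array([q_online(s)[a] for s, a, _, _ in batch])
    return float(np.mean((y - q_sa) ** 2))

# q_online and q_target are assumed to be callables mapping a state to a vector of
# action values; every tau steps the online parameters theta would be copied to the
# target network (e.g. target_net.load_state_dict(online_net.state_dict()) in PyTorch).
```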
The above DDoS attack elimination method based on the combination of Kalman filtering, Q-learning and the DDQN algorithm first obtains the initial reduced feature subset F', the candidate feature subset CS and the redundant feature pair set PS from the data set D'_k, and simultaneously obtains the prediction of the current system state using Kalman filtering. According to the current state s_t, the reinforcement learning agent executes the action a_t, transfers to a new state s_{t+1} and obtains a new reward value r_{t+1} based on the obtained prediction. After multiple iterations within a time period, the data set D_{k+1} corresponding to the maximum cumulative reward R_e is finally output. At this point, the connections contained in the data set D_{k+1} are the normal connections, while the connections in D_k - D_{k+1} are DDoS attack connections and are disconnected, thereby eliminating the DDoS attack in the Internet of Vehicles.
According to the feature self-adaptive reinforcement learning DDoS attack elimination method described above, a good reduced feature subset is first extracted from the collected historical data; a reinforcement learning model is then established according to the latent, predictable spatio-temporal pattern of traffic flow in the Internet of Vehicles; finally, according to the reinforcement learning model, a Q-learning agent is trained to select features suited to the current DDoS attack type, so that DDoS attacks of unknown type can be handled adaptively with only a small amount of empirical knowledge, while a DDQN agent is trained asynchronously to obtain the policy π_DDQN(s_t), which guides the Q-learning agent's action selection. By learning attack features adaptively, unknown types of DDoS attack in the Internet of Vehicles can be detected with only a small amount of prior knowledge and without dependence on labelled data, yielding a DDoS attack elimination method that meets the low-latency and high-accuracy requirements of the Internet of Vehicles and provides a new idea and a new detection method for the DDoS attack detection problem in the Internet of Vehicles.
Corresponding to the foregoing embodiment 1, an embodiment 2 of the present invention provides a system for eliminating a DDoS attack through feature adaptive reinforcement learning, where the system includes:
a data preprocessing module 210, configured to dynamically acquire, in the current k-th time period, interaction information data between the base station and the vehicles at a preset time interval, add the data to the historical data set D'_k and the current data set D_k respectively, and then pre-select features;
the model construction module 220 is used for constructing a reinforcement learning model, including the construction of a model state space, an action space and a reward function;
and a DDoS attack elimination module 230, configured to use the reinforcement learning model, according to the pre-selected feature set, to train the reinforcement learning agent through multiple iterations within a limited time so as to adaptively select features suited to the current DDoS attack type, detect DDoS attack connections and disconnect them.
The functions executed by each component in the feature adaptive reinforcement learning DDoS attack elimination system provided by the embodiment of the present invention have been described in detail in the above embodiment 1, and therefore, redundant description is not repeated here.
According to the feature self-adaptive reinforcement learning DDoS attack elimination system provided by the embodiment of the invention, a good reduced feature subset is first extracted from the collected historical data; a reinforcement learning model is then established according to the latent, predictable spatio-temporal pattern of traffic flow in the Internet of Vehicles; finally, according to the reinforcement learning model, a Q-learning agent is trained to select features suited to the current DDoS attack type, so that DDoS attacks of unknown type can be handled adaptively with only a small amount of empirical knowledge, while a DDQN agent is trained asynchronously to obtain the policy π_DDQN(s_t), which guides the Q-learning agent's action selection. By learning attack features adaptively, unknown types of DDoS attack in the Internet of Vehicles can be detected with only a small amount of prior knowledge and without dependence on labelled data, yielding a DDoS attack elimination method that meets the low-latency and high-accuracy requirements of the Internet of Vehicles and provides a new idea and a new detection method for the DDoS attack detection problem in the Internet of Vehicles.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (10)

1. A method for eliminating DDoS attack of feature adaptive reinforcement learning is characterized by comprising the following steps:
in the current k-th time period, dynamically acquiring interaction information data between the base station and the vehicles at a preset time interval and adding the data to the historical data set D'_k and the current data set D_k respectively, and then pre-selecting features;
constructing a reinforcement learning model, including the construction of a model state space, an action space and a reward function;
and according to the pre-selected feature set, using the reinforcement learning model to train a reinforcement learning agent through multiple iterations within a limited time so as to adaptively select features suited to the current DDoS attack type, detect DDoS attack connections and disconnect them.
2. The method for eliminating the DDoS attack of feature self-adaptive reinforcement learning according to claim 1, wherein dynamically acquiring interaction information data between the base station and the vehicles at a preset time interval in the current k-th time period, adding the data to the historical data set D'_k and the current data set D_k respectively, and then pre-selecting features specifically comprises:
obtaining an initial reduced feature subset F', a candidate feature subset CS and a redundant feature pair set PS from the historical data set D'_k;
the reduced feature subset F' stores the features F_i satisfying δ_1 < ref(F_i, L) < δ_2, the candidate feature subset CS stores the redundant features F_i satisfying ref(F_i, L) > δ_2, and the redundant feature pair set PS stores the redundant feature pairs (F_i, F_j) satisfying ref(F_i, F_j) > δ_3 together with their ref(F_i, F_j) values;
ref(F_i, F_j) denotes the correlation between features F_i and F_j, and ref(F_i, L) denotes the correlation between feature F_i and the data item label L (the corresponding formulas are provided as images and not reproduced here); E(F_i) is the information entropy of F_i, I(F_i; F_j) is the mutual information between F_i and F_j, and |F| denotes the number of all features in the feature set.
3. The method for eliminating the feature adaptive reinforcement learning DDoS attack according to claim 2, wherein the state space in the reinforcement learning model is defined as:
s_(t,k) = {s_a, s_bit, s_occ, s_err}_(t,k)
where the subscript (t, k) denotes the t-th iteration within time period k, s_a denotes the number of connections between the vehicles and the base station in time period k-1, s_bit denotes the average number of data bytes sent by the vehicles to the base station in time period k-1, s_occ denotes the proportion of connections exceeding a time threshold e among all connections, and s_err denotes the proportion of connections with "SYN" errors among all connections.
4. The method for eliminating the feature adaptive reinforcement learning DDoS attack according to claim 2, wherein the action space in the reinforcement learning model is defined over the sub-actions a_add, a_del, a_keep, c_del and c_keep (the action-space formula is provided as an image and not reproduced here), which are defined as follows:
action a_add indicates that the reinforcement learning agent selects a feature F_add from the candidate feature subset CS according to the Upper Confidence Bound (UCB) algorithm and adds it to the reduced feature subset F'; F' ← F' ∪ {F_add} and CS ← CS \ {F_add} are then performed to update the reduced feature subset F' and the candidate feature subset CS;
action a_del indicates that the reinforcement learning agent removes a feature F_del from the reduced feature subset F' according to the probability P_d(F_i) and adds it to the candidate feature subset CS; F' ← F' \ {F_del} and CS ← CS ∪ {F_del} are then performed to update the reduced feature subset F' and the candidate feature subset CS, where the calculation rule for the probability P_d(F_i) is provided as an image and not reproduced here; for computational convenience, if a feature pair (F_i, F_j) is not recorded in the redundant feature pair set PS, ref(F_i, F_j) is taken as 0;
action a_keep indicates that the reinforcement learning agent keeps the current reduced feature subset F' and candidate feature subset CS unchanged;
action c_del indicates that the reinforcement learning agent randomly selects a cluster c_i and disconnects all connections in the cluster c_i;
action c_keep indicates that the reinforcement learning agent keeps the current clustering and the existing connections unchanged.
5. The method for eliminating the DDoS attack in the feature self-adaptive reinforcement learning according to claim 2, wherein the reward function in the reinforcement learning model is established as follows:
the following Internet-of-Vehicles traffic flow state-space model is established as a linear time-invariant discrete system:
X_k = Γ X_{k-1} + w_k
Y_k = H X_k + v_k
where X_k is the system state vector, Y_k is the system measurement vector, Γ is the state transition matrix, H is the measurement matrix, w_k denotes the process noise associated with the randomness of traffic flow fluctuations in the Internet of Vehicles and with prediction-model inaccuracy, and v_k denotes the measurement noise during data collection; w_k and v_k are assumed to be uncorrelated zero-mean Gaussian white noise processes whose covariance matrices are W_k and V_k respectively (the covariance formulas are provided as images and not reproduced here);
the state vector prediction X̂_{k|k-1} obtained by Kalman filtering from time period k-1 and the state vector estimate X̂_{(t,k)} of the t-th iteration of an episode within time period k are obtained;
the reinforcement learning agent executes the corresponding action according to the current state and, after transferring to a new state, obtains the corresponding reward value r_{t+1} (the reward formula and its accompanying definition are provided as images and not reproduced here).
6. The method for eliminating the feature adaptive reinforcement learning DDoS attack according to claim 5, wherein the predicted value X̂_{k|k-1} and the estimated value X̂_k are calculated as follows:
the Kalman filter prediction and update formulas are:
X̂_{k|k-1} = Γ X̂_{k-1}
P_{k|k-1} = Γ P_{k-1} Γ^T + W_{k-1}
G_k = P_{k|k-1} H^T (H P_{k|k-1} H^T + V_k)^(-1)
X̂_k = X̂_{k|k-1} + G_k (Y_k - H X̂_{k|k-1})
P_k = P_{k|k-1} - G_k H P_{k|k-1}
where X̂_{k|k-1} and X̂_k denote the predicted and estimated values of the system state; the estimate X̂_k combines the prediction X̂_{k|k-1} with the prediction error; G_k is the Kalman gain matrix, acting as a weighting factor; P_{k|k-1} denotes the a-priori estimation-error covariance matrix and P_k denotes the a-posteriori estimation-error covariance matrix;
the state vector estimate X̂_{(t+1,k)} of the t-th iteration of an episode within time period k is obtained from X̂_{(t,k)} and the measurement Y_{(t+1,k)} by the corresponding Kalman update step (the formula is provided as an image and not reproduced here), where Y_{(t+1,k)} is the measurement of the t-th iteration of the episode within time period k.
7. The method for eliminating feature adaptive reinforcement learning DDoS attack according to claim 2, characterized by being according to a historical data set D'kAnd a current data set DkThe reinforcement learning model is used for adaptively selecting the characteristics suitable for the current DDoS attack type in a limited time through a plurality of times of iterative training reinforcement learning agent, detecting DDoS attack connection and disconnection, and specifically comprises the following steps:
selecting an action a_t according to the policy π_DDQN(s_t) and the policy π_Q(s_t);

performing the sub-actions a_add, a_del and a_keep of action a_t to update the feature set F' and the candidate feature set CS; the Q-learning agent obtains the clustering result C_t = {c_1, c_2, ..., c_n, ...}_t of the data set D_t according to the reduced feature subset F, then performs the sub-action c_del or c_keep of action a_t, and thereby obtains the updated data set D_{t+1}, the state s_{t+1} and the corresponding state estimate;
according to the obtained state s_t, the action a_t and the post-transition state s_{t+1}, obtaining the reward value r_{t+1}, updating the Q-table according to the Q-learning algorithm, and updating the cumulative reward R_e; here, the Q-table update formula is:

z_{t+1} = r_{t+1} + γ max_a Q(s_{t+1}, a)

Q(s_t, a_t) ← Q(s_t, a_t) + α (z_{t+1} - Q(s_t, a_t))

wherein α ∈ (0, 1) denotes the learning rate and γ ∈ [0, 1] denotes the discount factor;

the cumulative reward is updated as:

R_e ← R_e + r_{t+1}
for each c_m ∈ C_{t+1}: if the end condition holds (the condition, given as an equation image in the original, is expressed in terms of the set of remaining clusters and the system state estimate obtained after deleting cluster c_m), the current episode has reached its end condition;
after the current episode reaches its end condition, the cumulative reward R_e is compared with the maximum cumulative reward of the first e-1 episodes so as to obtain a better DDoS attack elimination result; if the cumulative reward R_e exceeds that maximum, the Q-learning agent updates the data set D_{k+1}, the maximum cumulative reward value and the system state estimate; otherwise, they are kept unchanged;
when the action-state value function Q(s, a) converges or the end of the current time period is reached, the Q-learning agent ends the whole iterative process; at this point the data set D_{k+1} is treated as the normal connections and D_k - D_{k+1} as the DDoS attack connections to be disconnected; otherwise, the feature set F' and the candidate feature set CS are reset and the next episode is started, and the whole process restarts once the next time period arrives.
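A sketch of the tabular Q-learning loop in claim 7 above, with ε-greedy selection, the z_{t+1} target and cumulative-reward bookkeeping. The environment object (reset/step/actions) is hypothetical glue code, not defined by the patent, and the DDQN-guided action selection of claim 8 is omitted here.

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Run one episode of tabular Q-learning and return its cumulative reward R_e."""
    state = env.reset()
    R_e = 0.0
    done = False
    while not done:
        actions = env.actions(state)
        # epsilon-greedy selection over the Q-table
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = env.step(action)
        # z_{t+1} = r_{t+1} + gamma * max_a Q(s_{t+1}, a)
        z = reward + gamma * max((Q[(next_state, a)] for a in env.actions(next_state)),
                                 default=0.0)
        Q[(state, action)] += alpha * (z - Q[(state, action)])
        R_e += reward
        state = next_state
    return R_e

# usage sketch: keep only the best episode's result per time period
# Q = defaultdict(float)
# best = max(q_learning_episode(env, Q) for _ in range(num_episodes))
```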
8. The feature-adaptive reinforcement learning DDoS attack elimination method according to claim 7, wherein the action a_t is selected according to the following rules:

the Q-learning agent selects an action a_t according to the policy π_DDQN(s_t), which is obtained by asynchronously training the DDQN agent; if, over multiple consecutive episodes in the current time period, the maximum cumulative reward remains unchanged, the Q-learning agent stops using the policy π_DDQN(s_t) to select action a_t, and during the remaining time the Q-learning agent follows the policy π_Q(s_t) together with the ε-greedy method to select action a_t.
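A small sketch of the switching rule just described: keep using π_DDQN while it still improves the best cumulative reward, otherwise fall back to the ε-greedy π_Q policy for the rest of the time period. The patience threshold (how many unchanged episodes count as "multiple") is an assumption, since the patent does not fix a number.

```python
def choose_policy(episode_rewards, patience=3):
    """Return which policy the Q-learning agent should follow next."""
    if len(episode_rewards) <= patience:
        return "pi_DDQN"
    best_before = max(episode_rewards[:-patience])
    best_recent = max(episode_rewards[-patience:])
    # no new maximum cumulative reward over `patience` consecutive episodes
    return "pi_Q" if best_recent <= best_before else "pi_DDQN"
```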
9. The method according to claim 8, wherein the policy π_DDQN(s_t) is obtained by asynchronously training the DDQN agent, specifically comprising:

the DDQN agent is trained asynchronously to obtain the policy π_DDQN(s_t); asynchronous here means that the DDQN agent is trained in a separate process, so that the asynchronous effect is achieved and the normal operation of the Q-learning algorithm is not disturbed;
method and strategy pi for DDQN agent to use epsilon-greedyDDQN(st) Selection actions
Figure FDA0002261132140000052
And perform actions
Figure FDA0002261132140000053
DDQN agent transitions to a new state
Figure FDA0002261132140000054
And obtain an immediate prize value
Figure FDA0002261132140000055
At this time, the DDQN intelligence body will
Figure FDA0002261132140000056
Storing the data into a playback buffer M, and then updating and optimizing the neural network according to a DDQN algorithm by adopting a small-batch gradient descent method
Figure FDA0002261132140000057
Wherein the loss function is defined as:
Figure FDA0002261132140000058
wherein M isbIndicating the size of the number of data items for a batch process,
Figure FDA0002261132140000059
is the target value for the DDQN agent:
Figure FDA00022611321400000510
wherein γ ∈ [0, 1] is the discount factor; in addition, every τ steps the DDQN agent copies the value of the parameters θ into the target-network parameters θ⁻;
when the termination condition is reached, the DDQN agent ends the current episode; the iterative updating continues until the current time period ends, and the DDQN iterative update process restarts after the (k+1)-th time period arrives.
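A hedged PyTorch sketch of one mini-batch DDQN update as described in claim 9: the online network chooses the best next action, the target network evaluates it, the mean-squared loss over M_b sampled transitions is minimized, and θ is copied into the target parameters every τ steps. The network classes, buffer layout and hyper-parameter values are assumptions for the example.

```python
import random
import torch
import torch.nn.functional as F

def ddqn_update(online_net, target_net, optimizer, replay_buffer,
                step, batch_size=32, gamma=0.9, tau=100):
    """One mini-batch DDQN update; returns the scalar loss value."""
    batch = random.sample(replay_buffer, batch_size)   # (s, a, r, s_next, done) tuples
    s      = torch.stack([torch.as_tensor(b[0], dtype=torch.float32) for b in batch])
    a      = torch.tensor([b[1] for b in batch], dtype=torch.long)
    r      = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s_next = torch.stack([torch.as_tensor(b[3], dtype=torch.float32) for b in batch])
    done   = torch.tensor([b[4] for b in batch], dtype=torch.float32)

    # Double-DQN target: action chosen by the online net, value taken from the target net
    with torch.no_grad():
        best_a = online_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, best_a).squeeze(1)
        y = r + gamma * (1.0 - done) * q_next

    q = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, y)          # (1/M_b) * sum of squared target errors
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # every tau steps, copy theta into the target-network parameters
    if step % tau == 0:
        target_net.load_state_dict(online_net.state_dict())
    return loss.item()
```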
10. A feature-adaptive reinforcement learning DDoS attack elimination system, the system comprising:

a data preprocessing module, configured to dynamically obtain interaction information data between the base station and the vehicles at a preset time interval in the current k-th time period, add the interaction information data to the historical data set D'_k and the current data set D_k, and then pre-select features;

a model building module, configured to build the reinforcement learning model, comprising its state space, action space and reward function;

and a DDoS attack elimination module, configured to use the reinforcement learning model, according to the pre-selected feature set, to adaptively select within a limited time, through multiple iterations of training the reinforcement learning agent, the features suited to the current DDoS attack type, and to detect and disconnect DDoS attack connections.
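A skeleton of the three modules named in claim 10, for orientation only; the module interfaces and names are illustrative, since the claim specifies responsibilities rather than code.

```python
class DDoSEliminationSystem:
    """Three-module structure of the claimed system (sketch)."""

    def preprocess(self, interactions, history_dk, current_dk):
        # data preprocessing module: append the base-station/vehicle interaction
        # data collected at the preset interval in period k to D'_k and D_k,
        # then pre-select features
        history_dk.extend(interactions)
        current_dk.extend(interactions)
        return self._preselect_features(current_dk)

    def build_model(self):
        # model building module: define the state space, the action space and
        # the reward function of the reinforcement learning model
        raise NotImplementedError

    def eliminate(self, preselected_features):
        # DDoS attack elimination module: iteratively train the agent, adaptively
        # select features suited to the current attack type within a limited time,
        # and detect and disconnect DDoS attack connections
        raise NotImplementedError

    def _preselect_features(self, data):
        raise NotImplementedError
```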
CN201911071642.XA 2019-11-05 2019-11-05 Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning Active CN110958135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911071642.XA CN110958135B (en) 2019-11-05 2019-11-05 Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911071642.XA CN110958135B (en) 2019-11-05 2019-11-05 Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning

Publications (2)

Publication Number Publication Date
CN110958135A true CN110958135A (en) 2020-04-03
CN110958135B CN110958135B (en) 2021-07-13

Family

ID=69976608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911071642.XA Active CN110958135B (en) 2019-11-05 2019-11-05 Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning

Country Status (1)

Country Link
CN (1) CN110958135B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814988A (en) * 2020-07-07 2020-10-23 北京航空航天大学 Testing method of multi-agent cooperative environment reinforcement learning algorithm
CN112101556A (en) * 2020-08-25 2020-12-18 清华大学 Method and device for identifying and removing redundant information in environment observation quantity
CN112256739A (en) * 2020-11-12 2021-01-22 同济大学 Method for screening data items in dynamic flow big data based on multi-armed bandit
CN112365048A (en) * 2020-11-09 2021-02-12 大连理工大学 Unmanned vehicle reconnaissance method based on opponent behavior prediction
CN112435275A (en) * 2020-12-07 2021-03-02 中国电子科技集团公司第二十研究所 Unmanned aerial vehicle maneuvering target tracking method integrating Kalman filtering and DDQN algorithm
CN112446470A (en) * 2020-11-12 2021-03-05 北京工业大学 Reinforced learning method for coherent synthesis
CN112637814A (en) * 2021-01-27 2021-04-09 桂林理工大学 DDoS attack defense method based on trust management
CN112670982A (en) * 2020-12-14 2021-04-16 广西电网有限责任公司电力科学研究院 Active power scheduling control method and system for micro-grid based on reward mechanism
CN113055384A (en) * 2021-03-12 2021-06-29 周口师范学院 SSDDQN network abnormal flow detection method
CN113128705A (en) * 2021-03-24 2021-07-16 北京科技大学顺德研究生院 Intelligent agent optimal strategy obtaining method and device
CN114374541A (en) * 2021-12-16 2022-04-19 四川大学 Abnormal network flow detector generation method based on reinforcement learning
CN115840363A (en) * 2022-12-06 2023-03-24 上海大学 Denial of service attack method for remote state estimation of cyber-physical system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240171979A1 (en) * 2021-07-15 2024-05-23 Telefonaktiebolaget Lm Ericsson (Publ) Detecting anomalous behaviour in an edge communication network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108574668A (en) * 2017-03-10 2018-09-25 北京大学 A kind of ddos attack peak flow prediction technique based on machine learning
US20190061147A1 (en) * 2016-04-27 2019-02-28 Neurala, Inc. Methods and Apparatus for Pruning Experience Memories for Deep Neural Network-Based Q-Learning
CN109639515A (en) * 2019-02-16 2019-04-16 北京工业大学 Ddos attack detection method based on hidden Markov and Q study cooperation
CN110401675A (en) * 2019-08-20 2019-11-01 绍兴文理学院 Uncertain ddos attack defence method under a kind of sensing cloud environment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190061147A1 (en) * 2016-04-27 2019-02-28 Neurala, Inc. Methods and Apparatus for Pruning Experience Memories for Deep Neural Network-Based Q-Learning
CN108574668A (en) * 2017-03-10 2018-09-25 北京大学 A kind of ddos attack peak flow prediction technique based on machine learning
CN109639515A (en) * 2019-02-16 2019-04-16 北京工业大学 Ddos attack detection method based on hidden Markov and Q study cooperation
CN110401675A (en) * 2019-08-20 2019-11-01 绍兴文理学院 Uncertain ddos attack defence method under a kind of sensing cloud environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhu Zhenguo et al.: "Feature Selection Algorithm Based on Reinforcement Learning", Computer Systems & Applications *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814988A (en) * 2020-07-07 2020-10-23 北京航空航天大学 Testing method of multi-agent cooperative environment reinforcement learning algorithm
CN112101556B (en) * 2020-08-25 2021-08-10 清华大学 Method and device for identifying and removing redundant information in environment observation quantity
CN112101556A (en) * 2020-08-25 2020-12-18 清华大学 Method and device for identifying and removing redundant information in environment observation quantity
CN112365048A (en) * 2020-11-09 2021-02-12 大连理工大学 Unmanned vehicle reconnaissance method based on opponent behavior prediction
CN112256739A (en) * 2020-11-12 2021-01-22 同济大学 Method for screening data items in dynamic flow big data based on multi-armed bandit
CN112446470A (en) * 2020-11-12 2021-03-05 北京工业大学 Reinforced learning method for coherent synthesis
CN112446470B (en) * 2020-11-12 2024-05-28 北京工业大学 Reinforced learning method for coherent synthesis
CN112256739B (en) * 2020-11-12 2022-11-18 同济大学 Method for screening data items in dynamic flow big data based on multi-armed bandit
CN112435275A (en) * 2020-12-07 2021-03-02 中国电子科技集团公司第二十研究所 Unmanned aerial vehicle maneuvering target tracking method integrating Kalman filtering and DDQN algorithm
CN112670982B (en) * 2020-12-14 2022-11-08 广西电网有限责任公司电力科学研究院 Active power scheduling control method and system for micro-grid based on reward mechanism
CN112670982A (en) * 2020-12-14 2021-04-16 广西电网有限责任公司电力科学研究院 Active power scheduling control method and system for micro-grid based on reward mechanism
CN112637814A (en) * 2021-01-27 2021-04-09 桂林理工大学 DDoS attack defense method based on trust management
CN113055384A (en) * 2021-03-12 2021-06-29 周口师范学院 SSDDQN network abnormal flow detection method
CN113128705A (en) * 2021-03-24 2021-07-16 北京科技大学顺德研究生院 Intelligent agent optimal strategy obtaining method and device
CN113128705B (en) * 2021-03-24 2024-02-09 北京科技大学顺德研究生院 Method and device for acquiring intelligent agent optimal strategy
CN114374541A (en) * 2021-12-16 2022-04-19 四川大学 Abnormal network flow detector generation method based on reinforcement learning
CN115840363A (en) * 2022-12-06 2023-03-24 上海大学 Denial of service attack method for remote state estimation of cyber-physical system
CN115840363B (en) * 2022-12-06 2024-05-10 上海大学 Denial of service attack method aiming at remote state estimation of cyber-physical system

Also Published As

Publication number Publication date
CN110958135B (en) 2021-07-13

Similar Documents

Publication Publication Date Title
CN110958135B (en) Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning
CN112181971B (en) Edge-based federated learning model cleaning and equipment clustering method and system
EP3563554B1 (en) System and method for detecting unknown iot device types by monitoring their behavior
Heidari et al. Tight Policy Regret Bounds for Improving and Decaying Bandits.
US20190056983A1 (en) It system fault analysis technique based on configuration management database
CN111092823A (en) Method and system for adaptively adjusting congestion control initial window
CN115943382A (en) Method and apparatus for defending against adversarial attacks on a federated learning system
CN110166344B (en) Identity identification method, device and related equipment
CN110890930A (en) Channel prediction method and related equipment
EP3430767B1 (en) Method and device for real-time network event processing
Dong et al. Secure distributed on-device learning networks with byzantine adversaries
CN114065863A (en) Method, device and system for federal learning, electronic equipment and storage medium
Yan et al. Gaussian process reinforcement learning for fast opportunistic spectrum access
CN107257365B (en) A kind of data download processing method and device
CN112422546A (en) Network anomaly detection method based on variable neighborhood algorithm and fuzzy clustering
CN111491300A (en) Risk detection method, device, equipment and storage medium
CN103455525B (en) The method and apparatus of popularization account number state is determined based on the search popularization behavior of user
CN102271348B (en) Link quality estimation system and method for cyber physical system
CN109583203B (en) Malicious user detection method, device and system
Abolhassani et al. SwiftCache: Model-Based Learning for Dynamic Content Caching in CDNs
Li et al. Adversarial Distributional Reinforcement Learning against Extrapolated Generalization
CN115174130B (en) AGV semantic attack detection method based on HMM
CN111400031B (en) Value function-based reinforcement learning method for processing unit deployment
CN116757726A (en) User network screening method and device
CN116432941A (en) Data long tail characteristic-based method and system for discovering Sybil defense true value

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant