CN110958135A - Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning

Info

Publication number
CN110958135A
Authority
CN
China
Prior art keywords
feature
reinforcement learning
ddqn
action
ddos attack
Prior art date
Legal status
Granted
Application number
CN201911071642.XA
Other languages
Chinese (zh)
Other versions
CN110958135B (en)
Inventor
李重
孔玉波
邵浩
吴梅梅
庄慧敏
Current Assignee
Donghua University
Original Assignee
Donghua University
Priority date
Filing date
Publication date
Application filed by Donghua University
Priority to CN201911071642.XA
Publication of CN110958135A
Application granted
Publication of CN110958135B
Legal status: Active
Anticipated expiration

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 - Network architectures or network communication protocols for network security
    • H04L63/14 - Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 - Countermeasures against malicious traffic
    • H04L63/1458 - Denial of Service
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 - Network analysis or design
    • H04L41/145 - Network analysis or design involving simulating, designing, planning or modelling of a network
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/12 - Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiment of the invention discloses a method and a system for feature self-adaptive reinforcement learning DDoS attack elimination. A reduced feature subset is first extracted from collected historical data; a reinforcement learning model is then established according to the latent, predictable spatio-temporal pattern of traffic flow in the Internet of Vehicles; a Q-learning agent trained with the reinforcement learning model selects features suited to the current DDoS attack type, while a DDQN agent is trained asynchronously to obtain the policy π_DDQN(s_t), which guides the Q-learning agent's action selection. By learning attack features adaptively, unknown DDoS attacks in the Internet of Vehicles can be detected with only a small amount of prior knowledge and without dependence on labelled data, yielding a DDoS attack elimination method that meets the low-latency and high-accuracy requirements of the Internet of Vehicles.

Description

Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning
Technical Field
The embodiment of the invention relates to the technical field of vehicle networking, in particular to a method and a system for eliminating DDoS (distributed denial of service) attack in feature adaptive reinforcement learning.
Background
With the development of 5G technology, Mobile Edge Computing (MEC) has been introduced into the Internet of Vehicles to meet the requirement of real-time data processing: each base station acts as an MEC service station, which reduces information-forwarding time and additional network operations. However, base stations are also vulnerable to DDoS (Distributed Denial of Service) attacks launched by vehicles. Such attackers can send data far exceeding the processing capacity of the MEC service station, consuming its resources and causing denial of service, at which point normal vehicles can no longer establish connections with the base station. In Internet of Vehicles applications, the high mobility of vehicles makes DDoS attacks difficult to capture in real time. Moreover, the types of DDoS attack have diversified in recent years, and it is difficult to effectively distinguish attack connections from normal connections using a fixed feature set and historical experience. An intelligent method that can effectively detect DDoS attacks in the Internet of Vehicles environment is therefore needed.
Much existing research on DDoS detection is based on supervised or unsupervised learning. Supervised approaches require the attack feature library to be continuously updated, so they are inflexible and cannot respond to novel attacks in time. Unsupervised approaches can handle many attack types but sometimes produce false alarms, and their performance depends heavily on manually selected features, i.e. on the designer's prior knowledge. Existing reinforcement-learning-based methods are likewise still constrained by prior knowledge and can only detect a single flow at a time, which introduces considerable delay. To date, no research has been dedicated to DDoS attack detection in the Internet of Vehicles. An intelligent DDoS attack detection strategy is therefore needed in the Internet of Vehicles environment that learns attack features adaptively and handles abnormal connections in a distributed, batch fashion.
Disclosure of Invention
Therefore, the embodiment of the invention provides a method and a system for eliminating a DDoS attack by feature adaptive reinforcement learning, so as to solve the problems of prior knowledge constraint, large time delay and low detection accuracy of a DDoS attack detection method in the existing vehicle networking environment.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
according to a first aspect of the embodiments of the present invention, a method for eliminating a feature adaptive reinforcement learning DDoS attack is provided, where the method includes:
in the current k-th time period, dynamically acquiring interaction information data between the base station and the vehicles at a preset time interval and adding the data to the historical data set D'_k and the current data set D_k respectively, and then pre-selecting features;
constructing a reinforcement learning model, including the construction of a model state space, an action space and a reward function;
and according to the pre-selected feature set, using the reinforcement learning model to train a reinforcement learning agent through multiple iterations within a limited time so as to adaptively select features suited to the current DDoS attack type, detect DDoS attack connections and disconnect them.
Further, in the current k-th time period, dynamically acquiring interaction information data between the base station and the vehicles at a preset time interval, adding the data to the historical data set D'_k and the current data set D_k respectively, and then pre-selecting features specifically includes:
obtaining an initial reduced feature subset F', a candidate feature subset CS and a redundant feature pair set PS from the historical data set D'_k;
the reduced feature subset F' stores the features F_i satisfying δ_1 < ref(F_i, L) < δ_2, the candidate feature subset CS stores the redundant features F_i satisfying ref(F_i, L) > δ_2, and the redundant feature pair set PS stores the redundant feature pairs (F_i, F_j) satisfying ref(F_i, F_j) > δ_3 together with their ref(F_i, F_j) values;
ref(F_i, F_j) denotes the correlation between features F_i and F_j, and ref(F_i, L) denotes the correlation between feature F_i and the data item label L (the corresponding formulas are provided as images and are not reproduced here); E(F_i) is the information entropy of F_i, I(F_i; F_j) is the mutual information between F_i and F_j, and |F| denotes the number of all features in the feature set.
Further, the state space in the reinforcement learning model is defined as:
s_(t,k) = {s_a, s_bit, s_occ, s_err}_(t,k)
where the subscript (t, k) denotes the t-th iteration within time period k, s_a denotes the number of connections between the vehicles and the base station in time period k-1, s_bit denotes the average number of data bytes sent by the vehicles to the base station in time period k-1, s_occ denotes the proportion of connections exceeding a time threshold e among all connections, and s_err denotes the proportion of connections with "SYN" errors among all connections.
Further, the action space in the reinforcement learning model is defined over the sub-actions a_add, a_del, a_keep, c_del and c_keep (the action-space formula is provided as an image and not reproduced here), which are defined as follows:
action a_add indicates that the reinforcement learning agent selects a feature F_add from the candidate feature subset CS according to the Upper Confidence Bound (UCB) algorithm and adds it to the reduced feature subset F'; F' ← F' ∪ {F_add} and CS ← CS \ {F_add} are then performed to update the reduced feature subset F' and the candidate feature subset CS;
action a_del indicates that the reinforcement learning agent removes a feature F_del from the reduced feature subset F' according to the probability P_d(F_i) and adds it to the candidate feature subset CS; F' ← F' \ {F_del} and CS ← CS ∪ {F_del} are then performed to update the reduced feature subset F' and the candidate feature subset CS, where the calculation rule for the probability P_d(F_i) is provided as an image and not reproduced here;
for computational convenience, if a feature pair (F_i, F_j) is not recorded in the redundant feature pair set PS, ref(F_i, F_j) is taken as 0;
action a_keep indicates that the reinforcement learning agent keeps the current reduced feature subset F' and candidate feature subset CS unchanged;
action c_del indicates that the reinforcement learning agent randomly selects a cluster c_i and disconnects all connections in the cluster c_i;
action c_keep indicates that the reinforcement learning agent keeps the current clustering and the existing connections unchanged.
Further, the reward function in the reinforcement learning model is established as follows:
the following Internet-of-Vehicles traffic flow state-space model is established as a linear time-invariant discrete system:
X_k = Γ X_{k-1} + w_k
Y_k = H X_k + v_k
where X_k is the system state vector, Y_k is the system measurement vector, Γ is the state transition matrix, H is the measurement matrix, w_k denotes the process noise associated with the randomness of traffic flow fluctuations in the Internet of Vehicles and with prediction-model inaccuracy, and v_k denotes the measurement noise during data collection; w_k and v_k are assumed to be uncorrelated zero-mean Gaussian white noise processes whose covariance matrices are W_k and V_k respectively (the covariance formulas are provided as images and not reproduced here);
the state vector prediction X̂_{k|k-1} obtained by Kalman filtering from time period k-1 and the state vector estimate X̂_{(t,k)} of the t-th iteration of an episode within time period k are obtained;
the reinforcement learning agent executes the corresponding action according to the current state and, after transferring to a new state, obtains the corresponding reward value r_{t+1} (the reward formula and its accompanying definition are provided as images and not reproduced here).
further, the predicted value
Figure BDA0002261132150000046
And the estimated value
Figure BDA0002261132150000047
The calculation method comprises the following steps:
the kalman filter prediction and update formula is as follows:
Figure BDA0002261132150000048
Pk|k-1=ΓPk-1ΓT+Wk-1
Gk=Pk|k-1HT(HPk|k-1HT+Vk)-1
Figure BDA0002261132150000049
Pk=Pk|k-1-GkHPk|k-1
wherein the content of the first and second substances,
Figure BDA0002261132150000051
and
Figure BDA0002261132150000052
predicted and estimated values representing the state of the system, estimated values
Figure BDA0002261132150000053
Including predictive value
Figure BDA0002261132150000054
And prediction error
Figure BDA0002261132150000055
GkAs a weighting factor, representing a Kalman gain matrix, Pk|k-1Representing the prior estimation error covariance matrix, PkRepresenting a covariance matrix of a posterior estimation error;
state vector estimation for the t-th iteration in a curtain over a k time period
Figure BDA0002261132150000056
Comprises the following steps:
Figure BDA0002261132150000057
wherein, Y(t+1,k)Is the measurement of the t-th iteration in one screen in the k period.
Further, according to the historical data set D'_k and the current data set D_k, using the reinforcement learning model to train the reinforcement learning agent through multiple iterations within a limited time so as to adaptively select features suited to the current DDoS attack type, detect DDoS attack connections and disconnect them specifically includes:
selecting an action a_t according to the policy π_DDQN(s_t) and the policy π_Q(s_t);
executing the sub-action a_add, a_del or a_keep of action a_t to update the feature sets F' and CS; the Q-learning agent obtains the clustering result C_t = {c_1, c_2, ..., c_n, ...}_t of the data set D_t according to the reduced feature subset F', then executes the sub-action c_del or c_keep of action a_t, and thereby obtains the updated data set D_{t+1}, the state s_{t+1} and the state estimate;
according to the obtained state s_t, the action a_t and the post-transition state s_{t+1}, obtaining the reward value r_{t+1}, updating the Q-table according to the Q-learning algorithm and updating the cumulative reward R_e, where the Q-table update formulas are:
z_{t+1} = r_{t+1} + γ max_a Q(s_{t+1}, a)
Q(s_t, a_t) ← Q(s_t, a_t) + α (z_{t+1} - Q(s_t, a_t))
where α ∈ (0, 1) denotes the learning rate and γ ∈ [0, 1] denotes the discount factor;
the cumulative reward is updated as:
R_e ← R_e + r_{t+1}
for each c_m ∈ C_{t+1}, if the termination condition holds (the condition is provided as an image and not reproduced here), the current episode has reached its end condition, where the two quantities appearing in the condition denote, respectively, the set of remaining clusters and the system state estimate obtained after deleting the cluster c_m;
after the current episode reaches the end condition, the cumulative reward R_e is compared with the maximum cumulative reward of the first e-1 episodes in order to obtain a better DDoS attack elimination result: if the cumulative reward exceeds that maximum, the Q-learning agent updates the data set D_{k+1}, the maximum cumulative reward value and the system state estimate; otherwise they remain unchanged;
when the action-state value function Q(s, a) converges or the end of the current time period is reached, the Q-learning agent ends the whole iteration process; at this point the connections in the data set D_{k+1} are treated as normal connections and the connections in D_k - D_{k+1} are treated as abnormal connections and are disconnected; otherwise the feature set F' and the candidate feature set CS are reset and the next episode is started, and the process is restarted after the next time period arrives.
Further, the action a_t is selected according to the following rule:
the Q-learning agent selects the action a_t according to the policy π_DDQN(s_t), where the policy π_DDQN(s_t) is obtained by asynchronously training a DDQN agent; if, over several consecutive episodes of the current time period, the maximum cumulative reward no longer changes, the Q-learning agent stops using the policy π_DDQN(s_t) to select the action a_t, and during the remaining time the Q-learning agent selects the action a_t according to the policy π_Q(s_t) and the ε-greedy method.
Further, obtaining the policy π_DDQN(s_t) by asynchronously training the DDQN agent specifically includes:
asynchronously training the DDQN agent to obtain the policy π_DDQN(s_t), where "asynchronously" means that the DDQN agent is trained in a separate process to obtain the policy π_DDQN(s_t), so that an asynchronous effect is achieved without disturbing the normal operation of the Q-learning algorithm;
the DDQN agent selects an action using the ε-greedy method and the policy π_DDQN(s_t) and executes that action; the DDQN agent transfers to a new state and obtains an immediate reward value; the DDQN agent then stores the current transition (state, action, reward and next state) in a replay buffer M, and updates and optimizes the neural network according to the DDQN algorithm using mini-batch gradient descent, where the loss function is defined with respect to the target value of the DDQN agent (the loss-function and target-value formulas are provided as images and not reproduced here); M_b denotes the number of data items in a mini-batch and γ ∈ [0, 1] is the discount factor; in addition, every τ steps the DDQN agent copies the value of the parameter θ to the target-network parameter;
when the termination condition is reached, the DDQN agent ends the current episode; iterative updating continues until the current time period ends, and the DDQN iterative updating process is restarted after time period k+1 arrives.
According to a second aspect of the embodiments of the present invention, a system for eliminating a feature adaptive reinforcement learning DDoS attack is provided, where the system includes:
a data preprocessing module, configured to dynamically acquire, in the current k-th time period, interaction information data between the base station and the vehicles at a preset time interval, add the data to the historical data set D'_k and the current data set D_k respectively, and then pre-select features;
the model building module is used for building a reinforcement learning model and comprises a model state space, an action space and a reward function;
and a DDoS attack elimination module, configured to use the reinforcement learning model, according to the pre-selected feature set, to train the reinforcement learning agent through multiple iterations within a limited time so as to adaptively select features suited to the current DDoS attack type, detect DDoS attack connections and disconnect them.
The embodiment of the invention has the following advantages:
According to the method and system for feature self-adaptive reinforcement learning DDoS attack elimination provided by the embodiments of the invention, in the current k-th time period, interaction information data between the base station and the vehicles is acquired and added to the historical data set D'_k and the current data set D_k respectively, and features are then pre-selected; a reinforcement learning model is constructed, including a model state space, an action space and a reward function; and, according to the pre-selected feature set, a reinforcement learning agent is trained over multiple iterations within a limited time to adaptively select features suited to the current DDoS attack type, detect DDoS attack connections and disconnect them. By learning attack features adaptively, unknown types of DDoS attack in the Internet of Vehicles can be detected with only a small amount of prior knowledge and without dependence on labelled data, yielding a DDoS attack elimination method that meets the low-latency and high-accuracy requirements of the Internet of Vehicles and provides a new idea and a new detection method for the DDoS attack detection problem in the Internet of Vehicles.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.
Fig. 1 is a schematic flowchart of a method for eliminating DDoS attack through feature adaptive reinforcement learning according to embodiment 1 of the present invention;
fig. 2 is a schematic view of a DDoS attack elimination flow of a vehicle networking according to a feature adaptive reinforcement learning DDoS attack elimination method provided in embodiment 1 of the present invention;
fig. 3 is a schematic flowchart of the asynchronous training phase of the policy π_DDQN(s_t) in the Internet of Vehicles, for the feature adaptive reinforcement learning DDoS attack elimination method provided in embodiment 1 of the present invention;
fig. 4 is a schematic structural diagram of a feature adaptive reinforcement learning DDoS attack elimination system provided in embodiment 2 of the present invention.
Detailed Description
The present invention is described below by way of particular embodiments; other advantages and effects of the invention will become readily apparent to those skilled in the art from the disclosure of this specification. It is to be understood that the described embodiments are merely some, rather than all, of the embodiments of the invention and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
The embodiment 1 of the invention provides a method for eliminating DDoS attack by feature self-adaptive reinforcement learning, which mainly comprises three stages, namely a data preprocessing stage, a reinforcement learning model establishing stage and a DDoS attack eliminating stage by using reinforcement learning.
Specifically, the time axis is divided into equal time periods. The division must be appropriate: a time period that is too long cannot meet the dynamic requirements of the Internet of Vehicles, while one that is too short does not provide enough data to obtain a high-accuracy machine learning model. Within each time period the reinforcement learning agent is trained through multiple iterations, and after each divided time period ends the following three stages are restarted.
As shown in fig. 1 in detail, the method includes the following steps:
Step 110: in the current k-th time period, dynamically acquire interaction information data between the base station and the vehicles at a preset time interval, add the data to the historical data set D'_k and the current data set D_k respectively, and then pre-select features.
This is the data preprocessing stage, which is responsible for collecting the current traffic flow data, analysing the historical data and extracting a good reduced feature subset from the collected historical data.
In the divided current k-th time period, the base station dynamically collects interaction information data between itself and the vehicles at regular intervals and adds the data to the historical data set D'_k and the current data set D_k respectively. The historical data set D'_k is the accumulation of the data sets stored so far, D'_k = D_1 ∪ D_2 ∪ … ∪ D_k; it changes dynamically, with the information of each time period being stored into D'_k, so it records the data collected in past time periods. Not all data are kept, however: the longer the history, the more storage space is needed, so generally only the history of a limited number of time periods is saved rather than the data of all time periods. A predetermined amount of storage is reserved for the data (here the historical data set D'_k may be given a size of 100 MB), and new data overwrite the original data from the beginning.
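As an illustration of the rolling storage just described, the Python sketch below keeps the current data set D_k and a size-capped historical data set D'_k; the 100 MB cap, the record format and all identifiers are assumptions made purely for illustration.
```python
import sys
from collections import deque

class TrafficDataStore:
    """Sketch of the per-period data sets: D_k holds the current period's records,
    and D'_k is a size-capped rolling history whose oldest records are dropped
    once the reserved space is exceeded."""

    def __init__(self, history_cap_bytes=100 * 1024 * 1024):  # assumed 100 MB cap
        self.history_cap_bytes = history_cap_bytes
        self.history = deque()       # D'_k: records accumulated over past periods
        self.history_bytes = 0
        self.current = []            # D_k: records of the current period only

    def add_record(self, record):
        """Append one base-station/vehicle interaction record to D_k and D'_k."""
        self.current.append(record)
        self.history.append(record)
        self.history_bytes += sys.getsizeof(record)
        # Overwrite from the beginning: drop the oldest records when full.
        while self.history_bytes > self.history_cap_bytes and self.history:
            dropped = self.history.popleft()
            self.history_bytes -= sys.getsizeof(dropped)

    def start_new_period(self):
        """At the start of time period k+1, reset D_k while keeping D'_k."""
        self.current = []
```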
Interaction information data between the base station and the vehicles is acquired, a number of features are extracted to represent this information, and the features are added to the data set; there are 41 data features in total. (The table listing the 41 features is provided as images in the original publication and is not reproduced here.)
In this embodiment, the correlations between features are calculated from the historical data set D'_k; redundant and irrelevant features are then removed, the remaining features are placed in the reduced feature set F', the removed features are placed in the candidate feature set CS, and redundant feature pairs (F_i, F_j) are placed in the redundant feature pair set PS.
Further, an initial reduced feature subset F', a candidate feature subset CS and a redundant feature pair set PS are obtained from the historical data set D'_k;
the reduced feature subset F' stores the features F_i satisfying δ_1 < ref(F_i, L) < δ_2, the candidate feature subset CS stores the redundant features F_i satisfying ref(F_i, L) > δ_2, and the redundant feature pair set PS stores the redundant feature pairs (F_i, F_j) satisfying ref(F_i, F_j) > δ_3 together with their ref(F_i, F_j) values;
ref(F_i, F_j) denotes the correlation between features F_i and F_j, and ref(F_i, L) denotes the correlation between feature F_i and the data item label L (the corresponding formulas are provided as images and not reproduced here); E(F_i) is the information entropy of F_i, I(F_i; F_j) is the mutual information between F_i and F_j, and |F| denotes the number of all features in the feature set.
Here δ_1, δ_2 and δ_3 are constants; in this example δ_1 = 0.6, δ_2 = 3.5 and δ_3 = 0.3. The features F_i and F_j refer to any two of the 41 features mentioned above, such as duration, protocol_type or service.
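The following sketch illustrates this pre-selection step. Because the patent's correlation formulas ref(F_i, L) and ref(F_i, F_j) are provided only as images, the sketch takes the correlation values as precomputed inputs; all function names and data structures are illustrative assumptions rather than the patent's implementation.
```python
def preselect_features(features, ref_label, ref_pair,
                       delta1=0.6, delta2=3.5, delta3=0.3):
    """Split the features into the reduced subset F', the candidate subset CS and
    the redundant feature pair set PS, using the thresholds from the embodiment.

    features:  list of feature names (e.g. the 41 features of the data set)
    ref_label: dict {F_i: ref(F_i, L)} of feature-label correlations
    ref_pair:  dict {(F_i, F_j): ref(F_i, F_j)} of pairwise correlations
    """
    F_prime, CS, PS = set(), set(), {}

    for f in features:
        r = ref_label.get(f, 0.0)
        if delta1 < r < delta2:      # moderately label-correlated: keep in F'
            F_prime.add(f)
        elif r > delta2:             # treated as redundant: goes to CS
            CS.add(f)
        # features with r <= delta1 are considered irrelevant and are dropped

    for pair, r in ref_pair.items():
        if r > delta3:               # strongly inter-correlated pair: record in PS
            PS[pair] = r

    return F_prime, CS, PS

# Example call with hypothetical correlation values:
# F_prime, CS, PS = preselect_features(
#     ["duration", "protocol_type", "service"],
#     {"duration": 1.2, "protocol_type": 4.0, "service": 0.4},
#     {("duration", "protocol_type"): 0.5})
```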
Step 120: construct the reinforcement learning model, including the model state space, the action space and the reward function.
This is the reinforcement learning model establishment stage: a reinforcement learning model is built according to the latent, predictable spatio-temporal pattern of traffic flow in the Internet of Vehicles, and a corresponding state space, action space and reward function are constructed for DDoS attacks in the Internet of Vehicles environment.
In the Internet of Vehicles, under normal (non-attack) conditions, some statistical characteristics of the traffic flow, such as the number of connections between the vehicles and the base station x_a, the average number of data bytes sent by the vehicles to the base station x_bit, the proportion of long connections among all connections x_occ, and the proportion of connections with "SYN" errors among all connections x_err, change smoothly, and these statistical characteristics change abruptly when the base station suffers a DDoS attack. In this application, in the reinforcement learning model establishment stage, Kalman filtering is introduced, based on the latent, predictable spatio-temporal pattern of traffic flow in the Internet of Vehicles, to predict the change of the state value at the next moment.
Specifically, the method comprises the following steps:
the state space in the reinforcement learning model is defined as:
s_(t,k) = {s_a, s_bit, s_occ, s_err}_(t,k)
where the subscript (t, k) denotes the t-th iteration within time period k, s_a denotes the number of connections between the vehicles and the base station in time period k-1, s_bit denotes the average number of data bytes sent by the vehicles to the base station in time period k-1, s_occ denotes the proportion of connections exceeding a time threshold e among all connections, and s_err denotes the proportion of connections with "SYN" errors among all connections.
During TCP connection establishment, the transport-layer TCP protocol is connection-oriented: normal data exchange can only take place after a connection has been established, and this connection establishment is the TCP three-way handshake. First, the requesting end (client) sends a TCP segment containing the "SYN" flag (SYN for synchronize); this synchronization segment indicates the port used by the client and the initial sequence number of the TCP connection. Second, after receiving the client's SYN segment, the server returns a SYN+ACK segment indicating that the client's request is accepted. Third, after receiving the server's segment, the client returns an acknowledgement segment ACK to the server, with the TCP sequence number incremented by one, and the TCP connection is completed. A "SYN" error occurs in this TCP three-way handshake process: in the connection establishment phase, after sending the SYN+ACK segment, the server never receives the client's acknowledgement (ACK) segment, so the connection cannot be established.
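To make the state concrete, the sketch below computes the four statistics s_a, s_bit, s_occ and s_err from a list of connection records of the previous period; the record field names, the long-connection threshold e and the SYN-error flag are illustrative assumptions.
```python
def compute_state(connections, time_threshold_e):
    """Build the state {s_a, s_bit, s_occ, s_err} from the previous period's
    connection records (field names are assumed for illustration)."""
    n = len(connections)
    if n == 0:
        return {"s_a": 0, "s_bit": 0.0, "s_occ": 0.0, "s_err": 0.0}

    s_a = n                                                   # number of connections
    s_bit = sum(c["src_bytes"] for c in connections) / n      # mean bytes vehicle -> base station
    s_occ = sum(c["duration"] > time_threshold_e for c in connections) / n
    s_err = sum(c["syn_error"] for c in connections) / n      # share of "SYN" errors
    return {"s_a": s_a, "s_bit": s_bit, "s_occ": s_occ, "s_err": s_err}
```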
The action space in the reinforcement learning model is defined over the sub-actions a_add, a_del, a_keep, c_del and c_keep (the action-space formula is provided as an image and not reproduced here), which are defined as follows (a code sketch of these sub-actions is given after this list):
(1) Action a_add indicates that the reinforcement learning agent selects a feature F_add from the candidate feature subset CS and adds it to the reduced feature subset F', where the feature F_add to be added is selected according to the UCB (Upper Confidence Bound) algorithm; F' ← F' ∪ {F_add} and CS ← CS \ {F_add} are then performed to update the reduced feature subset F' and the candidate feature subset CS. The selection rule for the feature F_add is as follows:
the UCB value of each feature is calculated (the UCB formula is provided as an image and not reproduced here), where N(F_i, t) denotes the total number of times feature F_i has been selected, and R(F_i, t) denotes the average reward obtained by selecting feature F_i over t iterations; R(F_i, t) is computed from the benefit obtained by adding feature F_i at each past iteration and from an indicator of whether feature F_i was selected at that iteration and added to the reduced feature set F' (the corresponding formulas are provided as images and not reproduced here). The benefit of adding a feature is proportional to the reward earned by executing an action in the action space, with a weight coefficient that measures the contribution of selecting feature F_i (action a_add) to the reward value r_{t+1}; this coefficient is usually set to 0.5. The feature with the largest UCB value is then taken as the feature F_add to be added.
(2) Action a_del indicates that the reinforcement learning agent removes a feature F_del from the reduced feature subset F' according to the probability P_d(F_i) and adds it to the candidate feature subset CS; F' ← F' \ {F_del} and CS ← CS ∪ {F_del} are then performed to update the reduced feature subset F' and the candidate feature subset CS, where the calculation rule for the probability P_d(F_i) is provided as an image and not reproduced here; for computational convenience, if a feature pair (F_i, F_j) is not recorded in the redundant feature pair set PS, ref(F_i, F_j) is taken as 0.
(3) Action a_keep indicates that the reinforcement learning agent keeps the current reduced feature subset F' and candidate feature subset CS unchanged.
(4) Action c_del indicates that the reinforcement learning agent randomly selects a cluster c_i and disconnects all connections in the cluster c_i.
(5) Action c_keep indicates that the reinforcement learning agent keeps the current clustering and the existing connections unchanged.
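The sub-actions above can be sketched as follows. Since the UCB score and the deletion probability P_d are given in the patent only as images, the standard UCB1-style score and a ref-sum-based deletion weight are used here purely as stand-ins; they should not be read as the patent's exact formulas.
```python
import math
import random

def ucb_score(avg_reward, times_selected, t, c=0.5):
    """Stand-in UCB value for a candidate feature (UCB1-like form, assumed)."""
    if times_selected == 0:
        return float("inf")
    return avg_reward + c * math.sqrt(2.0 * math.log(max(t, 2)) / times_selected)

def action_add(F_prime, CS, avg_reward, times_selected, t):
    """a_add: move the feature with the largest UCB value from CS into F'."""
    f_add = max(CS, key=lambda f: ucb_score(avg_reward.get(f, 0.0),
                                            times_selected.get(f, 0), t))
    F_prime.add(f_add)
    CS.discard(f_add)
    return f_add

def action_del(F_prime, CS, PS):
    """a_del: remove a feature from F' with probability proportional to its
    redundancy (sum of ref values over the pairs recorded in PS; 0 otherwise)."""
    feats = list(F_prime)
    weights = [sum(v for (fi, fj), v in PS.items() if f in (fi, fj)) for f in feats]
    if sum(weights) == 0:
        return None                  # nothing looks redundant, keep F' unchanged
    f_del = random.choices(feats, weights=weights, k=1)[0]
    F_prime.discard(f_del)
    CS.add(f_del)
    return f_del

def action_c_del(clusters, connection_ids):
    """c_del: randomly pick a cluster and disconnect all connections in it."""
    c_i = random.choice(list(clusters))
    dropped = set(clusters[c_i])
    return [cid for cid in connection_ids if cid not in dropped]
```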
The reward function r_{t+1} in the reinforcement learning model is established as follows:
the following Internet-of-Vehicles traffic flow state-space model is established as a linear time-invariant discrete system:
X_k = Γ X_{k-1} + w_k
Y_k = H X_k + v_k
where the system state vector X_k is composed of x_a, x_bit, x_occ and x_err (the formula is provided as an image and not reproduced here); here x_a denotes the number of connections between the vehicles and the base station, x_bit denotes the average number of data bytes sent by the vehicles to the base station, x_occ denotes the proportion of long connections among all connections, and x_err denotes the proportion of connections with "SYN" errors among all connections; the system measurement vector Y_k is defined correspondingly (the formula is provided as an image and not reproduced here); Γ is the state transition matrix, H is the measurement matrix, w_k denotes the process noise associated with the randomness of traffic flow fluctuations in the Internet of Vehicles and with prediction-model inaccuracy, and v_k denotes the measurement noise during data collection; w_k and v_k are assumed to be uncorrelated zero-mean Gaussian white noise processes whose covariance matrices are W_k and V_k respectively (the covariance formulas are provided as images and not reproduced here);
The predicted value X̂_{k|k-1} and the estimated value X̂_k are calculated as follows:
the Kalman filter prediction and update formulas are:
X̂_{k|k-1} = Γ X̂_{k-1}
P_{k|k-1} = Γ P_{k-1} Γ^T + W_{k-1}
G_k = P_{k|k-1} H^T (H P_{k|k-1} H^T + V_k)^(-1)
X̂_k = X̂_{k|k-1} + G_k (Y_k - H X̂_{k|k-1})
P_k = P_{k|k-1} - G_k H P_{k|k-1}
where X̂_{k|k-1} and X̂_k denote the predicted and estimated values of the system state; the estimate X̂_k combines the prediction X̂_{k|k-1} with the prediction error; G_k is the Kalman gain matrix, acting as a weighting factor; P_{k|k-1} denotes the a-priori estimation-error covariance matrix and P_k denotes the a-posteriori estimation-error covariance matrix;
the state vector estimate X̂_{(t+1,k)} of the t-th iteration of an episode within time period k is obtained from X̂_{(t,k)} and the measurement Y_{(t+1,k)} by the corresponding Kalman update step (the formula is provided as an image and not reproduced here), where Y_{(t+1,k)} is the measurement of the t-th iteration of the episode within time period k.
The state vector prediction X̂_{k|k-1} obtained by Kalman filtering from time period k-1 and the state vector estimate X̂_{(t,k)} of the t-th iteration of an episode within time period k are thus available. The reinforcement learning agent executes the corresponding action according to the current state and, after transferring to a new state, obtains the corresponding reward value r_{t+1} (the reward formula and its accompanying definition are provided as images and not reproduced here).
in the following steps, the subscript k is omitted and the subscript t is used instead of (t, k).
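A minimal NumPy sketch of the Kalman prediction and update steps used above is given below. The concrete Γ, H, W and V matrices are assumptions for illustration, and the reward function is only a plausible stand-in, since the patent's reward formula appears only as an image.
```python
import numpy as np

def kalman_predict(x_est, P, Gamma, W):
    """Prediction step: x_pred = Gamma @ x_est, P_pred = Gamma P Gamma^T + W."""
    x_pred = Gamma @ x_est
    P_pred = Gamma @ P @ Gamma.T + W
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, y, H, V):
    """Update step: fold the measurement y into the prediction."""
    S = H @ P_pred @ H.T + V
    G = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain G_k
    x_est = x_pred + G @ (y - H @ x_pred)      # estimate = prediction + weighted error
    P = P_pred - G @ H @ P_pred
    return x_est, P, G

def reward_stub(x_est, x_pred):
    """Illustrative stand-in only: reward the agent when the post-action state
    estimate is close to the Kalman prediction of normal traffic. The patent's
    actual reward formula is provided as an image and is not reproduced."""
    return -float(np.linalg.norm(x_est - x_pred))
```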
Step 130: according to the pre-selected feature set, use the reinforcement learning model to train the reinforcement learning agent through multiple iterations so as to adaptively select features suited to the current DDoS attack type, detect DDoS attack connections and disconnect them.
Finally, this is the stage of eliminating the DDoS attack using reinforcement learning. It is responsible for training the reinforcement learning agent according to the reinforcement learning model obtained in the previous stage, adaptively selecting attack features within a limited time and removing DDoS attack connections as far as possible, so as to eliminate the DDoS attack in the Internet of Vehicles. This stage mainly involves training a Q-learning agent to select features suited to the current DDoS attack type, which requires only a small amount of empirical knowledge to handle DDoS attacks of unknown type adaptively, and asynchronously training a DDQN (Double Deep Q-Network) agent to obtain the policy π_DDQN(s_t), which guides the Q-learning agent's action selection.
Training the Q-learning agent to select features suited to the current DDoS attack type comprises the following steps:
Step 131: select an action a_t according to the policy π_DDQN(s_t) and the policy π_Q(s_t).
The action a_t is selected as follows: the Q-learning agent selects an action a_t according to the policy π_DDQN(s_t), where the policy π_DDQN(s_t) is obtained by asynchronously training the DDQN agent; if, over several consecutive episodes of the current time period, the maximum cumulative reward no longer changes, the Q-learning agent stops using the policy π_DDQN(s_t) to select the action a_t, and during the remaining time the Q-learning agent selects the action a_t according to the policy π_Q(s_t) and the ε-greedy method.
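The switching rule of step 131 can be sketched as follows; the way stagnation of the maximum cumulative reward is detected, the patience of 3 episodes and the ε value are illustrative assumptions.
```python
import random

def select_action(state, q_table, ddqn_policy, actions,
                  max_reward_history, patience=3, epsilon=0.1):
    """Follow pi_DDQN until the maximum cumulative reward has not changed for
    `patience` consecutive episodes, then fall back to epsilon-greedy pi_Q."""
    stalled = (len(max_reward_history) > patience and
               len(set(max_reward_history[-patience:])) == 1)
    if not stalled:
        return ddqn_policy(state)               # action suggested by the DDQN agent
    if random.random() < epsilon:               # epsilon-greedy exploration
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```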
Step 132: execute the action a_t.
Specifically, the sub-action a_add, a_del or a_keep of action a_t is executed to update the feature sets F' and CS; the Q-learning agent then obtains the clustering result C_t = {c_1, c_2, ..., c_n, ...}_t of the data set D_t by applying the DBSCAN clustering algorithm on the reduced feature subset F', executes the sub-action c_del or c_keep of action a_t, and thereby obtains the updated data set D_{t+1}, the state s_{t+1} and the state estimate X̂_{t+1}.
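Step 132's clustering of the current connections over the reduced feature subset F' can be sketched with scikit-learn's DBSCAN; the eps and min_samples values are illustrative hyper-parameters, not values taken from the patent.
```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_connections(connections, reduced_features, eps=0.5, min_samples=5):
    """Cluster the data set D_t using only the features currently in F'.

    connections:      list of dicts mapping feature name -> numeric value
    reduced_features: iterable of feature names (the reduced subset F')
    Returns {cluster_label: [connection indices]}; label -1 is DBSCAN noise.
    """
    feats = sorted(reduced_features)
    X = np.array([[c[f] for f in feats] for c in connections], dtype=float)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(int(label), []).append(idx)
    return clusters
```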
Step 133: update the Q-table and the cumulative reward R_e.
Specifically, according to the obtained state s_t, the action a_t and the post-transition state s_{t+1}, the reward value r_{t+1} is obtained, the Q-table is updated according to the Q-learning algorithm and the cumulative reward R_e is updated. The Q-table update formulas are:
z_{t+1} = r_{t+1} + γ max_a Q(s_{t+1}, a)
Q(s_t, a_t) ← Q(s_t, a_t) + α (z_{t+1} - Q(s_t, a_t))
where α ∈ (0, 1) denotes the learning rate and γ ∈ [0, 1] denotes the discount factor;
the cumulative reward R_e is updated as:
R_e ← R_e + r_{t+1}
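The tabular update of step 133 can be written directly from the two formulas above; the dictionary-based Q-table and the default α and γ values are illustrative.
```python
def q_learning_update(q_table, s_t, a_t, r_next, s_next, actions,
                      alpha=0.1, gamma=0.9):
    """Apply z_{t+1} = r_{t+1} + gamma * max_a Q(s_{t+1}, a) and
    Q(s_t, a_t) <- Q(s_t, a_t) + alpha * (z_{t+1} - Q(s_t, a_t))."""
    best_next = max(q_table.get((s_next, a), 0.0) for a in actions)
    z = r_next + gamma * best_next
    q_old = q_table.get((s_t, a_t), 0.0)
    q_table[(s_t, a_t)] = q_old + alpha * (z - q_old)
    return q_table

def accumulate_reward(R_e, r_next):
    """R_e <- R_e + r_{t+1}."""
    return R_e + r_next
```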
and step 134, judging whether the current screen reaches the termination condition.
In particular, for each cm∈Ct+1If, if
Figure BDA0002261132150000171
Indicating that the current screen has reached an end condition, wherein,
Figure BDA0002261132150000172
and
Figure BDA0002261132150000173
respectively representing deletion clusters cmA set of clusters and system state estimates then remain.
Step 135: update the data set D_{k+1}, the maximum cumulative reward value and the system state estimate.
Specifically, after the current episode reaches the end condition, the cumulative reward R_e is compared with the maximum cumulative reward of the first e-1 episodes in order to obtain a better DDoS attack elimination result: if the cumulative reward exceeds that maximum, the Q-learning agent updates the data set D_{k+1}, the maximum cumulative reward value and the system state estimate; otherwise they remain unchanged.
In this embodiment, the symbol D_{k+1} denotes the final result of the current time period, i.e. the normal connections are stored in the data set D_{k+1}; D_{k+1} is initially an empty set and is updated at each iteration of the algorithm. In particular, D_{k+1} here does not denote the data set of the next, (k+1)-th, time period.
Step 136: determine whether Q(s, a) has converged or the end of the current time period has been reached.
When the action-state value function Q(s, a) converges (convergence here being that of the Q-learning algorithm) or the end of the current time period is reached, the Q-learning agent ends the whole iteration process. At this point, the connections in the data set D_{k+1} are treated as normal connections and the connections in D_k - D_{k+1} are treated as abnormal connections and are disconnected, thereby eliminating the DDoS attack; otherwise, the feature set F' and the candidate feature set CS are reset and the next episode is started. After the next time period arrives, the process is restarted.
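The end-of-period bookkeeping of steps 135 and 136 amounts to the simple set manipulation below; the convergence test on Q(s, a) is left abstract and all identifiers are illustrative.
```python
def update_best_episode(R_e, best_reward, D_episode, best_D, x_est, best_x):
    """Step 135: keep the data set and state estimate of the best episode so far."""
    if R_e > best_reward:
        return R_e, list(D_episode), x_est
    return best_reward, best_D, best_x

def finish_period(D_k, D_k_plus_1):
    """Step 136: split the period's connections into normal ones (kept) and
    suspected DDoS attack connections (to be disconnected)."""
    normal = set(D_k_plus_1)
    attack = set(D_k) - normal       # D_k - D_{k+1}
    return normal, attack
```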
The policy π_DDQN(s_t) is obtained by asynchronously training the DDQN agent, specifically as follows:
the DDQN agent is trained asynchronously to obtain the policy π_DDQN(s_t); "asynchronously" here means that the DDQN agent is trained in a separate process to obtain the policy π_DDQN(s_t), so that an asynchronous effect is achieved without disturbing the normal operation of the Q-learning algorithm. An over-bar is added to the original symbols (for example the state, the action and the reward) to distinguish them from the symbols used in the Q-learning algorithm; the symbols are otherwise defined as above.
The DDQN agent selects an action using the ε-greedy method and the policy π_DDQN(s_t) and executes that action in the same way as in step 132. The DDQN agent then transfers to a new state and obtains an immediate reward value. At this point, the DDQN agent stores the current transition (the over-barred state, action, reward and next state) in a replay buffer M, and then updates and optimizes the online neural network according to the DDQN algorithm using mini-batch gradient descent, where the loss function is defined with respect to the target value of the DDQN agent (the loss-function and target-value formulas are provided as images and not reproduced here); M_b denotes the number of data items in a mini-batch and γ ∈ [0, 1] is the discount factor. In addition, every τ steps the DDQN agent copies the value of the parameter θ to the target-network parameter.
When the termination condition is reached, the DDQN agent ends the current episode; iterative updating continues until the current time period ends, and the DDQN iterative updating process is restarted after time period k+1 arrives.
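A compact sketch of the asynchronous DDQN update is shown below. The replay-buffer capacity, batch size M_b, the treatment of the Q-networks as callables and the Double-DQN target form are written in the standard way as assumptions; the patent's own loss and target formulas are provided only as images.
```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Replay buffer M storing (state, action, reward, next_state) transitions."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def ddqn_targets(batch, q_online, q_target, gamma=0.9):
    """Double-DQN target (standard form, assumed): the online network chooses the
    next action, the target network evaluates it."""
    targets = []
    for s, a, r, s_next in batch:
        a_star = int(np.argmax(q_online(s_next)))               # argmax_a Q(s', a; theta)
        targets.append(r + gamma * q_target(s_next)[a_star])    # evaluated with theta^-
    return np.array(targets)

def ddqn_loss(batch, q_online, q_target, gamma=0.9):
    """Mean-squared mini-batch loss between Q(s, a; theta) and the DDQN target;
    minimizing it by mini-batch gradient descent updates the online network."""
    y = ddqn_targets(batch, q_online, q_target, gamma)
    q_sa = np.array([q_online(s)[a] for s, a, _, _ in batch])
    return float(np.mean((y - q_sa) ** 2))

# q_online and q_target are assumed to be callables mapping a state to a vector of
# action values; every tau steps the online parameters theta would be copied to the
# target network (e.g. target_net.load_state_dict(online_net.state_dict()) in PyTorch).
```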
The above DDoS attack elimination method based on the combination of Kalman filtering, Q-learning and the DDQN algorithm first obtains the initial reduced feature subset F', the candidate feature subset CS and the redundant feature pair set PS from the data set D'_k, and simultaneously obtains the prediction of the current system state using Kalman filtering. According to the current state s_t, the reinforcement learning agent executes the action a_t, transfers to a new state s_{t+1} and obtains a new reward value r_{t+1} based on the obtained prediction. After multiple iterations within a time period, the data set D_{k+1} corresponding to the maximum cumulative reward R_e is finally output. At this point, the connections contained in the data set D_{k+1} are the normal connections, while the connections in D_k - D_{k+1} are DDoS attack connections and are disconnected, thereby eliminating the DDoS attack in the Internet of Vehicles.
According to the feature self-adaptive reinforcement learning DDoS attack elimination method described above, a good reduced feature subset is first extracted from the collected historical data; a reinforcement learning model is then established according to the latent, predictable spatio-temporal pattern of traffic flow in the Internet of Vehicles; finally, according to the reinforcement learning model, a Q-learning agent is trained to select features suited to the current DDoS attack type, so that DDoS attacks of unknown type can be handled adaptively with only a small amount of empirical knowledge, while a DDQN agent is trained asynchronously to obtain the policy π_DDQN(s_t), which guides the Q-learning agent's action selection. By learning attack features adaptively, unknown types of DDoS attack in the Internet of Vehicles can be detected with only a small amount of prior knowledge and without dependence on labelled data, yielding a DDoS attack elimination method that meets the low-latency and high-accuracy requirements of the Internet of Vehicles and provides a new idea and a new detection method for the DDoS attack detection problem in the Internet of Vehicles.
Corresponding to the foregoing embodiment 1, an embodiment 2 of the present invention provides a system for eliminating a DDoS attack through feature adaptive reinforcement learning, where the system includes:
a data preprocessing module 210, configured to dynamically acquire, in the current k-th time period, interaction information data between the base station and the vehicles at a preset time interval, add the data to the historical data set D'_k and the current data set D_k respectively, and then pre-select features;
the model construction module 220 is used for constructing a reinforcement learning model, including the construction of a model state space, an action space and a reward function;
and a DDoS attack elimination module 230, configured to use the reinforcement learning model, according to the pre-selected feature set, to train the reinforcement learning agent through multiple iterations within a limited time so as to adaptively select features suited to the current DDoS attack type, detect DDoS attack connections and disconnect them.
The functions executed by each component in the feature adaptive reinforcement learning DDoS attack elimination system provided by the embodiment of the present invention have been described in detail in the above embodiment 1, and therefore, redundant description is not repeated here.
According to the feature self-adaptive reinforcement learning DDoS attack elimination system provided by the embodiment of the invention, a good reduced feature subset is first extracted from the collected historical data; a reinforcement learning model is then established according to the latent, predictable spatio-temporal pattern of traffic flow in the Internet of Vehicles; finally, according to the reinforcement learning model, a Q-learning agent is trained to select features suited to the current DDoS attack type, so that DDoS attacks of unknown type can be handled adaptively with only a small amount of empirical knowledge, while a DDQN agent is trained asynchronously to obtain the policy π_DDQN(s_t), which guides the Q-learning agent's action selection. By learning attack features adaptively, unknown types of DDoS attack in the Internet of Vehicles can be detected with only a small amount of prior knowledge and without dependence on labelled data, yielding a DDoS attack elimination method that meets the low-latency and high-accuracy requirements of the Internet of Vehicles and provides a new idea and a new detection method for the DDoS attack detection problem in the Internet of Vehicles.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (10)

1. A method for eliminating DDoS attack of feature adaptive reinforcement learning is characterized by comprising the following steps:
in the current k-th time period, dynamically acquiring interaction information data between the base station and the vehicles at a preset time interval and adding the data to the historical data set D'_k and the current data set D_k respectively, and then pre-selecting features;
constructing a reinforcement learning model, including the construction of a model state space, an action space and a reward function;
and according to the pre-selected feature set, using the reinforcement learning model to train a reinforcement learning agent through multiple iterations within a limited time so as to adaptively select features suited to the current DDoS attack type, detect DDoS attack connections and disconnect them.
2. The method for eliminating the DDoS attack of feature self-adaptive reinforcement learning according to claim 1, wherein dynamically acquiring interaction information data between the base station and the vehicles at a preset time interval in the current k-th time period, adding the data to the historical data set D'_k and the current data set D_k respectively, and then pre-selecting features specifically comprises:
obtaining an initial reduced feature subset F', a candidate feature subset CS and a redundant feature pair set PS from the historical data set D'_k;
the reduced feature subset F' stores the features F_i satisfying δ_1 < ref(F_i, L) < δ_2, the candidate feature subset CS stores the redundant features F_i satisfying ref(F_i, L) > δ_2, and the redundant feature pair set PS stores the redundant feature pairs (F_i, F_j) satisfying ref(F_i, F_j) > δ_3 together with their ref(F_i, F_j) values;
ref(F_i, F_j) denotes the correlation between features F_i and F_j, and ref(F_i, L) denotes the correlation between feature F_i and the data item label L (the corresponding formulas are provided as images and not reproduced here); E(F_i) is the information entropy of F_i, I(F_i; F_j) is the mutual information between F_i and F_j, and |F| denotes the number of all features in the feature set.
3. The method for eliminating the feature adaptive reinforcement learning DDoS attack according to claim 2, wherein the state space in the reinforcement learning model is defined as:
s_(t,k) = {s_a, s_bit, s_occ, s_err}_(t,k)
where the subscript (t, k) denotes the t-th iteration within time period k, s_a denotes the number of connections between the vehicles and the base station in time period k-1, s_bit denotes the average number of data bytes sent by the vehicles to the base station in time period k-1, s_occ denotes the proportion of connections exceeding a time threshold e among all connections, and s_err denotes the proportion of connections with "SYN" errors among all connections.
4. The method for eliminating the feature adaptive reinforcement learning DDoS attack according to claim 2, wherein the action space in the reinforcement learning model is defined over the sub-actions a_add, a_del, a_keep, c_del and c_keep (the action-space formula is provided as an image and not reproduced here), which are defined as follows:
action a_add indicates that the reinforcement learning agent selects a feature F_add from the candidate feature subset CS according to the Upper Confidence Bound (UCB) algorithm and adds it to the reduced feature subset F'; F' ← F' ∪ {F_add} and CS ← CS \ {F_add} are then performed to update the reduced feature subset F' and the candidate feature subset CS;
action a_del indicates that the reinforcement learning agent removes a feature F_del from the reduced feature subset F' according to the probability P_d(F_i) and adds it to the candidate feature subset CS; F' ← F' \ {F_del} and CS ← CS ∪ {F_del} are then performed to update the reduced feature subset F' and the candidate feature subset CS, where the calculation rule for the probability P_d(F_i) is provided as an image and not reproduced here; for computational convenience, if a feature pair (F_i, F_j) is not recorded in the redundant feature pair set PS, ref(F_i, F_j) is taken as 0;
action a_keep indicates that the reinforcement learning agent keeps the current reduced feature subset F' and candidate feature subset CS unchanged;
action c_del indicates that the reinforcement learning agent randomly selects a cluster c_i and disconnects all connections in the cluster c_i;
action c_keep indicates that the reinforcement learning agent keeps the current clustering and the existing connections unchanged.
5. The method for eliminating the DDoS attack in the feature self-adaptive reinforcement learning according to claim 2, wherein the reward function in the reinforcement learning model is established as follows:
the following Internet-of-Vehicles traffic flow state-space model is established as a linear time-invariant discrete system:
X_k = Γ X_{k-1} + w_k
Y_k = H X_k + v_k
where X_k is the system state vector, Y_k is the system measurement vector, Γ is the state transition matrix, H is the measurement matrix, w_k denotes the process noise associated with the randomness of traffic flow fluctuations in the Internet of Vehicles and with prediction-model inaccuracy, and v_k denotes the measurement noise during data collection; w_k and v_k are assumed to be uncorrelated zero-mean Gaussian white noise processes whose covariance matrices are W_k and V_k respectively (the covariance formulas are provided as images and not reproduced here);
the state vector prediction X̂_{k|k-1} obtained by Kalman filtering from time period k-1 and the state vector estimate X̂_{(t,k)} of the t-th iteration of an episode within time period k are obtained;
the reinforcement learning agent executes the corresponding action according to the current state and, after transferring to a new state, obtains the corresponding reward value r_{t+1} (the reward formula and its accompanying definition are provided as images and not reproduced here).
6. The method for eliminating the feature adaptive reinforcement learning DDoS attack according to claim 5, wherein the predicted value X̂_{k|k-1} and the estimated value X̂_k are calculated as follows:
the Kalman filter prediction and update formulas are:
X̂_{k|k-1} = Γ X̂_{k-1}
P_{k|k-1} = Γ P_{k-1} Γ^T + W_{k-1}
G_k = P_{k|k-1} H^T (H P_{k|k-1} H^T + V_k)^(-1)
X̂_k = X̂_{k|k-1} + G_k (Y_k - H X̂_{k|k-1})
P_k = P_{k|k-1} - G_k H P_{k|k-1}
where X̂_{k|k-1} and X̂_k denote the predicted and estimated values of the system state; the estimate X̂_k combines the prediction X̂_{k|k-1} with the prediction error; G_k is the Kalman gain matrix, acting as a weighting factor; P_{k|k-1} denotes the a-priori estimation-error covariance matrix and P_k denotes the a-posteriori estimation-error covariance matrix;
the state vector estimate X̂_{(t+1,k)} of the t-th iteration of an episode within time period k is obtained from X̂_{(t,k)} and the measurement Y_{(t+1,k)} by the corresponding Kalman update step (the formula is provided as an image and not reproduced here), where Y_{(t+1,k)} is the measurement of the t-th iteration of the episode within time period k.
7. The method for eliminating feature adaptive reinforcement learning DDoS attack according to claim 2, characterized by being according to a historical data set D'kAnd a current data set DkThe reinforcement learning model is used for adaptively selecting the characteristics suitable for the current DDoS attack type in a limited time through a plurality of times of iterative training reinforcement learning agent, detecting DDoS attack connection and disconnection, and specifically comprises the following steps:
selecting an action a_t according to the policy π_DDQN(s_t) and the policy π_Q(s_t);

performing the sub-actions a_add, a_del and a_keep of action a_t to update the feature set F' and the candidate feature set CS; the Q-learning agent obtains the clustering result C_t = {c_1, c_2, ..., c_n, ...}_t of the data set D_t according to the reduced feature subset F, then performs the sub-action c_del or c_keep of action a_t, and thereby obtains the updated data set D_{t+1}, the state s_{t+1} and the corresponding state estimate;
according to the obtained state s_t, the action a_t and the post-transition state s_{t+1}, obtaining the reward value r_{t+1}, updating the Q-table according to the Q-learning algorithm, and updating the cumulative reward R_e; here, the Q-table update formula is:

z_{t+1} = r_{t+1} + γ max_a Q(s_{t+1}, a)

Q(s_t, a_t) ← Q(s_t, a_t) + α (z_{t+1} - Q(s_t, a_t))

wherein α ∈ (0, 1) denotes the learning rate and γ ∈ [0, 1] denotes the discount factor;

the cumulative reward is updated as:

R_e ← R_e + r_{t+1}
for each c_m ∈ C_{t+1}: if the end condition holds (the condition, given as an equation image in the original, is expressed in terms of the set of remaining clusters and the system state estimate obtained after deleting cluster c_m), the current episode has reached its end condition;
after the current episode reaches its end condition, the cumulative reward R_e is compared with the maximum cumulative reward of the first e-1 episodes so as to obtain a better DDoS attack elimination result; if the cumulative reward R_e exceeds that maximum, the Q-learning agent updates the data set D_{k+1}, the maximum cumulative reward value and the system state estimate; otherwise, they are kept unchanged;
when the action-state value function Q(s, a) converges or the end of the current time period is reached, the Q-learning agent ends the whole iterative process; at this point the data set D_{k+1} is treated as the normal connections and D_k - D_{k+1} as the DDoS attack connections to be disconnected; otherwise, the feature set F' and the candidate feature set CS are reset and the next episode is started, and the whole process restarts once the next time period arrives.
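A sketch of the tabular Q-learning loop in claim 7 above, with ε-greedy selection, the z_{t+1} target and cumulative-reward bookkeeping. The environment object (reset/step/actions) is hypothetical glue code, not defined by the patent, and the DDQN-guided action selection of claim 8 is omitted here.

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Run one episode of tabular Q-learning and return its cumulative reward R_e."""
    state = env.reset()
    R_e = 0.0
    done = False
    while not done:
        actions = env.actions(state)
        # epsilon-greedy selection over the Q-table
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = env.step(action)
        # z_{t+1} = r_{t+1} + gamma * max_a Q(s_{t+1}, a)
        z = reward + gamma * max((Q[(next_state, a)] for a in env.actions(next_state)),
                                 default=0.0)
        Q[(state, action)] += alpha * (z - Q[(state, action)])
        R_e += reward
        state = next_state
    return R_e

# usage sketch: keep only the best episode's result per time period
# Q = defaultdict(float)
# best = max(q_learning_episode(env, Q) for _ in range(num_episodes))
```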
8. The feature-adaptive reinforcement learning DDoS attack elimination method according to claim 7, wherein the action a_t is selected according to the following rules:

the Q-learning agent selects an action a_t according to the policy π_DDQN(s_t), which is obtained by asynchronously training the DDQN agent; if, over multiple consecutive episodes in the current time period, the maximum cumulative reward remains unchanged, the Q-learning agent stops using the policy π_DDQN(s_t) to select action a_t, and during the remaining time the Q-learning agent follows the policy π_Q(s_t) together with the ε-greedy method to select action a_t.
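A small sketch of the switching rule just described: keep using π_DDQN while it still improves the best cumulative reward, otherwise fall back to the ε-greedy π_Q policy for the rest of the time period. The patience threshold (how many unchanged episodes count as "multiple") is an assumption, since the patent does not fix a number.

```python
def choose_policy(episode_rewards, patience=3):
    """Return which policy the Q-learning agent should follow next."""
    if len(episode_rewards) <= patience:
        return "pi_DDQN"
    best_before = max(episode_rewards[:-patience])
    best_recent = max(episode_rewards[-patience:])
    # no new maximum cumulative reward over `patience` consecutive episodes
    return "pi_Q" if best_recent <= best_before else "pi_DDQN"
```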
9. The method according to claim 8, wherein the policy π_DDQN(s_t) is obtained by asynchronously training the DDQN agent, specifically comprising:

the DDQN agent is trained asynchronously to obtain the policy π_DDQN(s_t); asynchronous here means that the DDQN agent is trained in a separate process, so that the asynchronous effect is achieved and the normal operation of the Q-learning algorithm is not disturbed;
method and strategy pi for DDQN agent to use epsilon-greedyDDQN(st) Selection actions
Figure FDA0002261132140000052
And perform actions
Figure FDA0002261132140000053
DDQN agent transitions to a new state
Figure FDA0002261132140000054
And obtain an immediate prize value
Figure FDA0002261132140000055
At this time, the DDQN intelligence body will
Figure FDA0002261132140000056
Storing the data into a playback buffer M, and then updating and optimizing the neural network according to a DDQN algorithm by adopting a small-batch gradient descent method
Figure FDA0002261132140000057
Wherein the loss function is defined as:
Figure FDA0002261132140000058
wherein M isbIndicating the size of the number of data items for a batch process,
Figure FDA0002261132140000059
is the target value for the DDQN agent:
Figure FDA00022611321400000510
wherein γ ∈ [0, 1] is the discount factor; in addition, every τ steps the DDQN agent copies the value of the parameters θ into the target-network parameters θ⁻;
when the termination condition is reached, the DDQN agent ends the current episode; the iterative updating continues until the current time period ends, and the DDQN iterative update process restarts after the (k+1)-th time period arrives.
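A hedged PyTorch sketch of one mini-batch DDQN update as described in claim 9: the online network chooses the best next action, the target network evaluates it, the mean-squared loss over M_b sampled transitions is minimized, and θ is copied into the target parameters every τ steps. The network classes, buffer layout and hyper-parameter values are assumptions for the example.

```python
import random
import torch
import torch.nn.functional as F

def ddqn_update(online_net, target_net, optimizer, replay_buffer,
                step, batch_size=32, gamma=0.9, tau=100):
    """One mini-batch DDQN update; returns the scalar loss value."""
    batch = random.sample(replay_buffer, batch_size)   # (s, a, r, s_next, done) tuples
    s      = torch.stack([torch.as_tensor(b[0], dtype=torch.float32) for b in batch])
    a      = torch.tensor([b[1] for b in batch], dtype=torch.long)
    r      = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s_next = torch.stack([torch.as_tensor(b[3], dtype=torch.float32) for b in batch])
    done   = torch.tensor([b[4] for b in batch], dtype=torch.float32)

    # Double-DQN target: action chosen by the online net, value taken from the target net
    with torch.no_grad():
        best_a = online_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, best_a).squeeze(1)
        y = r + gamma * (1.0 - done) * q_next

    q = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, y)          # (1/M_b) * sum of squared target errors
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # every tau steps, copy theta into the target-network parameters
    if step % tau == 0:
        target_net.load_state_dict(online_net.state_dict())
    return loss.item()
```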
10. A feature-adaptive reinforcement learning DDoS attack elimination system, the system comprising:

a data preprocessing module, configured to dynamically obtain interaction information data between the base station and the vehicles at a preset time interval in the current k-th time period, add the interaction information data to the historical data set D'_k and the current data set D_k, and then pre-select features;

a model building module, configured to build the reinforcement learning model, comprising its state space, action space and reward function;

and a DDoS attack elimination module, configured to use the reinforcement learning model, according to the pre-selected feature set, to adaptively select within a limited time, through multiple iterations of training the reinforcement learning agent, the features suited to the current DDoS attack type, and to detect and disconnect DDoS attack connections.
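A skeleton of the three modules named in claim 10, for orientation only; the module interfaces and names are illustrative, since the claim specifies responsibilities rather than code.

```python
class DDoSEliminationSystem:
    """Three-module structure of the claimed system (sketch)."""

    def preprocess(self, interactions, history_dk, current_dk):
        # data preprocessing module: append the base-station/vehicle interaction
        # data collected at the preset interval in period k to D'_k and D_k,
        # then pre-select features
        history_dk.extend(interactions)
        current_dk.extend(interactions)
        return self._preselect_features(current_dk)

    def build_model(self):
        # model building module: define the state space, the action space and
        # the reward function of the reinforcement learning model
        raise NotImplementedError

    def eliminate(self, preselected_features):
        # DDoS attack elimination module: iteratively train the agent, adaptively
        # select features suited to the current attack type within a limited time,
        # and detect and disconnect DDoS attack connections
        raise NotImplementedError

    def _preselect_features(self, data):
        raise NotImplementedError
```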
CN201911071642.XA 2019-11-05 2019-11-05 Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning Active CN110958135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911071642.XA CN110958135B (en) 2019-11-05 2019-11-05 Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911071642.XA CN110958135B (en) 2019-11-05 2019-11-05 Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning

Publications (2)

Publication Number Publication Date
CN110958135A true CN110958135A (en) 2020-04-03
CN110958135B CN110958135B (en) 2021-07-13

Family

ID=69976608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911071642.XA Active CN110958135B (en) 2019-11-05 2019-11-05 Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning

Country Status (1)

Country Link
CN (1) CN110958135B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814988A (en) * 2020-07-07 2020-10-23 北京航空航天大学 Testing method of multi-agent cooperative environment reinforcement learning algorithm
CN112101556A (en) * 2020-08-25 2020-12-18 清华大学 Method and device for identifying and removing redundant information in environment observation quantity
CN112256739A (en) * 2020-11-12 2021-01-22 同济大学 Method for screening data items in dynamic flow big data based on multi-armed bandit
CN112365048A (en) * 2020-11-09 2021-02-12 大连理工大学 Unmanned vehicle reconnaissance method based on opponent behavior prediction
CN112435275A (en) * 2020-12-07 2021-03-02 中国电子科技集团公司第二十研究所 Unmanned aerial vehicle maneuvering target tracking method integrating Kalman filtering and DDQN algorithm
CN112446470A (en) * 2020-11-12 2021-03-05 北京工业大学 Reinforced learning method for coherent synthesis
CN112637814A (en) * 2021-01-27 2021-04-09 桂林理工大学 DDoS attack defense method based on trust management
CN112670982A (en) * 2020-12-14 2021-04-16 广西电网有限责任公司电力科学研究院 Active power scheduling control method and system for micro-grid based on reward mechanism
CN113055384A (en) * 2021-03-12 2021-06-29 周口师范学院 SSDDQN network abnormal flow detection method
CN113128705A (en) * 2021-03-24 2021-07-16 北京科技大学顺德研究生院 Intelligent agent optimal strategy obtaining method and device
CN114374541A (en) * 2021-12-16 2022-04-19 四川大学 Abnormal network flow detector generation method based on reinforcement learning
CN115840363A (en) * 2022-12-06 2023-03-24 上海大学 Denial of service attack method for remote state estimation of cyber-physical system

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240171979A1 (en) * 2021-07-15 2024-05-23 Telefonaktiebolaget Lm Ericsson (Publ) Detecting anomalous behaviour in an edge communication network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108574668A (en) * 2017-03-10 2018-09-25 北京大学 A kind of ddos attack peak flow prediction technique based on machine learning
US20190061147A1 (en) * 2016-04-27 2019-02-28 Neurala, Inc. Methods and Apparatus for Pruning Experience Memories for Deep Neural Network-Based Q-Learning
CN109639515A (en) * 2019-02-16 2019-04-16 北京工业大学 Ddos attack detection method based on hidden Markov and Q study cooperation
CN110401675A (en) * 2019-08-20 2019-11-01 绍兴文理学院 Uncertain ddos attack defence method under a kind of sensing cloud environment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190061147A1 (en) * 2016-04-27 2019-02-28 Neurala, Inc. Methods and Apparatus for Pruning Experience Memories for Deep Neural Network-Based Q-Learning
CN108574668A (en) * 2017-03-10 2018-09-25 北京大学 A kind of ddos attack peak flow prediction technique based on machine learning
CN109639515A (en) * 2019-02-16 2019-04-16 北京工业大学 Ddos attack detection method based on hidden Markov and Q study cooperation
CN110401675A (en) * 2019-08-20 2019-11-01 绍兴文理学院 Uncertain ddos attack defence method under a kind of sensing cloud environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhu Zhenguo et al.: "Feature Selection Algorithm Based on Reinforcement Learning", Computer Systems & Applications *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814988A (en) * 2020-07-07 2020-10-23 北京航空航天大学 Testing method of multi-agent cooperative environment reinforcement learning algorithm
CN112101556B (en) * 2020-08-25 2021-08-10 清华大学 Method and device for identifying and removing redundant information in environment observation quantity
CN112101556A (en) * 2020-08-25 2020-12-18 清华大学 Method and device for identifying and removing redundant information in environment observation quantity
CN112365048A (en) * 2020-11-09 2021-02-12 大连理工大学 Unmanned vehicle reconnaissance method based on opponent behavior prediction
CN112256739A (en) * 2020-11-12 2021-01-22 同济大学 Method for screening data items in dynamic flow big data based on multi-armed bandit
CN112446470A (en) * 2020-11-12 2021-03-05 北京工业大学 Reinforced learning method for coherent synthesis
CN112446470B (en) * 2020-11-12 2024-05-28 北京工业大学 Reinforced learning method for coherent synthesis
CN112256739B (en) * 2020-11-12 2022-11-18 同济大学 Method for screening data items in dynamic flow big data based on multi-armed bandit
CN112435275A (en) * 2020-12-07 2021-03-02 中国电子科技集团公司第二十研究所 Unmanned aerial vehicle maneuvering target tracking method integrating Kalman filtering and DDQN algorithm
CN112670982B (en) * 2020-12-14 2022-11-08 广西电网有限责任公司电力科学研究院 Active power scheduling control method and system for micro-grid based on reward mechanism
CN112670982A (en) * 2020-12-14 2021-04-16 广西电网有限责任公司电力科学研究院 Active power scheduling control method and system for micro-grid based on reward mechanism
CN112637814A (en) * 2021-01-27 2021-04-09 桂林理工大学 DDoS attack defense method based on trust management
CN113055384A (en) * 2021-03-12 2021-06-29 周口师范学院 SSDDQN network abnormal flow detection method
CN113128705A (en) * 2021-03-24 2021-07-16 北京科技大学顺德研究生院 Intelligent agent optimal strategy obtaining method and device
CN113128705B (en) * 2021-03-24 2024-02-09 北京科技大学顺德研究生院 Method and device for acquiring intelligent agent optimal strategy
CN114374541A (en) * 2021-12-16 2022-04-19 四川大学 Abnormal network flow detector generation method based on reinforcement learning
CN115840363A (en) * 2022-12-06 2023-03-24 上海大学 Denial of service attack method for remote state estimation of cyber-physical system
CN115840363B (en) * 2022-12-06 2024-05-10 上海大学 Denial of service attack method aiming at remote state estimation of cyber-physical system

Also Published As

Publication number Publication date
CN110958135B (en) 2021-07-13

Similar Documents

Publication Publication Date Title
CN110958135B (en) Method and system for eliminating DDoS (distributed denial of service) attack in feature self-adaptive reinforcement learning
CN112181971B (en) Edge-based federated learning model cleaning and equipment clustering method and system
EP3563554B1 (en) System and method for detecting unknown iot device types by monitoring their behavior
Heidari et al. Tight Policy Regret Bounds for Improving and Decaying Bandits.
US20190056983A1 (en) It system fault analysis technique based on configuration management database
CN111092823A (en) Method and system for adaptively adjusting congestion control initial window
CN115943382A (en) Method and apparatus for defending against adversarial attacks on a federated learning system
CN110166344B (en) Identity identification method, device and related equipment
CN110890930A (en) Channel prediction method and related equipment
EP3430767B1 (en) Method and device for real-time network event processing
Dong et al. Secure distributed on-device learning networks with byzantine adversaries
CN114065863A (en) Method, device and system for federal learning, electronic equipment and storage medium
Yan et al. Gaussian process reinforcement learning for fast opportunistic spectrum access
CN107257365B (en) A kind of data download processing method and device
CN112422546A (en) Network anomaly detection method based on variable neighborhood algorithm and fuzzy clustering
CN111491300A (en) Risk detection method, device, equipment and storage medium
CN103455525B (en) The method and apparatus of popularization account number state is determined based on the search popularization behavior of user
CN102271348B (en) Link quality estimation system and method for cyber physical system
CN109583203B (en) Malicious user detection method, device and system
Abolhassani et al. SwiftCache: Model-Based Learning for Dynamic Content Caching in CDNs
Li et al. Adversarial Distributional Reinforcement Learning against Extrapolated Generalization
CN115174130B (en) AGV semantic attack detection method based on HMM
CN111400031B (en) Value function-based reinforcement learning method for processing unit deployment
CN116757726A (en) User network screening method and device
CN116432941A (en) Data long tail characteristic-based method and system for discovering Sybil defense true value

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant