CN114866291B - DDoS defense system and method based on deep reinforcement learning under SDN - Google Patents

DDoS defense system and method based on deep reinforcement learning under SDN

Info

Publication number
CN114866291B
CN114866291B (application CN202210405147.3A)
Authority
CN
China
Prior art keywords
flow
state
network
reinforcement learning
deep reinforcement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210405147.3A
Other languages
Chinese (zh)
Other versions
CN114866291A (en)
Inventor
周海峰
陈述涵
杨明亮
吴春明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210405147.3A priority Critical patent/CN114866291B/en
Publication of CN114866291A publication Critical patent/CN114866291A/en
Application granted granted Critical
Publication of CN114866291B publication Critical patent/CN114866291B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 - Network architectures or network communication protocols for network security
    • H04L63/14 - Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 - Countermeasures against malicious traffic
    • H04L63/1458 - Denial of Service
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/50 - Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a DDoS attack active defense system and method based on deep reinforcement learning under an SDN architecture. State features are collected from the edge switches, network features are extracted from the dynamic environment by a near-end policy optimization (proximal policy optimization, PPO) algorithm, and a defense decision is made for each flow, namely the proportion of the flow that is allowed to pass is determined, so that malicious traffic is discarded as far as possible; the deep reinforcement learning action is checked against network constraint conditions, which improves the robustness of the method and completes active defense against DDoS attacks. The construction method of the invention is simple, flexible to implement and highly efficient.

Description

DDoS defense system and method based on deep reinforcement learning under SDN
Technical Field
The invention belongs to the field of active network security defense under SDN, and particularly relates to a DDoS attack active defense system and method based on deep reinforcement learning under an SDN architecture.
Background
The number of DDoS attack events continues to grow year by year, and such attacks feature extremely high attack traffic and short attack duration, so it is important to take defensive measures in time before an attack ramps up. Thanks to the advantages of the Software Defined Network (SDN) architecture in defending against DDoS attacks, such as its flexible programmability and centralized control, methods based on statistical models and machine learning models can effectively defend against DDoS attacks in SDN. However, these methods offer poor real-time performance: when the attack characteristics change, the existing model becomes ineffective and samples must be re-collected and the model rebuilt. The advent of deep reinforcement learning provides an opportunity to defend against DDoS attacks effectively and in real time. At the same time, deep reinforcement learning runs in a black-box manner and relies on an opaque data-driven model, so the defense effect against DDoS attacks based on deep reinforcement learning can vary widely; it is therefore important to take the efficiency, robustness and real-time performance of DDoS attack defense into account.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a DDoS attack active defense system and method based on deep reinforcement learning under an SDN architecture.
The aim of the invention is achieved by the following technical solution: a DDoS attack active defense system based on deep reinforcement learning under an SDN architecture comprises an SDN controller, edge switches and a deep reinforcement learning agent processing module; the SDN controller comprises a network state collection module, a defensive action execution module and a feedback acquisition module. The defense process is converted into a Markov decision process: a network view is established through the SDN controller, and network feature information is collected on the edge switches in real time to reflect the current network request state. Based on a near-end policy optimization algorithm in deep reinforcement learning, network features are extracted from the dynamic environment and the state of each flow is mapped to a defense decision, ensuring that normal traffic passes and malicious traffic is discarded, thereby realizing active defense against DDoS attacks. The deep neural networks are trained through interaction between the deep reinforcement learning agent and the network, and the defense strategy is optimized from experience.
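For illustration, a minimal sketch of the Markov decision process interface implied by this scheme is given below. The class and method names (Transition, DefenseEnvironment, step) are assumptions introduced only to show how state, action, reward and the next state are exchanged between the SDN environment and the agent; they are not defined by the invention.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transition:
    """One step of the Markov decision process used for the defense."""
    state: List[float]       # flattened switch port and per-flow statistics
    action: List[float]      # allowed-pass proportion of every flow, in [0.05, 1]
    reward: float            # 0.9*p_n + 0.1*(1 - p_m), from the feedback module
    next_state: List[float]  # statistics collected a time interval delta_t later


class DefenseEnvironment:
    """Hypothetical wrapper around the SDN controller modules."""

    def step(self, action: List[float]) -> Tuple[List[float], float]:
        """Apply the constraint-checked action, wait delta_t, and return
        (next_state, reward) as produced by the feedback acquisition module."""
        raise NotImplementedError  # would call the controller's northbound APIs
```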
Further, the system is characterized as follows:
the network state collection module actively requests the state information of the edge switches and obtains the returned state information after a time interval Δt;
the deep reinforcement learning agent processing module is implemented based on the near-end policy optimization algorithm; its input state consists of the switch port information and the per-flow information collected by the network state collection module, and its output action represents, for each flow on an edge switch, the proportion of that flow's traffic that is allowed to pass; the allowed proportion for malicious traffic approaches 0%, and the allowed proportion for normal traffic approaches 100%;
the defensive action execution module verifies the action output by the deep reinforcement learning agent processing module using a bandwidth reallocation method: it reallocates bandwidth to each flow according to the amount of traffic originally allowed to pass, and adds a constraint condition during bandwidth reallocation; the constraint condition is that the total traffic allowed to be sent to the server through the edge switches does not exceed the available bandwidth of the server;
feedback acquisition module: after the defensive action execution module finishes executing, it triggers the feedback acquisition module; the feedback acquisition module then calls the network state collection module to actively request the state information of the edge switches and of the server, and obtains the next returned network state information state' after a time interval Δt. Then, combining the flow information passing through the edge switches and the flow information reaching the server in the previous network state information state collected by the network state collection module, it calculates the malicious traffic proportion p_m and the normal traffic proportion p_n, and computes the reward function value reward from these two proportions. The feedback acquisition module feeds the next network state information state' and the reward value reward back to the deep reinforcement learning agent processing module.
Further, when the SDN controller does not collect the state of a certain flow, the state of the corresponding flow is 0, and the proportion of traffic allowed for that flow is also 0.
Further, the proportion of traffic that a certain flow is allowed to pass on an edge switch lies in the range [0.05, 1].
Further, the deep reinforcement learning agent processing module comprises an actor neural network A, an actor neural network B, a critic neural network and a memory pool, specifically:
(2.1) the actor neural network A is responsible for interacting with the network environment; it takes state as input, outputs the standard deviation σ and the mean μ of the action distribution, and obtains action by random sampling from the corresponding normal distribution;
(2.2) the update of the neural networks in the deep reinforcement learning agent processing module depends on the set of samples collected in the memory pool; after the defensive action execution module executes the defensive action, feedback information is collected from the feedback acquisition module, comprising the next network state information state' and the reward value reward, and the tuple (state, action, reward, state') is stored in the memory pool; after f_1 groups of samples have been stored in the memory pool, the difference between the actual return of the samples and the state value function is taken as the advantage function value A, the mean square error of A is used as the loss value and back-propagated, and the parameters of the critic neural network are updated; this update process is trained f_2 times;
(2.3) the probability of the action under the distribution output by the actor neural network A and its probability under the distribution output by the actor neural network B are used to form the action probability ratio, denoted ratio; the loss value for updating the actor neural network B is loss = min(ratio·A, clip(1-e, 1+e, ratio)·A), where e is a user-defined value and the clip() function limits ratio to the range (1-e, 1+e); this update process is trained f_3 times;
(2.4) every f_1·f_3 training steps of the whole near-end policy optimization algorithm, the parameter values of the actor neural network B are assigned to the actor neural network A, completing the update of the actor neural network A.
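The structure in (2.1) to (2.4) can be sketched as follows. This is an illustrative PyTorch-style outline that assumes the action distribution is a diagonal Gaussian whose samples are clipped to [0.05, 1]; the layer sizes, activation functions and the sigmoid/softplus heads are choices made here for the sketch and are not specified by the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianActor(nn.Module):
    """Actor network (A or B): maps the network state to the mean and standard
    deviation of the per-flow allowed-pass proportions."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.mu_head = nn.Linear(hidden, action_dim)
        self.sigma_head = nn.Linear(hidden, action_dim)

    def forward(self, state: torch.Tensor) -> torch.distributions.Normal:
        h = self.body(state)
        mu = torch.sigmoid(self.mu_head(h))              # mean kept inside (0, 1)
        sigma = F.softplus(self.sigma_head(h)) + 1e-4    # strictly positive std
        return torch.distributions.Normal(mu, sigma)


class Critic(nn.Module):
    """Critic network: estimates the state value used for the advantage A."""

    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, 1))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.v(state)


def select_action(actor_a: GaussianActor, state: torch.Tensor) -> torch.Tensor:
    """Actor A interacts with the environment: sample an action from the output
    distribution and clip it to the allowed-pass range [0.05, 1]."""
    with torch.no_grad():
        action = actor_a(state).sample()
    return torch.clamp(action, 0.05, 1.0)
```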
Further, the defensive action execution module performs security defenses against two network conditions:
firstly, when the total traffic allowed to pass through the edge switches is larger than the available bandwidth of the server, the defensive action execution module reduces the amount of traffic that can reach the server according to the constraint condition;
secondly, when the total traffic allowed to pass through the edge switches is smaller than the available bandwidth of the server, the defensive action execution module allocates the remaining bandwidth such that the proportion given to normal traffic is greater than that given to malicious traffic, comprising the following step:
according to the original set TR of allowed traffic output by the deep reinforcement learning agent processing module, and under the constraint condition, the bandwidth allowed for each flow is reallocated based on the Softmax function; the reallocated set TR' of allowed traffic is assigned to the meter tables bound to the corresponding flow tables, and traffic exceeding the meter table set value is discarded.
Further, the constraint is that the total traffic allowed through the edge switches is limited to within 95% of the server load U_S.
Further, the feedback acquisition module calculates a reward function value reward:
reward = 0.9·p_n + 0.1·(1 − p_m).
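As a small helper, the reward can be transcribed directly; how the proportions p_n and p_m are derived from the switch and server counters is an assumption here (fraction of the normal, respectively malicious, traffic sent through the edge switches that actually reached the server during the last Δt).

```python
def compute_reward(normal_sent: float, normal_arrived: float,
                   malicious_sent: float, malicious_arrived: float) -> float:
    """reward = 0.9*p_n + 0.1*(1 - p_m), where p_n (p_m) is the fraction of
    normal (malicious) traffic that reached the server in the last interval."""
    p_n = normal_arrived / normal_sent if normal_sent > 0 else 0.0
    p_m = malicious_arrived / malicious_sent if malicious_sent > 0 else 0.0
    return 0.9 * p_n + 0.1 * (1.0 - p_m)
```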
A DDoS attack active defense method based on deep reinforcement learning under an SDN architecture comprises the following steps:
(1) Environment initialization: initialize the total number of training episodes episodes and the number of training steps per episode steps; set the current episode episode = 1 and the current step step = 1;
(2) Initialize the packet size and sending interval of each user, and set the current training step to step = 1;
(3) The SDN controller actively sends a message requesting, within the interval Δt, the state of the edge switches; the state comprises the port information of the switch and the information of flows passing through the switch;
(4) Parse the state information obtained in step (3);
(5) Judge whether step < steps holds;
(6) If step ≥ steps, determine whether episode < episodes holds: if so, increment the episode count by 1 and return to step (2); if not, end;
(7) If step < steps, take the network state information parsed in step (4) as the input of the near-end policy optimization algorithm and output the set TR of proportions in which the corresponding flows are allowed to pass;
(8) Verify the action output by the near-end policy optimization algorithm using the bandwidth reallocation method, and reallocate the available bandwidth of the server based on the Softmax function to obtain the set TR' of available bandwidth values for each flow;
(9) Assign TR' to the meter table rate limit of the corresponding flow; traffic exceeding the limit is discarded;
(10) The SDN controller actively requests, within Δt, the state information of the edge switches and the state information of the server, and calculates the malicious traffic proportion p_m and the normal traffic proportion p_n by combining the flow information passing through the edge switches and the flow information reaching the server in state; the reward function value reward is calculated from these two proportions;
(11) Store the current training data (state, action, reward, state') in the memory pool; every time the memory pool has collected f_1 groups of data, one update of the neural network parameters in the near-end policy optimization algorithm is performed;
(12) Increment the step count by 1 and let state = state' as the input of the near-end policy optimization algorithm in the next training step; return to step (5) for the next training step until both the training steps and the training episodes reach their maximum.
The beneficial effects of the invention are as follows: the invention collects the real-time data features (flow features and port features) of the edge switches as the input of the near-end policy optimization algorithm, completes the mapping from flow state to the proportion of traffic allowed to pass, intelligently decides the allowed proportion of each flow, actively discards malicious traffic in real time while passing normal traffic as far as possible, and adjusts the decision with the constraint condition that the total traffic reaching the server must be smaller than the server load; specifically, the bandwidth originally allowed for each flow is reallocated based on the Softmax function. The near-end policy optimization algorithm of deep reinforcement learning is adopted to realize active defense against DDoS attacks, and the decision adjustment process avoids wrong or dangerous decisions and ensures the efficiency and robustness of DDoS attack defense. The method is simple, flexible to implement and highly practical.
Drawings
FIG. 1 is a schematic diagram of the DDoS attack active defense system of the present invention;
FIG. 2 is a flow chart of the DDoS attack active defense method of the present invention.
Detailed Description
As shown in FIG. 1, the DDoS attack active defense system based on deep reinforcement learning under an SDN architecture comprises an SDN controller, edge switches and a deep reinforcement learning agent processing module; the SDN controller comprises a network state collection module, a defensive action execution module and a feedback acquisition module. The invention converts the defense process into a Markov decision process, establishes a network view through the SDN controller, and collects network feature information (flow features) on the edge switches in real time to accurately reflect the current network request state. Network features are extracted from the dynamic environment by the near-end policy optimization algorithm in deep reinforcement learning, and the state of each flow is mapped to a defense decision, ensuring that normal traffic passes and malicious traffic is discarded, thereby realizing active defense against DDoS attacks. The deep neural networks are trained through interaction between the deep reinforcement learning agent and the network, and the defense strategy is optimized from experience. This reduces the performance difference in passing normal traffic and discarding malicious traffic across dynamically changing network states and improves the robustness of the defense method.
In an embodiment of the invention, the SDN controller is implemented based on OpenDaylight (ODL) and the edge switches are implemented based on Open vSwitch (OvS). The network environment includes K edge switches, and there are at most P flows on each edge switch. The state information of a switch includes its port information (the numbers of packets and bytes sent and received on each port) and the information of flows passing through the switch (the packet count and byte count of flows whose destination address is the server). The state information of the server includes its port information (the numbers of packets and bytes on each port) and the information of flows reaching the server (the packet count and byte count of flows addressed to the server). The time Δt for the controller to request and collect state information is chosen to be 0.5 s. The system of the invention is trained for 2000 rounds (episodes) in total, each round comprising 200 steps; each round reinitializes the network environment, including the packet size and sending interval of the users, and in each step the network state collection module, the defensive action execution module and the feedback acquisition module are called once in sequence.
(1) The network state collection module actively sends an OFPT_STATS_REQUEST message to request the state information of the edge switch, and obtains the returned state information from the OFPT_STATS_REPLY message after a time interval Δt; the controller then sends the state information to the deep reinforcement learning agent processing module in JSON format.
(2) The deep reinforcement learning agent processing module is implemented based on the near-end policy optimization algorithm. Its input state consists of the switch port information E and the per-flow information F collected by the network state collection module, state = [(E_1, F_1), ..., (E_K, F_K)], where F_k = [f_k1, f_k2, ..., f_kP]; when the state of a certain flow is not collected, the corresponding f_kp is 0. The output of the module is action = [(a_11, ..., a_1P), ..., (a_K1, ..., a_KP)], where a_kp is the proportion of traffic allowed to pass for the p-th flow on the k-th edge switch and lies in the range [0.05, 1], with k = 1, ..., K and p = 1, ..., P; when f_kp is 0, the corresponding a_kp is also 0. The allowed proportion for malicious traffic approaches 0%, and the allowed proportion for normal traffic approaches 100%. For example, an action value equal to 0.4 indicates that 40% of the traffic of that flow is allowed to pass.
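A possible in-memory layout of these state and action structures is sketched below, assuming K = 3 edge switches and at most P = 8 flows per switch; the number of port and flow features per entry is an illustrative assumption, since the text only fixes the logical structure.

```python
import numpy as np

K, P = 3, 8           # example: 3 edge switches, at most 8 flows per switch
PORT_FEATURES = 4     # e.g. packets/bytes sent and received, aggregated per switch
FLOW_FEATURES = 2     # packet count and byte count of each flow

# state = [(E_1, F_1), ..., (E_K, F_K)]: port features followed by per-flow features.
state = np.zeros((K, PORT_FEATURES + P * FLOW_FEATURES), dtype=np.float32)

# action = [(a_11, ..., a_1P), ..., (a_K1, ..., a_KP)]:
# allowed-pass proportion of flow p on switch k, each in [0.05, 1].
action = np.full((K, P), 0.05, dtype=np.float32)

# A flow whose statistics were not collected keeps f_kp = 0 and forces a_kp = 0.
missing = state[:, PORT_FEATURES:].reshape(K, P, FLOW_FEATURES).sum(axis=2) == 0
action[missing] = 0.0
```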
The deep reinforcement learning agent processing module comprises an actor (Actor) neural network A, an actor (Actor) neural network B, a critic (Critic) neural network and a memory pool.
(2.1) The actor neural network A is responsible for interacting with the network environment; it takes state as input, outputs the standard deviation σ and the mean μ of the action distribution, and obtains action by random sampling from the corresponding normal distribution.
(2.2) The update of the neural networks in the deep reinforcement learning agent processing module depends on the set of samples collected in the memory pool. After the defensive action execution module executes the defensive action, feedback information is collected from the feedback acquisition module, comprising the next network state information state' and the reward value reward, and the tuple (state, action, reward, state') is stored in the memory pool. After 4 groups of samples have been stored in the memory pool, the difference between the actual return of the samples and the state value function is taken as the advantage function value A; the mean square error of A is used as the loss value, the parameters of the critic neural network are updated through back-propagation, and this update process is trained 4 times.
(2.3) The probability of the action under the distribution output by the actor neural network A and its probability under the distribution output by the actor neural network B are used to form the action probability ratio, denoted ratio. The loss value for updating the actor neural network B is loss = min(ratio·A, clip(1-e, 1+e, ratio)·A), where e takes the value 0.2 and the clip() function limits ratio to the range (0.8, 1.2). This update process is trained 4 times.
(2.4) After every 16 training steps of the whole near-end policy optimization algorithm, the parameter values of the actor neural network B are assigned to the actor neural network A, completing the update of the actor neural network A.
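The updates in (2.2) to (2.4) correspond to a clipped-surrogate policy update; a compact sketch with the embodiment's values (4 samples per update, e = 0.2, critic and actor each trained 4 times, actor B copied into actor A every 16 steps) is given below. It reuses the GaussianActor and Critic classes from the earlier sketch; the choice of optimizer and the use of a pre-computed return tensor as the actual return of the samples are assumptions of this sketch.

```python
import torch


def ppo_update(actor_a, actor_b, critic, optim_actor, optim_critic,
               batch, eps: float = 0.2, critic_epochs: int = 4, actor_epochs: int = 4):
    """One update, triggered after f_1 = 4 transitions have been stored.
    `batch` is (states, actions, returns): tensors stacked from the memory pool,
    where `returns` plays the role of the actual return of the samples."""
    states, actions, returns = batch

    # (2.2) Critic update: advantage A = return - V(state); MSE of A is the loss.
    for _ in range(critic_epochs):                       # trained f_2 = 4 times
        advantage = returns - critic(states).squeeze(-1)
        critic_loss = (advantage ** 2).mean()
        optim_critic.zero_grad()
        critic_loss.backward()
        optim_critic.step()

    with torch.no_grad():
        advantage = returns - critic(states).squeeze(-1)
        logp_a = actor_a(states).log_prob(actions).sum(-1)   # probabilities under actor A

    # (2.3) Actor B update with the clipped probability ratio, e = 0.2.
    for _ in range(actor_epochs):                        # trained f_3 = 4 times
        logp_b = actor_b(states).log_prob(actions).sum(-1)
        ratio = (logp_b - logp_a).exp()
        surrogate = torch.min(ratio * advantage,
                              torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage)
        actor_loss = -surrogate.mean()                   # minimise the negated objective
        optim_actor.zero_grad()
        actor_loss.backward()
        optim_actor.step()


def sync_actors(actor_a, actor_b):
    """(2.4) Every 16 training steps, copy actor B's parameters into actor A."""
    actor_a.load_state_dict(actor_b.state_dict())
```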
(3) The defensive action execution module verifies the action output by the deep reinforcement learning agent processing module using the bandwidth reallocation method: it reallocates bandwidth to each flow according to the amount of traffic originally allowed to pass, and adds a constraint condition during bandwidth reallocation, namely that the total traffic allowed to be sent to the server through the edge switches does not exceed the available bandwidth of the server.
Specifically, security defenses are made against two network conditions:
First, when the total traffic allowed to pass through the edge switches is larger than the available bandwidth of the server, the defensive action execution module reduces the traffic that can reach the server according to the constraint condition, protecting the server from overload.
Second, when the total traffic allowed to pass through the edge switches is smaller than the available bandwidth of the server: since an attacker launching a DDoS attack wants to overload the server while recruiting as few agents as possible, a malicious host tends to send more traffic than a normal host in a short time; allocating the remaining bandwidth so that the proportion given to normal traffic is larger than that given to malicious traffic therefore effectively improves the pass rate of normal traffic. The specific steps are as follows:
After the deep reinforcement learning agent processing module outputs the action giving the amount of traffic originally allowed to pass, the bandwidth allowed for each flow is reallocated based on the Softmax function, and the total traffic allowed through the edge switches is limited to within 95% of the server load U_S. Let the set of allowed traffic of the original γ flows be TR = [tr_1, tr_2, ..., tr_γ]; the reallocated set TR' is given by:
tr'_i = 0.95 · U_S · exp(tr_i) / Σ_{j=1}^{γ} exp(tr_j),  i = 1, 2, ..., γ
TR' = [tr'_1, tr'_2, ..., tr'_γ]
and assigning the redistributed TR' to the meter tables bound with the flow tables respectively, so as to achieve the effects of ensuring normal flow to pass and limiting malicious flow. This is accomplished by inserting, deleting and updating flow tables through SalFlowService and SalMeterService's APIs in OpenFlowPlugin, and traffic exceeding the meter table set point will be discarded.
(4) Feedback acquisition module: after the defensive action execution module finishes executing, it triggers the feedback acquisition module; the feedback acquisition module then calls the network state collection module to actively send OFPT_STATS_REQUEST messages requesting the state information of the edge switches and of the server, and obtains the next returned network state information state' from OFPT_STATS_REPLY after a time interval Δt. Then, combining the flow information passing through the edge switches and the flow information reaching the server in the previous network state information state collected by the network state collection module, it calculates the malicious traffic proportion p_m and the normal traffic proportion p_n. From these two proportions, the reward function value is calculated as reward = 0.9·p_n + 0.1·(1 − p_m). The feedback acquisition module feeds the next network state information state' and the reward value reward back to the deep reinforcement learning agent processing module.
As shown in FIG. 2, the DDoS attack active defense method based on deep reinforcement learning under the SDN architecture is specifically implemented as follows:
(1) Environment initialization. Initialize the total number of training episodes episodes = 2000 and the number of training steps per episode steps = 200; set the current episode episode = 1 and the current step step = 1.
(2) The packet size and sending interval of each user are initialized, and the current training step is set to step = 1.
(3) The OpenDaylight controller actively issues a message requesting the state of the edge switches, including the port information of the switch and the information of flows passing through the switch, with Δt = 0.5 s.
(4) Parse the state information obtained in step (3), which comprises the packet count and byte count of each flow whose destination address is the server, and the numbers of packets and bytes sent and received on each port of the edge switch. This state information effectively reflects the requests currently being sent to the server and whether the server is congested.
(5) It is determined whether step < steps is satisfied.
(6) If step ≥ steps, determine whether episode < episodes holds: if so, increment the episode count by 1 and return to step (2); if not, end.
(7) If step < steps, the network state information parsed in step (4) is taken as the input of the near-end policy optimization algorithm, and the set TR of proportions in which the corresponding flows are allowed to pass is output.
(8) Verify the action output by the near-end policy optimization algorithm using the bandwidth reallocation method, and reallocate 95% of the available bandwidth of the server based on the Softmax function to obtain the set TR' of available bandwidth values for each flow.
(9) Assign TR' to the meter table rate limit of the corresponding flow; excess traffic is discarded.
(10) The OpenDaylight controller actively requests, within Δt, the state information state' of the edge switches and the state information of the server, and calculates the malicious traffic proportion p_m and the normal traffic proportion p_n by combining the flow information passing through the edge switches and the flow information reaching the server in state. From these two proportions, the reward function value reward is calculated.
(11) Store the current training data (state, action, reward, state') in the memory pool; every time 4 groups of data have been collected in the memory pool, one update of the neural network parameters in the near-end policy optimization algorithm is performed.
(12) Increment the step count by 1 and let state = state' as the input of the near-end policy optimization algorithm in the next training step; return to step (5) for the next training step until both the training steps and the training episodes reach their maximum.
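The twelve steps above can be summarised by the following loop skeleton. The env and agent objects and their method names are placeholders standing for the controller interactions and the PPO sketches given earlier, so this is an outline under assumptions rather than the implementation itself.

```python
EPISODES, STEPS, DELTA_T = 2000, 200, 0.5    # values used in this embodiment
MEMORY_BATCH = 4                             # f_1: transitions per PPO update


def train(env, agent):
    """env wraps the OpenDaylight controller (state requests, meter updates,
    reward computation); agent wraps the PPO networks and the memory pool."""
    for episode in range(1, EPISODES + 1):
        env.reset_hosts()                               # step (2): packet size/interval
        state = env.request_state(DELTA_T)              # steps (3)-(4)
        for step in range(1, STEPS + 1):                # steps (5)-(6)
            tr = agent.act(state)                       # step (7): allowed proportions
            tr_prime = env.apply_constraint(tr)         # step (8): Softmax reallocation
            env.set_meters(tr_prime)                    # step (9): meter rate limits
            next_state, reward = env.observe(DELTA_T)   # step (10)
            agent.store(state, tr, reward, next_state)  # step (11)
            if agent.memory_size() >= MEMORY_BATCH:
                agent.update()                          # PPO parameter update
            state = next_state                          # step (12)
```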
The foregoing description is only of the preferred embodiments of the invention, and all changes and modifications that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (9)

1. The DDoS attack active defense system based on deep reinforcement learning under an SDN architecture is characterized by comprising an SDN controller, edge switches and a deep reinforcement learning agent processing module; the SDN controller comprises a network state collection module, a defensive action execution module and a feedback acquisition module; the defense process is converted into a Markov decision process, a network view is established through the SDN controller, and network feature information is collected on the edge switches in real time to reflect the current network request state; based on a near-end policy optimization algorithm in deep reinforcement learning, network features are extracted from the dynamic environment and the state of each flow is mapped to a defense decision, ensuring that normal traffic passes and malicious traffic is discarded, thereby realizing active defense against DDoS attacks; the deep neural networks are trained through interaction between the deep reinforcement learning agent and the network, and the defense strategy is optimized from experience so as to reduce the performance difference in passing normal traffic and discarding malicious traffic across dynamically changing network states;
the network state collection module actively requests the state information of the edge switches and obtains the returned state information after a time interval Δt;
the defensive action execution module is used for verifying the action output by the deep reinforcement learning agent processing module using a bandwidth reallocation method, reallocating bandwidth to each flow according to the amount of traffic originally allowed to pass, and adding a constraint condition during the bandwidth reallocation process; the constraint condition is that the total traffic allowed to be sent to the server through the edge switches does not exceed the available bandwidth of the server;
the feedback acquisition module is used as follows: after the defensive action execution module finishes executing, it triggers the feedback acquisition module; the feedback acquisition module then calls the network state collection module to actively request the state information of the edge switches and of the server, and obtains the next returned network state information state' after a time interval Δt; then, combining the flow information passing through the edge switches and the flow information reaching the server in the previous network state information state collected by the network state collection module, it calculates the malicious traffic proportion p_m and the normal traffic proportion p_n, and computes the reward function value reward from these two proportions; the feedback acquisition module feeds the next network state information state' and the reward value reward back to the deep reinforcement learning agent processing module.
2. The DDoS attack active defense system based on deep reinforcement learning under the SDN architecture of claim 1, comprising:
the deep reinforcement learning agent processing module is implemented based on the near-end policy optimization algorithm; its input state consists of the switch port information and the per-flow information collected by the network state collection module, and its output action represents, for each flow on an edge switch, the proportion of that flow's traffic that is allowed to pass; the allowed proportion for malicious traffic approaches 0%, and the allowed proportion for normal traffic approaches 100%.
3. The DDoS attack active defense system based on deep reinforcement learning under the SDN architecture of claim 1, wherein when the SDN controller does not collect a state of a certain flow, the state of the corresponding flow is 0, and the proportion of traffic allowed by the corresponding flow is also 0.
4. The DDoS attack active defense system based on deep reinforcement learning under the SDN architecture of claim 1, wherein the proportion of traffic that a certain flow is allowed to pass on an edge switch lies in the range [0.05, 1].
5. The DDoS attack active defense system based on deep reinforcement learning under the SDN architecture of claim 2, wherein the deep reinforcement learning agent processing module comprises an actor neural network A, an actor neural network B, a critic neural network and a memory pool, specifically:
(2.1) the actor neural network A is responsible for interacting with the network environment; it takes state as input, outputs the standard deviation σ and the mean μ of the action distribution, and obtains action by random sampling from the corresponding normal distribution;
(2.2) the update of the neural networks in the deep reinforcement learning agent processing module depends on the set of samples collected in the memory pool; after the defensive action execution module executes the defensive action, feedback information is collected from the feedback acquisition module, comprising the next network state information state' and the reward value reward, and the tuple (state, action, reward, state') is stored in the memory pool; after f_1 groups of samples have been stored in the memory pool, the difference between the actual return of the samples and the state value function is taken as the advantage function value A, the mean square error of A is used as the loss value and back-propagated, and the parameters of the critic neural network are updated; this update process is trained f_2 times;
(2.3) the probability of the action under the distribution output by the actor neural network A and its probability under the distribution output by the actor neural network B are used to form the action probability ratio, denoted ratio; the loss value for updating the actor neural network B is loss = min(ratio·A, clip(1-e, 1+e, ratio)·A), where e is a user-defined value and the clip() function limits ratio to the range (1-e, 1+e); this update process is trained f_3 times;
(2.4) every f_1·f_3 training steps of the whole near-end policy optimization algorithm, the parameter values of the actor neural network B are assigned to the actor neural network A, completing the update of the actor neural network A.
6. The DDoS attack active defense system based on deep reinforcement learning under SDN architecture of claim 2, wherein the defense action execution module performs security defense for two network conditions:
firstly, when the total traffic allowed to pass through the edge switches is larger than the available bandwidth of the server, the defensive action execution module reduces the amount of traffic that can reach the server according to the constraint condition;
secondly, when the total traffic allowed to pass through the edge switches is smaller than the available bandwidth of the server, the defensive action execution module allocates the remaining bandwidth such that the proportion given to normal traffic is greater than that given to malicious traffic, comprising the following step:
according to the original set TR of allowed traffic output by the deep reinforcement learning agent processing module, and under the constraint condition, the bandwidth allowed for each flow is reallocated based on the Softmax function; the reallocated set TR' of allowed traffic is assigned to the meter tables bound to the corresponding flow tables, and traffic exceeding the meter table set value is discarded.
7. The DDoS attack active defense system based on deep reinforcement learning under the SDN architecture of claim 6, wherein the constraint is that the total traffic allowed through the edge switches is limited to within 95% of the server load U_S.
8. The DDoS attack active defense system based on deep reinforcement learning under the SDN architecture of claim 2, wherein the feedback acquisition module calculates a reward function value reward:
reward = 0.9·p_n + 0.1·(1 − p_m).
9. The DDoS attack active defense method based on deep reinforcement learning under the SDN architecture is characterized by comprising the following steps:
(1) Environment initialization: initialize the total number of training episodes episodes and the number of training steps per episode steps; set the current episode episode = 1 and the current step step = 1;
(2) Initialize the packet size and sending interval of each user, and set the current training step to step = 1;
(3) The SDN controller actively sends a message requesting, within the interval Δt, the state of the edge switches; the state comprises the port information of the switch and the information of flows passing through the switch;
(4) Parse the state information obtained in step (3);
(5) Judge whether step < steps holds;
(6) If step ≥ steps, determine whether episode < episodes holds: if so, increment the episode count by 1 and return to step (2); if not, end;
(7) If step < steps, take the network state information parsed in step (4) as the input of the near-end policy optimization algorithm and output the set TR of proportions in which the corresponding flows are allowed to pass;
(8) Verify the action output by the near-end policy optimization algorithm using the bandwidth reallocation method, and reallocate the available bandwidth of the server based on the Softmax function to obtain the set TR' of available bandwidth values for each flow;
(9) Assign TR' to the meter table rate limit of the corresponding flow; traffic exceeding the limit is discarded;
(10) The SDN controller actively requests, within Δt, the state information of the edge switches and the state information of the server, and calculates the malicious traffic proportion p_m and the normal traffic proportion p_n by combining the flow information passing through the edge switches and the flow information reaching the server in state; the reward function value reward is calculated from these two proportions;
(11) Store the current training data (state, action, reward, state') in the memory pool; every time the memory pool has collected f_1 groups of data, one update of the neural network parameters in the near-end policy optimization algorithm is performed;
(12) Increment the step count by 1 and let state = state' as the input of the near-end policy optimization algorithm in the next training step; return to step (5) for the next training step until both the training steps and the training episodes reach their maximum.
CN202210405147.3A 2022-04-18 2022-04-18 DDoS defense system and method based on deep reinforcement learning under SDN Active CN114866291B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210405147.3A CN114866291B (en) 2022-04-18 2022-04-18 DDoS defense system and method based on deep reinforcement learning under SDN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210405147.3A CN114866291B (en) 2022-04-18 2022-04-18 DDoS defense system and method based on deep reinforcement learning under SDN

Publications (2)

Publication Number Publication Date
CN114866291A CN114866291A (en) 2022-08-05
CN114866291B true CN114866291B (en) 2023-06-23

Family

ID=82630532

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210405147.3A Active CN114866291B (en) 2022-04-18 2022-04-18 DDoS defense system and method based on deep reinforcement learning under SDN

Country Status (1)

Country Link
CN (1) CN114866291B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117579384B (en) * 2024-01-16 2024-03-29 杭州智顺科技有限公司 Network security operation and command system based on actual combat

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112491714A (en) * 2020-11-13 2021-03-12 安徽大学 Intelligent QoS route optimization method and system based on deep reinforcement learning in SDN environment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109768981B (en) * 2019-01-20 2021-02-02 北京工业大学 Network attack defense method and system based on machine learning under SDN architecture
CN111740950A (en) * 2020-05-13 2020-10-02 南京邮电大学 SDN environment DDoS attack detection and defense method
CN113452695A (en) * 2021-06-25 2021-09-28 中国舰船研究设计中心 DDoS attack detection and defense method in SDN environment
CN114363093B (en) * 2022-03-17 2022-10-11 浙江君同智能科技有限责任公司 Honeypot deployment active defense method based on deep reinforcement learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112491714A (en) * 2020-11-13 2021-03-12 安徽大学 Intelligent QoS route optimization method and system based on deep reinforcement learning in SDN environment

Also Published As

Publication number Publication date
CN114866291A (en) 2022-08-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant