CN113395207B - Deep reinforcement learning-based route optimization framework and method under SDN framework - Google Patents

Deep reinforcement learning-based route optimization framework and method under SDN framework

Info

Publication number
CN113395207B
CN113395207B (application CN202110663396.8A)
Authority
CN
China
Prior art keywords
network
switch
state
actor
forwarding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110663396.8A
Other languages
Chinese (zh)
Other versions
CN113395207A (en)
Inventor
霍如
沙宗轩
汪硕
黄韬
刘韵洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202110663396.8A
Publication of CN113395207A
Application granted
Publication of CN113395207B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00: Routing or path finding of packets in data switching networks
    • H04L45/12: Shortest path evaluation
    • H04L45/124: Shortest path evaluation using a combination of metrics
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14: Network analysis or design
    • H04L41/145: Network analysis or design involving simulating, designing, planning or modelling of a network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a routing optimization framework and method based on deep reinforcement learning under an SDN architecture, addressing the problem that traditional methods rely on static rules, cannot adapt to today's dynamically changing network environments, and therefore achieve low data transmission efficiency. Under the SDN architecture, and taking into account the impact that a growing number of flow table entries in forwarding devices has on packet forwarding efficiency, the method designs a neural network to estimate the forwarding performance of each device, treats the estimated value together with network parameters as input variables, and uses an Actor-Critic algorithm to generate a more reasonable routing plan. The invention also designs an expert sample generation module to guide and accelerate model training, as well as a filter layer tied to the availability of current network links, which strengthens the robustness of the algorithm. The invention is thus better suited to the modern network environment, which is characterized by high complexity and rapid change, realizes reasonable traffic scheduling, balances network link load, and improves data transmission efficiency.

Description

Deep reinforcement learning-based route optimization framework and method under SDN framework
Technical Field
The invention belongs to the technical field of networks, and particularly relates to a routing optimization framework and a method based on deep reinforcement learning under an SDN framework.
Background
In a communication network, routing planning is an online decision problem for traffic distribution and plays a crucial role in improving network performance. With the explosive growth of intelligent devices and network traffic, the advent of network function virtualization and cloud computing has prompted the industry to revisit traditional network architectures. On one hand, traffic patterns have changed significantly: the static system architecture of the traditional network is not suited to the dynamic computing and storage requirements of environments such as data centers and cloud computing, and cannot meet the need for fine-grained control of data flows in current networks. Software-defined networking (SDN) is a network reconstruction technology that separates the data forwarding process from the logical control process, giving the network considerably more controllability. SDN is open, standardized, and programmable across the full vertical stack of the network and can re-plan the network in a centrally controlled manner without changing hardware, providing a new approach to controlling network traffic. On the other hand, rule-based routing protocols have exposed problems of inapplicability and inefficiency in dynamic network environments. Reinforcement learning (RL) builds an agent that interacts with the environment and learns from experience, maximizing the cumulative reward obtained by the actions it outputs under different states. Because RL can autonomously learn an intelligent control policy, it is widely used in robot control, autonomous driving, games, path planning, and other fields. The invention uses RL to solve the routing planning problem in SDN: the agent updates its policy by interacting with the network environment, so the data transmission route can be adjusted promptly according to different network states, achieving route selection that is more dynamic and intelligent than rule-based routing algorithms.
Planning the route of the current data flow inevitably changes the network state, which in turn affects the route planning of subsequent data flows; this is a classic Markov decision process and can be solved with RL. A reinforcement-learning-based network routing algorithm treats the network as the environment, collects network state information as the state, and the Agent outputs a corresponding behavior (action) according to the current state and its policy. In the routing planning problem, the action is the route of the data transmission. The controller in the SDN builds a flow table according to the action and issues it to the switch, and the switch completes the data transmission. The Agent then obtains network state data, such as link utilization and network load, as the reward, adjusts its policy to increase the probability of outputting actions that yield a larger cumulative reward, and also tries new actions to explore more of the state space, giving the Agent the ability to find better routes. Once model training is complete, the agent can output the optimal data forwarding route for different network states.
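For illustration, the state/action/reward mapping described above can be made concrete with a small interface sketch. Everything in it (the class name RoutingEnv, its method names, the feature sizes) is an assumption for illustration and is not part of the patent; it only mirrors the mapping: the state is the network statistics collected by the controller, the action is a candidate transmission route, and the reward is the feedback measured after the flow table is installed and the data is transmitted.

```python
# Illustrative sketch of the RL-routing mapping described above.
# All names and shapes are assumptions, not part of the patent.
from typing import List, Tuple

class RoutingEnv:
    """State: network statistics collected by the SDN controller.
    Action: index of a candidate transmission route for the current flow.
    Reward: feedback (e.g. based on link utilization and load) measured after
    the controller installs the flow table and the switch transmits the data."""

    def __init__(self, candidate_routes: List[List[int]]):
        self.candidate_routes = candidate_routes   # pre-enumerated paths as node lists

    def get_state(self) -> List[float]:
        # Placeholder for link utilization, delay, jitter, loss rate, switch stats, ...
        return [0.0] * 16

    def step(self, action: int) -> Tuple[List[float], float]:
        route = self.candidate_routes[action]
        reward = self._install_and_transmit(route)   # measured after transmission
        return self.get_state(), reward

    def _install_and_transmit(self, route: List[int]) -> float:
        # Placeholder: build flow table entries for `route`, push them to the
        # switches, transmit the data, and compute the reward from feedback.
        return 0.0
```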
Traditional reinforcement-learning-based routing planning algorithms suffer from incomplete consideration of influencing factors, low training efficiency, and an inability to cope with network faults. Specifically, the traditional algorithm only considers the influence of the current network state on subsequent routing planning, generating routes according to the performance and usage of the different network links so that the performance of the whole network is kept at a good level. In SDN, the controller issues a flow table to the switch, and the switch determines the forwarding rule of a data flow according to the flow table, thereby controlling the data flow. As the number of matching entries grows and depending on switch performance, the time the switch needs to match a flow against the corresponding forwarding rule increases markedly, so the forwarding performance of the switch becomes a non-negligible factor in routing planning. In addition, although reinforcement learning can interact autonomously with the environment and update its policy through continual trials, the algorithm has no experience at the outset; agents typically follow an almost random policy in an unknown environment, so knowledge is learned very inefficiently. Finally, existing reinforcement-learning-based routing planning algorithms lack a module for handling network faults: if the model structure has to be adjusted according to link state, retraining is required, and network robustness is poor.
Disclosure of Invention
In order to address the impact of switch forwarding performance on data transmission efficiency, the invention provides a routing optimization architecture and method based on deep reinforcement learning under an SDN architecture, building on traditional routing planning algorithms. On top of a traditional machine-learning-based routing algorithm, the influence of forwarding device performance on data transmission is taken into account: a deep model estimates the forwarding efficiency of the forwarding device, and the estimate is passed as an input to the deep reinforcement learning (DRL) algorithm, so that the decision output by the agent is affected by the estimated switch forwarding efficiency. Because the factors considered are more comprehensive, the resulting optimal route is more reasonable. The invention also designs expert samples: in a separate environment with the same network state and topology, routes are generated with a mature routing protocol such as OSPF, and the change of network state, the executed route, and the reward are stored in the experience pool as expert samples. The experience pool also stores training samples generated by the agent's own interaction with the environment. After data transmission is completed, the agent draws a portion of the samples from the experience replay pool to train the model. The expert samples guide the training direction of the agent and speed up its initial training. In addition, the invention designs a filter layer in the DRL model that is tied to the current availability of network links. The filter layer represents link availability with a binary vector of the same dimension as the output of the neural network. When a network node or link becomes unavailable, adjusting the values in the filter layer changes the output probability of the optimal route. With these improvements, device forwarding efficiency is taken into account so that a more reasonable optimal route is output, expert samples guide and accelerate model training, and the robustness of the network is enhanced.
The specific technical scheme is as follows:
a routing optimization architecture based on deep reinforcement learning under an SDN architecture is disclosed, specifically, in a control plane, a Controller acquires network state information of a data plane, on one hand, network link state and switch data are transmitted to an Agent, on the other hand, in a plurality of parallel virtual network environments running with the Agent and having the same network topology, the same parameters and the same network state, an existing protocol is utilized to generate a transmission route of the existing protocol in the current network state and generate an expert sample, and the expert sample and a training sample generated by interaction of the Agent and the environment are put into an experience pool; different samples have the same structure, all are<Current network state, output route, reward for feedback, next network state>Four-tuple, represented as<S t ,A t ,R t+1 ,S t+1 >The reward function is defined as: - (max { U } + D s,d ) U is a vector representing the utilization of each link in the current network environment, D s,d Representing the delay of the data flow from the source node s to the destination node d. The goal of the algorithm is to maximize the reward, i.e. to minimize the maximum link utilization and end-to-end transmission delay in the current network.
The Agent adopts an improved Actor-Critic algorithm. The Actor-Critic module comprises an Actor network, a Softmax layer, a network link availability analysis module, and a Critic network; the improvement is that a filter layer is added between the Actor and the Softmax layer. The filter layer is a binary vector whose values are tied to network link availability, and the output of the Actor module is multiplied element-wise (bit by bit) by the filter layer: when a link in the network is unavailable, the position representing that link's availability is 0, otherwise it is 1. The input of the Agent comprises the network link state and switch data. The network link state includes transmission delay, jitter, packet loss rate, bandwidth, and number of flow types; the switch data is the forwarding performance estimated by the switch performance estimation module from the switch state, where the switch state includes throughput, CPU, memory, forwarding delay, packet forwarding rate, number of flow table entries, total number of matching entries, and current flow type. The switch performance estimation module is realized by a 3-layer neural network: the first layer comprises 8 neurons and receives the input vector, the second layer is a fully connected layer of 10 neurons with ReLU activation, and the last layer comprises 1 neuron that outputs the estimate of switch performance.
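The layer sizes of the switch performance estimation module (8 inputs, 10 fully connected hidden units with ReLU, 1 output) follow the text above; the following sketch expresses that network in PyTorch. The module name, the choice of framework, and the absence of input normalization are assumptions for illustration only.

```python
# Sketch of the 3-layer switch performance estimation network described above
# (8 input features -> 10 hidden units with ReLU -> 1 output). Layer sizes follow
# the text; module name, framework, and lack of normalization are assumptions.
import torch
import torch.nn as nn

class SwitchPerformanceEstimator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(8, 10),   # 8 inputs: throughput, CPU, memory, forwarding delay,
                                # packet forwarding rate, flow table entry count,
                                # total matching entries, current flow type
            nn.ReLU(),          # activation of the fully connected hidden layer
            nn.Linear(10, 1),   # single neuron: estimated switch forwarding performance
        )

    def forward(self, switch_state: torch.Tensor) -> torch.Tensor:
        return self.net(switch_state)

# Usage: one feature vector per switch, batched along dimension 0.
estimator = SwitchPerformanceEstimator()
switch_state = torch.rand(4, 8)            # 4 switches, 8 statistics each
performance = estimator(switch_state)      # shape (4, 1)
```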
A routing optimization method based on deep reinforcement learning under an SDN framework comprises the following steps:
(1) The switch receives the service data flow and queries its flow table; if a forwarding rule is matched, the data is forwarded according to that rule and the generated four-tuple <current network state, output route, feedback reward, next network state> is added to the experience pool; if no rule is matched, the method continues with the next step;
(2) The method splits into two parallel branches.
Branch one comprises the following steps in sequence:
the switch sends a flow rule request to the controller;
the controller acquires the current network link state and the switch state information;
the controller passes throughput, CPU, memory, forwarding delay, packet forwarding rate, number of flow table entries, total number of matching entries, and current flow type as input to the switch performance estimation module, which outputs the corresponding estimate of switch performance;
the controller takes the current network link state and the switch performance estimate as input, denoted S_t, and passes S_t to the improved Actor-Critic model; the Actor network outputs an estimate for all transmission links;
the value of the filter layer is determined according to the current network condition;
the vector output by the Actor passes through the filter layer and then the Softmax layer to produce a probability distribution over actions, i.e., the transmission path of the current data stream is determined and denoted A_t;
the controller transmits the data according to A_t, then obtains the network state information, denoted S_{t+1}, and computes the reward value according to the reward function, denoted R_{t+1}; the four-tuple <S_t, A_t, R_{t+1}, S_{t+1}> is stored in the experience replay pool as a training sample;
Branch two comprises the following steps in sequence:
the controller in the virtual network with the same parameters and state generates, in state S_t, a data transmission route A_t according to an existing protocol, obtains the network state S_{t+1} and link utilization R_{t+1} after data transmission, and stores the resulting four-tuple <S_t, A_t, R_{t+1}, S_{t+1}> in the experience replay pool as an expert sample;
(3) A batch of samples, denoted the mini-batch, is randomly drawn from the experience pool;
(4) S_t and S_{t+1} from the mini-batch are passed as input to the Critic network, which produces V(S_t) and V(S_{t+1}), respectively representing estimates of the value of the two states S_t and S_{t+1};
(5) The TD error is computed from V(S_t), V(S_{t+1}), and R_{t+1} and is used to update the parameters of the Actor network and the Critic network, as sketched after this list;
(6) When the Actor-Critic model converges, model training is finished and the trained Agent is used for optimal route estimation; otherwise, return to step (1).
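The following is a minimal sketch of steps (3) through (5): drawing a mini-batch from the experience pool, letting the Critic estimate V(S_t) and V(S_{t+1}), and using the TD error to update both networks. The network sizes, the discount factor, the optimizer settings, and the sample layout are assumptions; only the TD-error-based update itself follows the steps above, and the filter layer is omitted here for brevity.

```python
# Minimal sketch of steps (3)-(5): TD-error-based Actor-Critic update on a
# mini-batch drawn from the experience replay pool. Sizes, discount factor and
# optimizers are assumptions; only the TD-error update follows the method above.
import random
import torch
import torch.nn as nn

GAMMA = 0.99  # discount factor (assumption; not specified in the patent)

# Toy sizes: 16 state features, 8 candidate routes (assumptions).
actor = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))
critic = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def train_step(experience_pool, batch_size=32):
    # experience_pool: list of samples (S_t tensor, A_t int, R_{t+1} float, S_{t+1} tensor).
    # Step (3): randomly draw a mini-batch.
    batch = random.sample(experience_pool, batch_size)
    s_t = torch.stack([b[0] for b in batch])
    a_t = torch.tensor([b[1] for b in batch])
    r_t1 = torch.tensor([b[2] for b in batch])
    s_t1 = torch.stack([b[3] for b in batch])

    # Step (4): Critic estimates the value of S_t and S_{t+1}.
    v_t = critic(s_t).squeeze(-1)
    v_t1 = critic(s_t1).squeeze(-1).detach()

    # Step (5): TD error drives both updates.
    td_error = r_t1 + GAMMA * v_t1 - v_t
    critic_loss = td_error.pow(2).mean()                    # Critic regression loss

    log_probs = torch.log_softmax(actor(s_t), dim=-1)       # filter layer omitted in this sketch
    chosen = log_probs.gather(1, a_t.unsqueeze(1)).squeeze(1)
    actor_loss = -(chosen * td_error.detach()).mean()       # policy gradient weighted by TD error

    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```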
Advantageous effects
The invention provides a route planning algorithm that comprehensively considers switch forwarding efficiency. As the number of matching entries in the flow table increases, the time for switch matching and forwarding increases, so forwarding performance becomes one of the important factors to consider in routing planning. Because more factors are considered, more reasonable data transmission routes can be generated.
The invention designs an expert sample generation module that uses existing mature protocols to generate expert samples and places them in the experience pool, which accelerates the initial training of the model and guides the direction in which the model parameters are adjusted. Because deep reinforcement learning explores the state space, it can ultimately yield optimal routes that exceed the performance of the expert samples.
The invention designs a filter layer that uses a binary vector to represent the availability of network links. When a network node or link fails, adjusting the values in the filter layer allows the model to output an optimal routing scheme under the new link availability conditions without changing the model structure, while continuing to use the knowledge obtained during training.
In summary, the invention takes device forwarding efficiency into account and outputs a more reasonable optimal route, uses expert samples to guide and accelerate model training, and enhances the robustness of the network.
Drawings
Fig. 1 is a schematic diagram of a deep reinforcement learning-based routing optimization architecture under an SDN architecture;
FIG. 2 is a schematic diagram of the working process of the Agent according to the present invention;
FIG. 3 is a flow chart of the method of the present invention.
Detailed Description
A routing optimization architecture based on deep reinforcement learning under an SDN architecture is shown in Fig. 1. The Controller acquires network state information from the data plane and, on one hand, passes the network link and switch data to the Agent. On the other hand, in several parallel virtual network environments that share the same network topology, parameters, and network state as the one the Agent runs in, protocols such as OSPF and Load Balance are used to generate each protocol's transmission route under the current network state, producing expert samples, which are placed into the experience replay pool together with the training samples generated by the Agent's interaction with the environment. All samples share the same structure, the four-tuple <current network state, output route, feedback reward, next network state>, written as <S_t, A_t, R_{t+1}, S_{t+1}>; the expert samples contain different mappings between network states and output routes. During model training, samples are drawn from the experience pool by random sampling to compute the loss function and update the Agent's model parameters. If a training sample generated by the Agent is sampled, the Agent's model parameters are updated through the normal algorithm flow; if an expert sample generated by another routing protocol is sampled, the routing knowledge that protocol produces under different network states guides the training direction of the model, so the agent can integrate the optimization targets of different strategies and the initial training of the model is accelerated. Because the DRL algorithm can explore the state space, it has a mechanism for trying actions beyond those in the expert samples and can ultimately exceed the performance of the expert samples.
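The shared experience replay pool described above can be illustrated with a short sketch: expert samples produced by an existing protocol in the mirrored virtual network and agent-generated samples share one <S_t, A_t, R_{t+1}, S_{t+1}> structure and one pool, and random sampling treats them identically. The class and method names are assumptions for illustration only.

```python
# Sketch of the shared experience replay pool described above. Expert samples
# (from OSPF / Load Balance in the virtual twin network) and agent samples share
# one four-tuple structure and one buffer. Names are assumptions.
import random
from collections import deque
from typing import NamedTuple, Sequence

class Transition(NamedTuple):
    state: Sequence[float]        # S_t    : current network state
    action: int                   # A_t    : chosen transmission route
    reward: float                 # R_{t+1}: feedback reward
    next_state: Sequence[float]   # S_{t+1}: network state after transmission

class ExperiencePool:
    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)

    def add_agent_sample(self, t: Transition) -> None:
        self.buffer.append(t)     # sample from the Agent's own interaction

    def add_expert_sample(self, t: Transition) -> None:
        self.buffer.append(t)     # sample from an existing protocol in the virtual network

    def sample(self, batch_size: int):
        # Random sampling: expert and self-generated samples are drawn alike,
        # so expert knowledge guides the early training direction.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```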
Switches with different performance parameters have different forwarding efficiencies, and in SDN the switches implement rule-based data forwarding through flow table matching. Growth in the number of flow table entries strongly affects matching speed, and in turn forwarding efficiency. Meanwhile, even data streams with the same source and destination addresses can require very different network resources depending on the type of data transmitted; for example, pictures with very small bandwidth requirements and video streams with much larger bandwidth requirements may share the same source and destination addresses, port ID, and transport protocol. Because forwarding performance influenced by so many factors is hard to express accurately with a polynomial, an SDN switch performance estimation module is designed. The module evaluates the forwarding performance of the switch with an artificial neural network, using collected switch state information such as throughput, CPU, memory, forwarding delay, and number of flow table entries; the switch data is the forwarding performance estimated by this module from the switch state, where the switch state includes throughput, CPU, memory, forwarding delay, packet forwarding rate, number of flow table entries, total number of matching entries, and current flow type. The switch performance estimation module is realized by a 3-layer neural network: the first layer comprises 8 neurons and receives the input vector, the second layer is a fully connected layer of 10 neurons with ReLU activation, and the last layer comprises 1 neuron that outputs the estimate of switch performance. The Agent is trained in combination with the current network state information, such as transmission delay, jitter, and packet loss rate, finally generating a fine-grained optimal data forwarding route that takes forwarding device efficiency into account. The structure of the routing optimization algorithm model based on deep reinforcement learning under the SDN architecture designed by the invention is shown in Fig. 2.
In the invention, a filter layer is designed. It sits between the output of the Actor artificial neural network and the softmax layer; its values are tied to the network link availability module, and the availability of links is represented by a binary vector. When a link becomes unavailable, for example because of a node failure or insufficient available bandwidth, the position representing that link's availability is set to 0. The output of the Actor module is multiplied element-wise by the filter layer, so that for a sudden fault in the network the optimal data transmission route in the new network state can be generated quickly without changing the model structure.
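The filter layer lends itself to a short sketch: a binary link-availability vector applied to the Actor output before the softmax. The patent describes an element-wise (bit-by-bit) multiplication; the sketch below additionally pushes masked entries to a large negative value so that unavailable links receive essentially zero probability after the softmax, which is a common numerical realization of the same intent. Names, shapes, and the extra masking step are assumptions.

```python
# Sketch of the filter layer: a binary link-availability vector applied to the
# Actor output before softmax. The patent describes element-wise multiplication;
# this sketch also forces masked entries to a large negative value so that
# unavailable links receive near-zero probability. Names/shapes are assumptions.
import torch

def apply_filter_layer(actor_output: torch.Tensor, link_available: torch.Tensor) -> torch.Tensor:
    """actor_output  : raw Actor scores, one per candidate link/route.
    link_available: binary vector, 1 = link usable, 0 = link failed or unavailable."""
    masked = actor_output * link_available                      # bit-by-bit multiplication
    masked = masked.masked_fill(link_available == 0, -1e9)      # ~0 probability when unavailable
    return torch.softmax(masked, dim=-1)                        # probability distribution over routes

# Example: link 2 has failed, so its position in the filter layer is set to 0
# and it receives essentially zero probability; the model structure is unchanged.
actor_output = torch.tensor([1.2, 0.4, 2.0, 0.7])
link_available = torch.tensor([1.0, 1.0, 0.0, 1.0])
route_probs = apply_filter_layer(actor_output, link_available)
```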
The process of the method of the invention is shown in figure 3:
The routing algorithm designed by the invention is deployed in the controller.
1. The switch receives a service data stream and queries its flow table; if a forwarding rule is matched, the data is forwarded according to that rule and the generated four-tuple <current network state, output route, feedback reward, next network state> is added to the experience pool; if no rule is matched, the method continues with the next step;
2. The method splits into two parallel branches.
Branch one comprises the following steps in sequence:
the switch sends a flow rule request to the controller;
the controller acquires the current network link state and the switch state information;
the controller passes throughput, CPU, memory, forwarding delay, packet forwarding rate, number of flow table entries, total number of matching entries, and current flow type as input to the switch performance estimation module, which outputs the corresponding estimate of switch performance;
the controller takes the current network link state and the switch performance estimate as input, denoted S_t, and passes S_t to the improved Actor-Critic model; the Actor network outputs an estimate for all transmission links;
the value of the filter layer is determined according to the current network condition;
the vector output by the Actor passes through the filter layer and then the softmax layer to produce a probability distribution over actions, i.e., the transmission path of the current data stream is determined and denoted A_t;
the controller transmits the data according to A_t, then obtains the network state information, denoted S_{t+1}, and computes the link utilization, denoted R_{t+1}; the four-tuple <S_t, A_t, R_{t+1}, S_{t+1}> is stored in the experience replay pool as a training sample. Branch two comprises the following steps in sequence:
the controller in the virtual network with the same parameters and state generates, in state S_t, a data transmission route A_t according to an existing protocol, obtains the network state S_{t+1} and link utilization R_{t+1} after data transmission, and stores the resulting four-tuple <S_t, A_t, R_{t+1}, S_{t+1}> in the experience replay pool as an expert sample;
3. A batch of samples, denoted the mini-batch, is randomly drawn from the experience pool;
4. S_t and S_{t+1} from the mini-batch are passed as input to the Critic network, which produces V(S_t) and V(S_{t+1}), respectively representing estimates of the value of the two states S_t and S_{t+1};
5. The TD error is computed from V(S_t), V(S_{t+1}), and R_{t+1} and is used to update the parameters of the Actor network and the Critic network;
6. When the Actor-Critic model converges, model training is finished and the trained Agent is used for optimal route estimation; otherwise, return to step 1.
In summary, the algorithm designed by the invention first accounts for how the different performance levels and different numbers of matching entries of switches in the SDN affect forwarding efficiency, estimating switch efficiency with a deep model and taking that efficiency as a factor in route planning; an expert sample generation module is designed to guide Actor-Critic model training; and node or link failures in the network are handled by generating a new optimal route without changing the model structure or retraining. With this method, more reasonable data transmission routes that take switch forwarding efficiency into account can be generated in the SDN, while training efficiency is improved and network robustness is enhanced.
The invention provides a routing optimization framework and method based on deep reinforcement learning under an SDN architecture, in which an agent is constructed to generate forwarding routes according to its current policy under different network states, the quality of each executed data forwarding route is evaluated from the feedback, and the policy is adjusted to increase the probability of routes that obtain a larger cumulative reward. After training, the agent can quickly generate a transmission path for the current network state and has a certain generalization capability.
The invention considers the influence of device forwarding efficiency on transmission speed when calculating the optimal route. As the number of matching entries in the SDN increases, the switch's flow table matching time increases correspondingly, and switch performance also affects the efficiency of rule matching and forwarding. The method estimates switch forwarding efficiency with a deep model, representing it more reasonably, and outputs a more reasonable optimal route by using the estimate as an internal variable of the overall model in route calculation.
The invention designs expert samples. Because the initial training efficiency of a DRL algorithm is low, mature routing protocols are used to generate expert samples, which are placed into the experience replay pool together with the training samples generated by the agent, and a number of samples are drawn from the pool for model training. The expert samples guide the direction of Agent training and improve training efficiency in the early stage. Since the DRL algorithm explores unknown state space, it can ultimately find results that exceed the performance of the expert samples.
The invention designs a filter layer after the output of the Actor network. It is a binary vector that represents the availability of the current network links, and its values are closely tied to the current network state. When a network node or link becomes unavailable because of a fault or similar event, adjusting the value at the position representing that link in the filter layer ensures that all routes output by the Actor remain available. The method can therefore handle network faults without adjusting the model structure or retraining, and directly obtains the optimal route in the new network state.

Claims (3)

1. A routing optimization framework based on deep reinforcement learning under an SDN framework is characterized in that:
in the control plane, the Controller acquires network state information from the data plane; on one hand, it passes the network link state and switch data to the Agent, and on the other hand, in a parallel virtual network environment that shares the same network topology, parameters, and network state as the one the Agent runs in, an existing protocol is used to generate that protocol's transmission route under the current network state, producing expert samples, which are placed into the experience pool together with the training samples generated by the Agent's interaction with the network environment; all samples share the same structure, the four-tuple <current network state, output route, feedback reward, next network state>, written as <S_t, A_t, R_{t+1}, S_{t+1}>; the reward function is defined as R = -(max{U} + D_{s,d}), where U is the vector of utilizations of every link in the current network environment and D_{s,d} is the delay of the data flow from source node s to destination node d; the goal of the algorithm is to maximize the reward, i.e., to minimize the maximum link utilization and the end-to-end transmission delay in the current network.
2. The deep reinforcement learning-based routing optimization architecture under the SDN architecture of claim 1, wherein: the Agent adopts an improved Actor-Critic algorithm; the Actor-Critic module comprises an Actor network, a Softmax layer, a network link availability analysis module, and a Critic network; a filter layer is added between the Actor and the Softmax layer in the Actor-Critic algorithm; the filter layer is a binary vector whose values are tied to network link availability, and the output of the Actor module is multiplied element-wise (bit by bit) by the filter layer, so that when a link in the network is unavailable, the position representing that link's availability is 0, otherwise it is 1; the input of the Agent comprises the network link state and switch data; the network link state includes transmission delay, jitter, packet loss rate, bandwidth, and number of flow types; the switch data is the forwarding performance estimated by the switch performance estimation module from the switch state, where the switch state includes throughput, CPU, memory, forwarding delay, packet forwarding rate, number of flow table entries, total number of matching entries, and current flow type; the switch performance estimation module is realized by a 3-layer neural network: the first layer comprises 8 neurons and receives the input vector, the second layer is a fully connected layer of 10 neurons with ReLU activation, and the last layer comprises 1 neuron that outputs the estimate of switch performance.
3. A routing optimization method based on deep reinforcement learning under the SDN architecture of claim 1, characterized by comprising the following steps:
(1) The switch receives the service data flow and queries its flow table; if a forwarding rule is matched, the data is forwarded according to that rule and the generated four-tuple <current network state, output route, feedback reward, next network state> is added to the experience pool; if no rule is matched, the method continues with the next step;
(2) The method splits into two parallel branches.
Branch one comprises the following steps in sequence:
the switch sends a flow rule request to the controller;
the controller acquires the current network link state and the switch state information;
the controller passes throughput, CPU, memory, forwarding delay, packet forwarding rate, number of flow table entries, total number of matching entries, and current flow type as input to the switch performance estimation module, which outputs the corresponding estimate of switch performance;
the controller takes the current network link state and the switch performance estimate as input, denoted S_t, and passes S_t to the improved Actor-Critic model; the Actor network outputs an estimate for all transmission links;
the value of the filter layer is determined according to the current network condition;
the vector output by the Actor passes through the filter layer and then the Softmax layer to produce a probability distribution over actions, i.e., the transmission path of the current data stream is determined and denoted A_t;
the controller transmits the data according to A_t, then obtains the network state information, denoted S_{t+1}, and computes the reward value according to the reward function, denoted R_{t+1}; the four-tuple <S_t, A_t, R_{t+1}, S_{t+1}> is stored in the experience replay pool as a training sample;
Branch two comprises the following steps in sequence:
the controller in the virtual network with the same parameters and state generates, in state S_t, a data transmission route A_t according to an existing protocol, obtains the network state S_{t+1} and link utilization R_{t+1} after data transmission, and stores the resulting four-tuple <S_t, A_t, R_{t+1}, S_{t+1}> in the experience replay pool as an expert sample;
(3) A batch of samples, denoted the mini-batch, is randomly drawn from the experience pool;
(4) S_t and S_{t+1} from the mini-batch are passed as input to the Critic network, which produces V(S_t) and V(S_{t+1}), respectively representing estimates of the value of the two states S_t and S_{t+1};
(5) The TD error is computed from V(S_t), V(S_{t+1}), and R_{t+1} and is used to update the parameters of the Actor and Critic networks;
(6) When the Actor-Critic model converges, model training is finished and the trained Agent is used for optimal route estimation; otherwise, return to step (1).
CN202110663396.8A 2021-06-15 2021-06-15 Deep reinforcement learning-based route optimization framework and method under SDN framework Active CN113395207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110663396.8A CN113395207B (en) 2021-06-15 2021-06-15 Deep reinforcement learning-based route optimization framework and method under SDN framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110663396.8A CN113395207B (en) 2021-06-15 2021-06-15 Deep reinforcement learning-based route optimization framework and method under SDN framework

Publications (2)

Publication Number Publication Date
CN113395207A CN113395207A (en) 2021-09-14
CN113395207B (en) 2022-12-23

Family

ID=77621281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110663396.8A Active CN113395207B (en) 2021-06-15 2021-06-15 Deep reinforcement learning-based route optimization framework and method under SDN framework

Country Status (1)

Country Link
CN (1) CN113395207B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114374638B (en) * 2022-01-10 2024-02-27 之江实验室 Collaborative routing method and device of cross-domain system
CN115941579B (en) * 2022-11-10 2024-04-26 北京工业大学 Mixed routing method based on deep reinforcement learning
CN116527567B (en) * 2023-06-30 2023-09-12 南京信息工程大学 Intelligent network path optimization method and system based on deep reinforcement learning
CN116996397B (en) * 2023-09-27 2024-01-09 之江实验室 Network packet loss optimization method and device, storage medium and electronic equipment
CN117294640B (en) * 2023-10-13 2024-05-24 北京亿美芯科技有限公司 Vehicle-mounted opportunity routing node selection method and system based on PPO algorithm

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108075974A (en) * 2016-11-14 2018-05-25 中国移动通信有限公司研究院 A kind of flow transmission control method, device and SDN architecture systems

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9679258B2 (en) * 2013-10-08 2017-06-13 Google Inc. Methods and apparatus for reinforcement learning
US10375585B2 (en) * 2017-07-06 2019-08-06 Futurwei Technologies, Inc. System and method for deep learning and wireless network optimization using deep learning
CN110580196B (en) * 2019-09-12 2021-04-06 北京邮电大学 Multi-task reinforcement learning method for realizing parallel task scheduling

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108075974A (en) * 2016-11-14 2018-05-25 中国移动通信有限公司研究院 A kind of flow transmission control method, device and SDN architecture systems

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Dynamic Routing Algorithm Based on Deep Reinforcement Learning; 肖扬 et al.; 《信息通信技术与政策》 (Information and Communications Technology and Policy); 2020-09-15 (Issue 09); full text *

Also Published As

Publication number Publication date
CN113395207A (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN113395207B (en) Deep reinforcement learning-based route optimization framework and method under SDN framework
CN110012516B (en) Low-orbit satellite routing strategy method based on deep reinforcement learning architecture
CN108900358B (en) Virtual network function dynamic migration method based on deep belief network resource demand prediction
CN113328938B (en) Network autonomous intelligent management and control method based on deep reinforcement learning
Sim et al. Ant colony optimization for routing and load-balancing: survey and new directions
CN112437020A (en) Data center network load balancing method based on deep reinforcement learning
CN116527567B (en) Intelligent network path optimization method and system based on deep reinforcement learning
CN114697229B (en) Construction method and application of distributed routing planning model
CN109413707B (en) Intelligent routing method based on deep reinforcement learning technology in wireless network environment
WO2019026684A1 (en) Route control method and route setting device
Lei et al. Congestion control in SDN-based networks via multi-task deep reinforcement learning
CN111988225A (en) Multi-path routing method based on reinforcement learning and transfer learning
CN111917642B (en) SDN intelligent routing data transmission method for distributed deep reinforcement learning
Sun et al. RNN deep reinforcement learning for routing optimization
WO2024037136A1 (en) Graph structure feature-based routing optimization method and system
CN115396366A (en) Distributed intelligent routing method based on graph attention network
CN113612692A (en) Centralized optical on-chip network self-adaptive route planning method based on DQN algorithm
Wang et al. GRouting: dynamic routing for LEO satellite networks with graph-based deep reinforcement learning
Liu et al. BULB: lightweight and automated load balancing for fast datacenter networks
Bhavanasi et al. Dealing with changes: Resilient routing via graph neural networks and multi-agent deep reinforcement learning
CN114051272A (en) Intelligent routing method for dynamic topological network
CN116963225A (en) Wireless mesh network routing method for streaming media transmission
CN113992595B (en) SDN data center congestion control method based on priority experience playback DQN
Chen et al. Traffic engineering based on deep reinforcement learning in hybrid IP/SR network
Wang et al. Research on deep reinforcement learning multi-path routing planning in SDN

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant