CN110986979A - SDN multi-path routing planning method based on reinforcement learning - Google Patents
- Publication number
- CN110986979A (application CN201911183909.4A; granted as CN110986979B)
- Authority
- CN
- China
- Prior art keywords
- reinforcement learning
- path
- routing
- flow
- data packet
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/26—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
- G01C21/34—Route searching; Route guidance
- G01C21/3446—Details of route searching algorithms, e.g. Dijkstra, A*, arc-flags, using precalculated routes
Abstract
The invention discloses an SDN multi-path route planning method based on reinforcement learning, comprising the following steps: applying reinforcement learning to SDN multi-path route planning, using the Q-Learning algorithm as the reinforcement learning model, and generating different reward values according to the different QoS (Quality of Service) levels of flows; setting different reward functions for flows of different QoS levels according to the input network topology matrix and the feature matrix of the flow currently to be forwarded, and planning multiple paths to forward the flows; and, when link bandwidth is insufficient, dividing a larger flow into several smaller flows, thereby improving link bandwidth utilization. By exploiting reinforcement learning's continuous interaction with the environment and policy adjustment, the invention achieves high link utilization and effectively reduces network congestion compared with traditional single-path route planning.
Description
Technical Field
The invention relates to the field of network communication technology and reinforcement learning, in particular to an SDN multi-path route planning method based on reinforcement learning.
Background
In recent years, with the popularization of the internet, and particularly with the emergence of related technologies such as cloud computing and big data, the internet has entered a period of rapid development. This rapid development has caused the data volume of network transmission services to grow quickly; in particular, with the rise of short-video and live-streaming platforms, network services have become more interactive and real-time, and end users place higher demands on the Quality of Service (QoS) of network services. However, with limited network resources, the continuous growth of internet traffic leads to problems such as sharply increasing bandwidth consumption, difficulty in guaranteeing quality of service, and growing security concerns. Clearly, the traditional network architecture struggles to meet users' diversified requirements. In view of the foregoing, the internet industry needs a new network architecture that addresses existing network problems and is more flexible and efficient than conventional architectures, so as to meet society's ever-growing traffic demands.
SDN is a novel network architecture that has attracted wide attention and solves several problems that are unavoidable in traditional networks. In the traditional architecture, each device independently makes forwarding decisions and exchanges information through a series of network protocols (such as TCP/IP). Under this architecture, control and forwarding are tightly coupled in the network device: each device plans paths for traffic centered on itself, without global network resource information, which easily leads to problems such as link congestion. SDN separates forwarding from control and can obtain link information in real time through the OpenFlow protocol. This favors centralized control of the network: the control layer obtains global resource information and performs unified management and allocation according to service demands; at the same time, centralized control lets the whole network be treated as a single entity, which simplifies maintenance. Compared with a traditional IP network, an SDN network avoids problems such as inaccurate routing information and low routing efficiency, and lays a foundation for intelligent route planning tailored to the requirements of different flows. Research on the SDN architecture is therefore of great significance. Routing is an indispensable component of both traditional and SDN networks; however, the algorithm currently adopted by mainstream SDN routing modules is the Dijkstra (shortest path) algorithm. If all data packets rely only on the shortest-path algorithm, data flows tend to select the same links, causing link congestion while leaving other links idle, which greatly reduces link utilization.
On the other hand, the shortest-path algorithm is a graph-theoretic algorithm that, when run, actually computes the shortest paths from the source node to all other nodes in the topology, so its time complexity is high. Protocols that support multipath forwarding, such as ECMP, do exist, but they do not take the quality-of-service requirements of different traffic flows into account. A better routing strategy is therefore needed in SDN networks to generate routes, improve network performance, and guarantee the service quality of different service flows.
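The single-path behaviour criticized above can be illustrated with a minimal Dijkstra sketch. The 4-switch topology and weights below are hypothetical, not from the patent; the point is that every flow between the same pair of nodes picks the same shortest path, leaving alternatives idle:

```python
import heapq

def dijkstra(adj, src):
    """Shortest distances from src to every node.
    adj: {node: [(neighbor, weight), ...]} adjacency list."""
    dist = {n: float("inf") for n in adj}
    dist[src] = 0
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# Hypothetical topology: s1->s2->s4 (cost 2) always wins over s1->s3->s4
# (cost 3), so all flows share one path -- the congestion problem above.
adj = {
    "s1": [("s2", 1), ("s3", 1)],
    "s2": [("s1", 1), ("s4", 1)],
    "s3": [("s1", 1), ("s4", 2)],
    "s4": [("s2", 1), ("s3", 2)],
}
print(dijkstra(adj, "s1"))
```

Note that, as the text observes, the algorithm computes distances to *all* nodes even when only one destination is needed.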
Disclosure of Invention
The current SDN network mainly adopts the shortest path as its route-planning algorithm, which leads to problems such as low link bandwidth utilization. To address this, the invention provides an SDN intelligent route planning technique oriented toward high bandwidth utilization.
The technical scheme adopted by the invention for solving the technical problem is as follows: an SDN multi-path route planning method based on reinforcement learning comprises the following steps:
step 1: acquiring available bandwidth information, total bandwidth information, node information and link information of a network to construct a network topology matrix, and acquiring a characteristic matrix of a stream to be forwarded;
step 2: the Q-Learning algorithm is adopted as the reinforcement learning model, and the network topology matrix and the feature matrix of the flow to be forwarded from step 1 are input into the reinforcement learning model to train the Q-value table; the reward function R in the Q-Learning algorithm is:

R_t(S_i, A_j) = β·η − β·g(x) + δ(j − d), if T[S_i][A_j] = 1; −1, otherwise

where R_t(S_i, A_j) denotes the reward obtained when a data packet in state S_i selects action A_j; in the route-planning task it is the reward generated when the next hop selected by the data packet at node i is node j. β is the QoS level of the flow, η is the bandwidth utilization rate, d is the destination node, and δ(j − d) is an impulse function whose value is 1 when the next hop of the data packet is the destination node. T is the connection state of the network topology nodes: 1 when two nodes are connected and 0 when they are not. g(x) is a cost function, described below, where l_m is the total number of links of the network topology and x is the number of hops the data packet has traversed during forwarding;
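The terms defined above (β, η, δ(j − d), T, and the cost g(x)) can be assembled into a small sketch of the reward. Both the way the terms are combined and the concrete form of g(x) are assumptions here, since the original formulas are given only as images:

```python
import math

def reward(i, j, T, beta, eta, dest, l_m, hops):
    """Sketch of R_t(S_i, A_j): combines QoS level beta, bandwidth
    utilization eta, an impulse delta(j - d) at the destination, and the
    hop-count cost g(x); returns -1 for an unconnected next hop."""
    if T[i][j] != 1:
        return -1.0
    g = 1.0 - math.exp(-3.0 * hops / l_m)  # illustrative g(x) in (0, 1)
    delta = 1.0 if j == dest else 0.0      # impulse at the destination
    return beta * eta - beta * g + delta

# Two connected nodes; node 1 is the destination of this packet.
T = [[0, 1], [1, 0]]
print(reward(0, 1, T, beta=2, eta=0.5, dest=1, l_m=12, hops=1))
```

With these assumed constants, a connected hop into the destination yields a positive reward, while an unconnected choice is penalized with −1 as the text specifies.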
step 3: a path routing is obtained according to the Q-value table and added to the path set Routing(S, D), and it is judged whether the minimum link bandwidth of the routing is smaller than the bandwidth of the flow. If so, a sub-flow of size B_avail · β / Σ_i β_i is divided from the flow, where B_avail denotes the minimum available link bandwidth of the current output path, β denotes the QoS level of the current flow, and Σ_i β_i denotes the overall QoS level. The divided sub-flow is forwarded from the source node to the destination node through the current output path, and the remaining traffic is treated as a new flow, returning to step 2 to retrain the Q-value table. If not, planning is finished, and the planned multi-path routes are obtained from Routing(S, D).
Further, the traffic characteristic matrix to be forwarded includes a source address, a destination address, a QoS class, and a traffic size of the flow.
Further, the process of training the Q-value table by the Q-Learning algorithm is specifically as follows. Set the maximum number of steps for a single training round, then:
(1) initialize the Q-value table and the reward function R;
(2) select an action a using the ε-greedy action strategy P;
(3) execute action a, transition to state s', calculate the reward value using the reward function R, and update the Q-value table;
(4) judge whether s' is the destination node; if not, let s = s' and return to (2).
Further, in the SDN multipath route planning method based on reinforcement learning, the cost function is defined such that the cost increases with the number of hops x traversed by the forwarded data packet, with g(x) ∈ (0, 1), and the cost function should satisfy: the curve of g(x) is concave (convex upward), and as the total hop count of the data packet tends to infinity, the cost function value tends to 1.
The beneficial effects of the method are as follows: reinforcement learning is applied to SDN multi-path route planning, with the Q-Learning algorithm as the reinforcement learning model; different reward functions are set for flows of different QoS levels according to the input network topology matrix and the feature matrix of the flow currently to be forwarded; multiple paths are planned to forward the flows; and when link bandwidth is insufficient, a larger flow is divided into several smaller flows, improving link bandwidth utilization.
Drawings
Fig. 1 is an SDN multi-path route planning architecture diagram;
Fig. 2 is an SDN network topology diagram;
Fig. 3 is a graph of the cost function;
Fig. 4 is a flow chart of the reinforcement learning-based multi-path route planning.
Detailed Description
Given that existing SDN controllers adopt the Dijkstra algorithm as the shortest-route search algorithm, this method attempts to apply reinforcement learning to SDN routing. Exploiting SDN's separation of forwarding and control, the network topology environment is used directly for training the Q-value table. Considering that different services have different QoS requirements, the invention provides routes of different service quality for different services; and when link bandwidth is insufficient, a larger flow is divided into several smaller flows, improving link bandwidth utilization.
As shown in fig. 1, the present invention provides a reinforcement learning based SDN multi-path route planning method, which includes the following steps:
Step 1: acquire the available bandwidth information, total bandwidth information, node information, and link information of the network to construct a network topology matrix. A network topology as shown in Fig. 2 is constructed with Mininet, comprising 9 OpenFlow switches and 5 hosts. The feature matrix of the flow to be forwarded is acquired. According to the multi-path route planning algorithm and the SDN topology, the bandwidth of each network link is set to 200; hosts h1–h5 act as both senders and receivers, and each sender sends data to each of the other receivers with a probability of 20%. All hosts together send 30 static flows, where a static flow is a flow that, once injected into the network, occupies link bandwidth until the experiment ends.
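Step 1 can be sketched as follows. The link list below is hypothetical (the real adjacency is defined by Fig. 2, which is not reproduced here), and the flow record mirrors the feature-matrix fields named later (source address, destination address, QoS level, traffic size):

```python
N = 9  # OpenFlow switches s1..s9 (indices 0..8)
# Hypothetical links standing in for the Fig. 2 topology.
links = [(0, 1), (0, 3), (1, 2), (1, 4), (2, 5), (3, 4),
         (4, 5), (3, 6), (4, 7), (5, 8), (6, 7), (7, 8)]

T = [[0] * N for _ in range(N)]        # node connection-state matrix
B_total = [[0.0] * N for _ in range(N)]  # total bandwidth per link
for i, j in links:
    T[i][j] = T[j][i] = 1
    B_total[i][j] = B_total[j][i] = 200.0  # each link set to 200, as in the text
B_avail = [row[:] for row in B_total]  # available bandwidth, reduced as flows are placed

# One flow to be forwarded: source, destination, QoS level beta, size.
flow = {"src": 0, "dst": 8, "beta": 3, "size": 50.0}
print(sum(map(sum, T)) // 2)  # number of undirected links
```

In the full method these matrices, plus the flow record, form the input to the reinforcement learning model of step 2.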
Step 2: as shown in Fig. 1, the Q-Learning algorithm is used as the reinforcement learning model. The multi-path routing algorithm proposed by the invention is modeled as a Markov decision process; accordingly, the MDP quadruple of the proposed model is defined as follows:
(1) State set: in the network topology, each switch represents a state; thus, according to the network topology, the set of network states is defined as:

S = [s_1, s_2, s_3, …, s_9]

where s_1–s_9 represent the 9 OpenFlow switches in the network. The source node of a data packet indicates its initial state, and the destination node indicates its termination state. When a data packet reaches the destination node, it reaches the termination state. Once the current data packet reaches the termination state, one round of training terminates, and the data packet returns to the initial state for the next round of training.
(2) Action space: in an SDN network, the transmission path of a data packet is determined by the network state, i.e., a data packet can only be transmitted between connected network nodes. According to the network topology, the network connection state T is defined such that T[s_i][s_j] = 1 when nodes s_i and s_j are directly connected, and T[s_i][s_j] = 0 otherwise.
Since data packets can only be transmitted between connected network nodes, the action set of each state s_i ∈ S can be defined from the state set and the network connection state as:

A(s_i) = { s_j | T[s_i][s_j] = 1 }

That is, when the current state is s_i, the selectable actions correspond, on the network topology, to the nodes s_j directly connected to s_i; the current state s_i will only select a state s_j connected to it. For example, the action set of state s_1 is A(s_1) = {s_2, s_4}.
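The action set A(s_i) can be read straight off the connection matrix. A sketch, using a hypothetical 4-node matrix arranged so that s_1 connects to s_2 and s_4, matching the example above (0-based indices in code):

```python
def action_set(T, i):
    """A(s_i) = { s_j | T[i][j] == 1 }: next hops selectable from state s_i."""
    return [j for j, connected in enumerate(T[i]) if connected == 1]

# Hypothetical 4-node connection matrix: s1-s2, s2-s3, s3-s4, s4-s1.
T = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
print(action_set(T, 0))  # s1's neighbours: indices of s2 and s4
```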
(3) State transition: in each round of training, when a data packet in state s_i selects an action, it transitions to the corresponding next state. Another key issue in reinforcement learning is the generation of reward values: whenever the agent makes a state transition, the system feeds back a reward to the agent according to the reward function R.
(4) Reward: the final purpose of reinforcement learning-based multi-path route planning is to plan reasonable multiple paths through training, so the setting of the reward value R is also important. Bandwidth utilization and delay are mainly considered here, where delay mainly refers to the hop count of the path; to plan different paths for flows of different QoS levels, the larger the traffic level β, the smaller the hop count of the planned path should be. The design requirements are:
1. the QoS level β and link utilization η need to be considered;
2. flows with large β are encouraged to be allocated paths with fewer hops.
In summary, the reward function designed here is as follows:
wherein: rt(Si,Aj) Indicating slave status S of a data packetiSelection action AjThe obtained reward is represented in a routing planning task as the reward generated when the next hop selected by the data packet at the node i is the node j, β is the flow QoS grade, η is the bandwidth utilization rate, d is the destination node, delta (j-d) is an impulse function and represents that when the next hop of the data packet is the destination node, the value is 1, T is the connection state of the network topology nodes, the two nodes are 1 when connected and 0 when not connected, and the reward function represents that the data packet is in the state SiWhen the next hop that can be selected (connected) is j (action a)j) Then, from T [ S ]i][Aj]The bonus function when 1 yields a bonus value, otherwise the bonus value is set to-1.
g(x) is the cost function, defined so that the cost increases with the number of hops x the data packet has traversed, with g(x) ∈ (0, 1), where l_m is the total number of links in the network topology. Considering that when the network topology is large it is impractical to traverse all paths, and a data packet can only be forwarded through part of the links, the cost function should grow quickly in the early stage and level off in the later stage; if the total number of hops traversed by the data packet reaches l_m, the cost function value is maximized. The specific form of the cost function is given by the curve in Fig. 3.
As shown in Fig. 3, the cost function is an increasing function with range (0, 1); the cost increases with the hop count x, and the function grows rapidly in the early stage and levels off in the later stage, meeting the requirements of the cost function.
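Fig. 3 is not reproduced here; one concrete function satisfying the stated properties (increasing, concave, range (0, 1), tending to 1 as x grows) is g(x) = 1 − e^(−3x/l_m). This particular form is an illustrative assumption, not necessarily the patent's curve:

```python
import math

def g(x, l_m=12):
    """Illustrative cost function: concave, increasing, range (0, 1),
    approaching 1 as the hop count x grows large."""
    return 1.0 - math.exp(-3.0 * x / l_m)

# Verify the properties required by the text.
vals = [g(x) for x in range(1, 13)]
assert all(0 < v < 1 for v in vals)                      # range (0, 1)
assert all(b > a for a, b in zip(vals, vals[1:]))        # increasing
diffs = [b - a for a, b in zip(vals, vals[1:])]
assert all(d2 < d1 for d1, d2 in zip(diffs, diffs[1:]))  # concave: slowing growth
print(round(g(12), 3))
```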
The second requirement of the reward function design is to encourage flows with large β to be allocated paths with a small hop count, so the cost function g(x) in the reward function is multiplied by the traffic QoS level β. Under the same conditions, the more hops a forwarded packet traverses, the larger the cost incurred by a flow of high QoS level, so paths with fewer hops will be selected for flows of high QoS level during planning.
After the MDP quadruple is determined, the Q-value table is trained using the Q-Learning algorithm, with the following specific steps. Set the maximum number of steps for a single training round, then:
(1) initialize the Q-value table and the reward function R;
(2) select an action a using the ε-greedy action strategy P;
(3) execute action a, transition to state s', calculate the reward value using the reward function R, and update the Q-value table;
(4) judge whether s' is the destination node; if not, let s = s' and return to step (2).
As shown in fig. 4, the learning rate α is set to 0.8, the discount rate γ is set to 0.6, and the value of the action strategy ∈ -greedy strategy ∈ is equal to 0.1.
And obtaining a path Routing according to the trained Q value table, putting the path Routing into a path set Routing (S, D), and judging whether the minimum link bandwidth of the Routing is smaller than the bandwidth of the stream. If so, a slice of size is divided from the streamOf (a) wherein BCan be usedRepresents the minimum link available bandwidth for the current output path, β represents the Qos level, Σ, of the current flowiβiIndicating the overall QoS class. And the small flow reaches the target node from the source destination node through the current output path. Use ofUpdating the flow, wherein B represents the size of the planned flow, namely returning the rest flows as new flows to step 2 for the training of the Q value table; if not, the planning is finished, and the planned multi-path route is obtained from the Routing (S, D).
The above-described embodiments are intended to illustrate rather than to limit the invention, and any modifications and variations of the present invention are within the spirit of the invention and the scope of the appended claims.
Claims (4)
1. An SDN multi-path route planning method based on reinforcement learning is characterized by comprising the following steps:
step 1: acquiring available bandwidth information, total bandwidth information, node information and link information of a network to construct a network topology matrix, and acquiring a characteristic matrix of a stream to be forwarded;
step 2: the Q-Learning algorithm is adopted as the reinforcement learning model, and the network topology matrix and the feature matrix of the flow to be forwarded from step 1 are input into the reinforcement learning model to train the Q-value table; the reward function R in the Q-Learning algorithm is:

R_t(S_i, A_j) = β·η − β·g(x) + δ(j − d), if T[S_i][A_j] = 1; −1, otherwise

where R_t(S_i, A_j) denotes the reward obtained when a data packet in state S_i selects action A_j; in the route-planning task it is the reward generated when the next hop selected by the data packet at node i is node j; β is the QoS level of the flow, η is the bandwidth utilization rate, d is the destination node, δ(j − d) is an impulse function whose value is 1 when the next-hop node of the data packet is the destination node, T is the connection state of the network topology nodes (1 when two nodes are connected and 0 when they are not), and g(x) is a cost function of the number of hops x the data packet has traversed during forwarding, where l_m is the total number of links of the network topology;
step 3: a path routing is obtained according to the Q-value table and added to the path set Routing(S, D), and it is judged whether the minimum link bandwidth of the routing is smaller than the bandwidth of the flow; if so, a sub-flow of size B_avail · β / Σ_i β_i is divided from the flow, where B_avail denotes the minimum available link bandwidth of the current output path, β denotes the QoS level of the current flow, and Σ_i β_i denotes the overall QoS level; the divided sub-flow is forwarded from the source node to the destination node through the current output path, and the remaining traffic is returned as a new flow to step 2 to retrain the Q-value table; if not, planning is finished, and the planned multi-path routes are obtained from Routing(S, D).
2. The reinforcement learning-based SDN multipath route planning method of claim 1, wherein the feature matrix of the flow to be forwarded represents the flow's information in matrix form, including its source address, destination address, QoS level, and traffic size.
3. The reinforcement learning-based SDN multipath route planning method of claim 1, wherein the Q-Learning algorithm trains the Q-value table through the following steps:
after setting the maximum number of steps for a single training round:
(1) initialize the Q-value table and the reward function R;
(2) select an action a using the ε-greedy action strategy P;
(3) execute action a, transition to state s', calculate the reward value using the reward function R, and update the Q-value table;
(4) judge whether s' is the destination node; if not, let s = s' and return to step (2).
4. The reinforcement learning-based SDN multi-path route planning method according to claim 3, wherein the cost function is defined such that the cost increases with the number of hops x traversed by the forwarded data packet, with g(x) ∈ (0, 1), and the cost function satisfies: the curve of g(x) is concave (convex upward), and as the total hop count of the data packet tends to infinity, the cost function value tends to 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911183909.4A CN110986979B (en) | 2019-11-27 | 2019-11-27 | SDN multi-path routing planning method based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110986979A true CN110986979A (en) | 2020-04-10 |
CN110986979B CN110986979B (en) | 2021-09-10 |
Family
ID=70087419
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911183909.4A Active CN110986979B (en) | 2019-11-27 | 2019-11-27 | SDN multi-path routing planning method based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110986979B (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111917657A (en) * | 2020-07-02 | 2020-11-10 | 北京邮电大学 | Method and device for determining flow transmission strategy |
CN112398733A (en) * | 2020-11-24 | 2021-02-23 | 新华三大数据技术有限公司 | Traffic scheduling forwarding method and device |
CN112671648A (en) * | 2020-12-22 | 2021-04-16 | 北京浪潮数据技术有限公司 | SDN data transmission method, SDN, device and medium |
CN112822109A (en) * | 2020-12-31 | 2021-05-18 | 上海缔安科技股份有限公司 | SDN core network QoS route optimization algorithm based on reinforcement learning |
CN113098771A (en) * | 2021-03-26 | 2021-07-09 | 哈尔滨工业大学 | Distributed self-adaptive QoS routing method based on Q learning |
CN113158543A (en) * | 2021-02-02 | 2021-07-23 | 浙江工商大学 | Intelligent prediction method for software defined network performance |
CN113347104A (en) * | 2021-05-31 | 2021-09-03 | 国网山东省电力公司青岛供电公司 | SDN-based routing method and system for power distribution Internet of things |
CN113347108A (en) * | 2021-05-20 | 2021-09-03 | 中国电子科技集团公司第七研究所 | SDN load balancing method and system based on Q-learning |
CN113489654A (en) * | 2021-07-06 | 2021-10-08 | 国网信息通信产业集团有限公司 | Routing method, routing device, electronic equipment and storage medium |
CN114124828A (en) * | 2022-01-27 | 2022-03-01 | 广东省新一代通信与网络创新研究院 | Machine learning method and device based on programmable switch |
CN114845359A (en) * | 2022-03-14 | 2022-08-02 | 中国人民解放军军事科学院战争研究院 | Multi-intelligent heterogeneous network selection method based on Nash Q-Learning |
WO2022257917A1 (en) * | 2021-06-11 | 2022-12-15 | 华为技术有限公司 | Path planning method and related device |
US11606265B2 (en) | 2021-01-29 | 2023-03-14 | World Wide Technology Holding Co., LLC | Network control in artificial intelligence-defined networking |
CN117033005A (en) * | 2023-10-07 | 2023-11-10 | 之江实验室 | Deadlock-free routing method and device, storage medium and electronic equipment |
CN115941579B (en) * | 2022-11-10 | 2024-04-26 | 北京工业大学 | Mixed routing method based on deep reinforcement learning |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104640168A (en) * | 2014-12-04 | 2015-05-20 | 北京理工大学 | Q-learning based vehicular ad hoc network routing method |
CN106713143A (en) * | 2016-12-06 | 2017-05-24 | 天津理工大学 | Adaptive reliable routing method for VANETs |
CN107104819A (en) * | 2017-03-23 | 2017-08-29 | 武汉邮电科学研究院 | Adaptive self-coordinating unified communications and communication means based on SDN |
US20180034922A1 (en) * | 2016-07-28 | 2018-02-01 | At&T Intellectual Property I, L.P. | Network configuration for software defined network via machine learning |
CN107911299A (en) * | 2017-10-24 | 2018-04-13 | 浙江工商大学 | A kind of route planning method based on depth Q study |
CN109005471A (en) * | 2018-08-07 | 2018-12-14 | 安徽大学 | Based on the extensible video stream method of multicasting of QoS Intellisense under SDN environment |
CN109361601A (en) * | 2018-10-31 | 2019-02-19 | 浙江工商大学 | A kind of SDN route planning method based on intensified learning |
CN109450794A (en) * | 2018-12-11 | 2019-03-08 | 上海云轴信息科技有限公司 | A kind of communication means and equipment based on SDN network |
CN109547340A (en) * | 2018-12-28 | 2019-03-29 | 西安电子科技大学 | SDN data center network jamming control method based on heavy-route |
US20190123974A1 (en) * | 2016-06-23 | 2019-04-25 | Huawei Technologies Co., Ltd. | Method for generating routing control action in software-defined network and related device |
CN109768940A (en) * | 2018-12-12 | 2019-05-17 | 北京邮电大学 | The flow allocation method and device of multi-service SDN network |
Non-Patent Citations (2)
Title |
---|
TRUONG THU HUONG 等: "A global multipath load-balanced routing algorithm Reinforcement Learning", 《2019 INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGY CONVERGENCE (ICTC)》 * |
金子晋 等: "SDN环境下基于QLearning算法的业务划分路由选路机制", 《网络与信息安全学报》 * |
Also Published As
Publication number | Publication date |
---|---|
CN110986979B (en) | 2021-09-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110986979B (en) | SDN multi-path routing planning method based on reinforcement learning | |
CN109361601B (en) | SDN route planning method based on reinforcement learning | |
CN106789648B (en) | Software defined network route decision method based on content storage and Network status | |
CN103210617B (en) | For reducing the method and system of message in network and computing cost | |
CN112822109B (en) | SDN core network QoS route optimization method based on reinforcement learning | |
CN112600759B (en) | Multipath traffic scheduling method and system based on deep reinforcement learning under Overlay network | |
CN107294852B (en) | Network routing method using topology dispersed short path set | |
CN101986628B (en) | Method for realizing multisource multicast traffic balance based on ant colony algorithm | |
CN104580165A (en) | Cooperative caching method in intelligence cooperative network | |
CN114143264A (en) | Traffic scheduling method based on reinforcement learning in SRv6 network | |
CN101674220B (en) | Forwarding history-based asynchronous rooting algorithm | |
Haeri et al. | A reinforcement learning-based algorithm for deflection routing in optical burst-switched networks | |
CN102427596A (en) | Routing method and scheduling method of node mobile network assisted by positioning information | |
CN109922161A (en) | Content distribution method, system, equipment and the medium of dynamic cloud content distributing network | |
JP4589978B2 (en) | Route setting method and route setting device | |
Li et al. | A data forwarding mechanism based on deep reinforcement learning for deterministic networks | |
CN115037669A (en) | Cross-domain data transmission method based on federal learning | |
CN114745322A (en) | Video stream routing method based on genetic algorithm in SDN environment | |
Wang et al. | Implementation of multipath network virtualization scheme with SDN and NFV | |
Iqbal | Cache-mab: A reinforcement learning-based hybrid caching scheme in named data networks | |
Liu et al. | Ppo-based reliable concurrent transmission control for telemedicine real-time services | |
CN109450809B (en) | Data center scheduling system and method | |
Dong et al. | Topology control mechanism based on link available probability in aeronautical ad hoc network | |
CN100442758C (en) | Multicast transfer route setting method, and multicast label switching method for implementing former method | |
Chunxin et al. | A hybrid scatter search algorithm for QoS multicast routing problem |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||