CN109361601B - SDN route planning method based on reinforcement learning - Google Patents

SDN route planning method based on reinforcement learning

Info

Publication number
CN109361601B
CN109361601B (application CN201811292342.XA)
Authority
CN
China
Prior art keywords
node
reinforcement learning
action
flow
state
Prior art date
Legal status
Active
Application number
CN201811292342.XA
Other languages
Chinese (zh)
Other versions
CN109361601A (en)
Inventor
李传煌
卢正勇
吴艳
唐豪
任云方
Current Assignee
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date
Filing date
Publication date
Application filed by Zhejiang Gongshang University
Priority to CN201811292342.XA
Publication of CN109361601A
Application granted
Publication of CN109361601B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 45/00 Routing or path finding of packets in data switching networks
    • H04L 45/02 Topology update or discovery
    • H04L 45/08 Learning-based routing, e.g. using neural networks or artificial intelligence
    • H04L 45/12 Shortest path evaluation
    • H04L 47/00 Traffic control in data switching networks
    • H04L 47/10 Flow control; Congestion control
    • H04L 47/12 Avoiding congestion; Recovering from congestion
    • H04L 47/24 Traffic characterised by specific attributes, e.g. priority or QoS

Abstract

The invention discloses an SDN route planning method based on reinforcement learning, comprising the following steps: in the SDN control plane, a reinforcement learning model capable of generating routes is constructed using Q-learning, and the reward function of the Q-learning algorithm is designed so that different reward values are produced for flows of different QoS levels; the current network topology matrix, the traffic characteristics, and the QoS level of each flow are input into the reinforcement learning model for training, thereby realizing flow-differentiated SDN route planning and finding the shortest forwarding path that satisfies each flow's QoS requirement. By exploiting the defining property of reinforcement learning, continuous interaction with the environment and adjustment of the policy, the method achieves higher link utilization than the Dijkstra algorithm commonly used in traditional route planning and can effectively reduce network congestion.

Description

SDN route planning method based on reinforcement learning
Technical Field
The invention relates to the fields of network communication technology and reinforcement learning, and in particular to an SDN route planning method based on reinforcement learning.
Background
Internet traffic continues to grow, bringing sharply increased bandwidth consumption, difficulty in guaranteeing quality of service, and mounting security problems. The Internet has become inseparable from every industry and is currently among the most promising sectors; however, with its popularization and the growth of Internet services, industries and individual users generate vast volumes of network traffic every day, such as file transfers, voice communication, and online games, and new application patterns and requirements keep emerging. The traditional network architecture cannot cope with such rapid development and faces problems including insufficient network address space, increasingly bloated equipment, and quality of service that is hard to guarantee.
The Software Defined Network (SDN) is an innovative network architecture proposed in 2007 by the Clean Slate research group of Stanford University, USA, with the stated aim of "reshaping the Internet". As a novel network architecture it offers a fresh technical approach to existing network problems; its core idea is to separate the network device control plane from the data plane by means of OpenFlow, thereby enabling flexible control of network resources.
SDN is a programmable network architecture whose control plane is separated from the data forwarding plane, so the routing algorithm of an SDN can be customized in software. When a flow arrives at a switch, the routing algorithm on the SDN control plane plans a route, a flow table is generated according to that route, and the SDN controller issues the flow table to the switch to complete packet forwarding.
Currently, mainstream SDN controllers such as POX and FloodLight all provide packet-forwarding modules, and these basically adopt the Dijkstra (shortest path) algorithm, which searches for a shortest path from the originating node to the destination node for each forwarding decision. However, relying solely on the shortest-path algorithm for all packets causes a serious problem: data flows tend to converge onto the same forwarding paths, which greatly reduces link utilization and easily causes network congestion. Multi-path protocols exist, but they do not take into account the quality of service (QoS) requirements of different traffic flows, and because they ignore the traffic status of the whole network they remain limited from a path-optimization point of view.
Disclosure of Invention
The invention provides an SDN route planning method based on reinforcement learning to overcome the defects of the Dijkstra algorithm. By exploiting reinforcement learning's continuous interaction with the environment and adjustment of the policy, the method achieves higher link utilization than the traditional Dijkstra algorithm and can effectively reduce network congestion.
The technical scheme adopted by the invention to solve the technical problem is as follows: an SDN route planning method based on reinforcement learning, comprising the following steps: in the SDN control plane, a model capable of generating routes is constructed using Q-learning, and the reward function of the Q-learning algorithm is designed so that different reward values are produced for flows of different QoS levels; the current network topology matrix, the traffic characteristics, and the QoS level of each flow are input into the reinforcement learning model for training, thereby realizing flow-differentiated SDN route planning and finding the shortest forwarding path that satisfies each flow's QoS requirement.
Further, the traffic characteristics are: the start point, end point, and size of the flow.
Further, the reinforcement learning model is constructed by the following method:
setting the maximum number of steps for a single training episode; selecting an action a according to the action strategy P, executing a, obtaining the next state s' and a reward value r, and updating Q(s, a) according to the quality update function; repeating these operations until the end point is reached.
Further, the function required by the reinforcement learning model is constructed by the following method:
(1) An action a is selected according to equation (1); the action strategy is the ε-greedy strategy:

$$\pi(a \mid s) = \begin{cases} 1-\varepsilon+\dfrac{\varepsilon}{|A(s)|}, & a = \arg\max_{a' \in A(s)} Q(s,a') \\ \dfrac{\varepsilon}{|A(s)|}, & \text{otherwise} \end{cases} \qquad (1)$$

where π(a|s) = P(A_t = a | S_t = s) is the probability that the decision maker selects action a in state s; ε is the probability that the decision maker adopts a random strategy, i.e., selects among the possible actions with equal probability; with probability 1−ε a greedy strategy is adopted, i.e., the action with the largest quality value is selected; A(s) is the set of actions the decision maker may take in state s; and Q(s, a) is the quality obtained by selecting action a in state s;
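As a minimal illustration of equation (1), the following Python sketch selects an action ε-greedily over the valid actions of a state (the function and variable names are ours, not from the patent):

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        # Equation (1): with probability epsilon pick uniformly among the
        # valid actions; otherwise pick the action with the largest Q value.
        if random.random() < epsilon:
            return random.choice(actions)                   # random strategy
        return max(actions, key=lambda a: Q[state][a])      # greedy strategy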
(2) The reward value is calculated according to equation (2). [Equation (2) appears only as an image in the source; per the definitions below, it computes the reward from the link bandwidths, the QoS level of the flow, the endpoint indicator δ(j−d), and the connectivity matrix T.]

where i and j denote nodes in the network, and R_t(S_t, A_t | i→j) is the reward obtained by selecting action A_t (jumping from node i to node j) in state S_t; B_total is the total bandwidth of the link from node i to node j and B its residual bandwidth; B_min is the minimum bandwidth required by the flow (i.e., the flow size); β is the QoS level of the flow; d is the destination node; δ(j−d) is an impulse function whose value is 1 if the next hop j is the end point d; T describes the connectivity of the nodes: T[S_t][A_t] ≠ −1 indicates that node i is connected to node j, and T[S_t][A_t] = −1 indicates that it is not;
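Because equation (2) survives only as an image, the following Python sketch is merely a plausible rendering of the definitions above: the connectivity check via T and the endpoint bonus δ(j−d) are taken from the text, while the exact combination of B_total, B, B_min, and β is our assumption:

    def reward(T, state, action, B_total, B, B_min, beta, dest):
        # Hypothetical rendering of equation (2); only the piecewise
        # structure (connectivity check, bandwidth/QoS-dependent term,
        # endpoint bonus) is taken from the text.
        if T[state][action] == -1:             # node i not connected to node j
            return -1
        if B < B_min:                          # link cannot carry the flow
            return -1                          # assumption: infeasible hop
        bonus = 1 if action == dest else 0     # delta(j - d)
        return beta * (B / B_total) + bonus    # assumed combination of terms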
(3) The quality function is updated with the Q-learning algorithm according to equation (3):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \Big] \qquad (3)$$

where γ ∈ [0, 1] is the discount rate, expressing how important future rewards are relative to the current reward; α ∈ (0, 1] is the learning rate, determining how strongly newly acquired information overrides old information; R_{t+1} is the reward obtained after the action at time t; S_t and A_t are the state and action at time t, and S_{t+1} is the state at time t+1; Q(S_t, A_t) is the quality obtained by taking action A_t in state S_t; and Q(S_{t+1}, a) is the set of qualities obtained by taking the different actions a in state S_{t+1}.
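Equation (3) is the standard Q-learning update; a compact Python sketch under our own naming:

    def q_update(Q, s, a, r, s_next, next_actions, alpha=0.7, gamma=0.8):
        # Equation (3): move Q(s, a) toward the bootstrapped target
        # r + gamma * max over a' of Q(s', a').
        best_next = max(Q[s_next][a2] for a2 in next_actions)
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])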
Compared with the prior art, the invention has the following beneficial effects: the method finds the shortest forwarding path meeting the QoS requirement of each flow according to the QoS grades of different flows, has high link utilization rate and can effectively reduce network congestion.
Drawings
FIG. 1 is a diagram of the SDN route planning architecture;
FIG. 2 is a diagram of the SDN network topology.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
Given that mainstream SDN networks currently forward data packets essentially with the Dijkstra algorithm, the invention applies reinforcement learning to route planning. Exploiting the SDN architecture's centralized control, easy acquisition of link information, and programmability, the acquired current network topology matrix, traffic characteristics, and quality of service (QoS) level are input into a reinforcement learning model, which outputs the optimal forwarding path for the flow from its start point to its end point.
The SDN route planning method based on reinforcement learning provided by the invention exploits reinforcement learning's continuous interaction with the environment and adjustment of the policy; compared with the traditional Dijkstra algorithm it achieves high link utilization and can effectively reduce network congestion. The method designs the reward function according to the QoS levels of different flows and finds the shortest forwarding path that satisfies each flow's QoS requirement.
1. The SDN route planning architecture is shown in FIG. 1. A reinforcement learning model is constructed to generate routes and is deployed on the SDN control plane; the current network topology matrix, the traffic characteristics (the start point, end point, and size of the flow), and the QoS level are input, and after the model has been trained repeatedly on these inputs it can output the optimal forwarding path from start point to end point.
2. For each training episode, set the maximum number of steps for a single episode; select an action a according to the action strategy P, execute a, obtain the next state s' and a reward value r, and update Q(s, a) according to the quality update function; repeat these operations until the end point is reached.
3. The function required by the reinforcement learning model is constructed by the following method:
(1) An action a is selected according to equation (1); the action strategy is the ε-greedy strategy:

$$\pi(a \mid s) = \begin{cases} 1-\varepsilon+\dfrac{\varepsilon}{|A(s)|}, & a = \arg\max_{a' \in A(s)} Q(s,a') \\ \dfrac{\varepsilon}{|A(s)|}, & \text{otherwise} \end{cases} \qquad (1)$$

where π(a|s) = P(A_t = a | S_t = s) is the probability that the decision maker selects action a in state s; ε is the probability that the decision maker adopts a random strategy, i.e., selects among the possible actions with equal probability; with probability 1−ε a greedy strategy is adopted, i.e., the action with the largest quality value is selected; A(s) is the set of actions the decision maker may take in state s; and Q(s, a) is the quality obtained by selecting action a in state s;
(2) The reward value is calculated according to equation (2). [Equation (2) appears only as an image in the source; per the definitions below, it computes the reward from the link bandwidths, the QoS level of the flow, the endpoint indicator δ(j−d), and the connectivity matrix T.]

where i and j denote nodes in the network, and R_t(S_t, A_t | i→j) is the reward obtained by selecting action A_t (jumping from node i to node j) in state S_t; B_total is the total bandwidth of the link from node i to node j and B its residual bandwidth; B_min is the minimum bandwidth required by the flow (i.e., the flow size); β is the QoS level of the flow; d is the destination node; δ(j−d) is an impulse function whose value is 1 if the next hop j is the end point d; T describes the connectivity of the nodes: T[S_t][A_t] ≠ −1 indicates that node i is connected to node j, and T[S_t][A_t] = −1 indicates that it is not;
(3) The quality function is updated with the Q-learning algorithm according to equation (3):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \Big] \qquad (3)$$

where γ ∈ [0, 1] is the discount rate, expressing how important future rewards are relative to the current reward; α ∈ (0, 1] is the learning rate, determining how strongly newly acquired information overrides old information; R_{t+1} is the reward obtained after the action at time t; S_t and A_t are the state and action at time t, and S_{t+1} is the state at time t+1; Q(S_t, A_t) is the quality obtained by taking action A_t in state S_t; and Q(S_{t+1}, a) is the set of qualities obtained by taking the different actions a in state S_{t+1}.
Example:
the specific routing algorithm pseudo-code is described as follows:
[The pseudo-code was rendered as images in the source and is not reproduced here.]
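Since the original pseudo-code is not recoverable, the following self-contained Python sketch reassembles the training loop from the steps described above (episodes with a bounded step count, ε-greedy selection per equation (1), and the update of equation (3)); all names are ours, and the reward shape is an assumption standing in for equation (2):

    import random

    def valid_actions(T, s):
        # Actions available in state s: neighbours j with T[s][j] != -1.
        return [j for j in range(len(T)) if T[s][j] != -1]

    def train_q_routing(T, src, dst, episodes=300, max_steps=50,
                        epsilon=0.1, alpha=0.7, gamma=0.8, beta=1.0):
        # Train a Q matrix for routing from src to dst on topology T.
        # Assumes every switch has at least one neighbour.
        n = len(T)
        Q = [[0.0] * n for _ in range(n)]
        for _ in range(episodes):
            s = src
            for _ in range(max_steps):
                acts = valid_actions(T, s)
                if random.random() < epsilon:               # explore
                    a = random.choice(acts)
                else:                                       # exploit
                    a = max(acts, key=lambda x: Q[s][x])
                r = beta if a == dst else -0.1              # assumed reward shape
                best_next = max(Q[a][x] for x in valid_actions(T, a))
                Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
                s = a
                if s == dst:
                    break
        return Q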
the present invention will be further described with reference to the following examples.
The shortest path planning method involved in the present invention can be described as follows:
in an SDN network with 25 OpenFlow switches and 10 hosts, whose topology is shown in FIG. 2, the topology relationship can be described by a 25 × 25 matrix. The topology matrix T has entry 0 if two switches are connected and −1 if they are not, as shown below. For example, T[0][0] = −1 denotes that switch s1 is not connected to itself, and T[0][1] = 0 denotes that switch s1 is connected to s2. Define the state set S = {s1, s2, s3, …, s24, s25}; the action set for each state s ∈ S is A(s) = {x | T[s][x] ≠ −1}.
[The 25 × 25 topology matrix was rendered as an image in the source and is not reproduced here.]
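For concreteness, a small Python sketch of how such a topology matrix and the derived action sets could be built; the edge list here is illustrative only, since the patent's 25-switch matrix is not reproduced in the source:

    def build_topology(n, edges):
        # T[i][j] = 0 if switches i and j are linked, -1 otherwise,
        # matching the convention described in the text.
        T = [[-1] * n for _ in range(n)]
        for i, j in edges:
            T[i][j] = T[j][i] = 0
        return T

    # Illustrative edges only; the real topology is shown in FIG. 2.
    T = build_topology(25, [(0, 1), (1, 2), (2, 3), (3, 4)])
    A = {s: [x for x in range(25) if T[s][x] != -1] for s in range(25)}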
One host wishes to send a message to another node; the sender is the start point and the receiver is the end point. Given the start point, the end point, and the network topology, the controller performs route planning to obtain the shortest path from start point to end point that satisfies the QoS level.
Randomly select one node as the start point and another as the end point; set the total number of training episodes to 300 and the maximum number of steps per episode to 50. Until the destination node is reached, select a behavior a from all possible behaviors of the current state s, execute a to obtain the next state s', and update Q(s, a) according to the quality-function update formula; repeat until the current state is the target state. The behavior strategy is the ε-greedy strategy (ε = 0.1), the learning rate α is 0.7, and the discount rate γ is 0.8.
The final result of the Q-learning algorithm is a Q matrix, from which a shortest path from start point to end point satisfying the QoS level can be selected. When a service request arrives, the controller can easily find a shortest path satisfying the QoS level from the trained Q matrix according to the source and destination address information carried by the request.
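Extracting the planned path from the trained Q matrix then amounts to a greedy walk over the highest-quality valid actions (a sketch under the same naming assumptions as above):

    def extract_path(Q, T, src, dst, max_hops=50):
        # Follow the highest-quality loop-free action from src until dst.
        path, s = [src], src
        while s != dst and len(path) <= max_hops:
            acts = [j for j in range(len(T))
                    if T[s][j] != -1 and j not in path]
            if not acts:
                return None                  # dead end: no loop-free action
            s = max(acts, key=lambda j: Q[s][j])
            path.append(s)
        return path if s == dst else None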

Claims (2)

1. An SDN route planning method based on reinforcement learning, characterized in that the method comprises: in the SDN control plane, constructing, by means of Q-learning, a reinforcement learning model capable of generating routes, and designing the reward function of the Q-learning algorithm so that different reward values are produced for flows of different QoS levels; and inputting the current network topology matrix, the traffic characteristics, and the QoS level of each flow into the reinforcement learning model for training, thereby realizing flow-differentiated SDN route planning and finding the shortest forwarding path that satisfies each flow's QoS requirement;
the reinforcement learning model is constructed by the following method:
setting the maximum number of steps for a single training episode, selecting an action a according to the action strategy P, executing a, obtaining the next state s' and a reward value r, updating Q(s, a) according to the quality update function, and repeating these operations until the end point is reached;
the function required by the reinforcement learning model is constructed by the following method:
(1) An action a is selected according to equation (1); the action strategy is the ε-greedy strategy:

$$\pi(a \mid s) = \begin{cases} 1-\varepsilon+\dfrac{\varepsilon}{|A(s)|}, & a = \arg\max_{a' \in A(s)} Q(s,a') \\ \dfrac{\varepsilon}{|A(s)|}, & \text{otherwise} \end{cases} \qquad (1)$$

where π(a|s) = P(A_t = a | S_t = s) is the probability that the decision maker selects action a in state s; ε is the probability that the decision maker adopts a random strategy, i.e., selects among the possible actions with equal probability; with probability 1−ε a greedy strategy is adopted, i.e., the action with the largest quality value is selected; A(s) is the set of actions the decision maker may take in state s; and Q(s, a) is the quality obtained by selecting action a in state s;
(2) The reward value is calculated according to equation (2). [Equation (2) appears only as an image in the source; per the definitions below, it computes the reward from the link bandwidths, the QoS level of the flow, the endpoint indicator δ(j−d), and the connectivity matrix T.]

where i and j denote nodes in the network, and R_t(S_t, A_t | i→j) is the reward obtained by selecting action A_t (jumping from node i to node j) in state S_t; B_total is the total bandwidth of the link from node i to node j and B its residual bandwidth; B_min is the minimum bandwidth required by the flow (i.e., the flow size); β is the QoS level of the flow; d is the destination node; δ(j−d) is an impulse function whose value is 1 if the next hop j is the end point d; T describes the connectivity of the nodes: T[S_t][A_t] ≠ −1 indicates that node i is connected to node j, and T[S_t][A_t] = −1 indicates that it is not;
(3) The quality function is updated with the Q-learning algorithm according to equation (3):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \Big] \qquad (3)$$

where γ ∈ [0, 1] is the discount rate, expressing how important future rewards are relative to the current reward; α ∈ (0, 1] is the learning rate, determining how strongly newly acquired information overrides old information; R_{t+1} is the reward obtained after the action at time t; S_t and A_t are the state and action at time t, and S_{t+1} is the state at time t+1; Q(S_t, A_t) is the quality obtained by taking action A_t in state S_t; and Q(S_{t+1}, a) is the set of qualities obtained by taking the different actions a in state S_{t+1}.
2. The reinforcement learning-based SDN route planning method of claim 1, wherein the traffic characteristics comprise the start point, end point, and size of the flow.
CN201811292342.XA 2018-10-31 2018-10-31 SDN route planning method based on reinforcement learning Active CN109361601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811292342.XA CN109361601B (en) 2018-10-31 2018-10-31 SDN route planning method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811292342.XA CN109361601B (en) 2018-10-31 2018-10-31 SDN route planning method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN109361601A CN109361601A (en) 2019-02-19
CN109361601B (en) 2021-03-30

Family

ID=65343754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811292342.XA Active CN109361601B (en) 2018-10-31 2018-10-31 SDN route planning method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN109361601B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110081893B (en) * 2019-04-01 2020-09-25 东莞理工学院 Navigation path planning method based on strategy reuse and reinforcement learning
CN110290510A (en) * 2019-05-07 2019-09-27 天津大学 Support the edge cooperation caching method under the hierarchical wireless networks of D2D communication
CN110365514B (en) * 2019-05-24 2020-10-16 北京邮电大学 SDN multistage virtual network mapping method and device based on reinforcement learning
CN110601973B (en) * 2019-08-26 2022-04-05 中移(杭州)信息技术有限公司 Route planning method, system, server and storage medium
CN110611619B (en) * 2019-09-12 2020-10-09 西安电子科技大学 Intelligent routing decision method based on DDPG reinforcement learning algorithm
CN110995619B (en) * 2019-10-17 2021-09-28 北京邮电大学 Service quality aware virtual network mapping method and device
CN110768906B (en) * 2019-11-05 2022-08-30 重庆邮电大学 SDN-oriented energy-saving routing method based on Q learning
CN110635973B (en) * 2019-11-08 2022-07-12 西北工业大学青岛研究院 Backbone network flow determining method and system based on reinforcement learning
CN110986979B (en) * 2019-11-27 2021-09-10 浙江工商大学 SDN multi-path routing planning method based on reinforcement learning
CN115643210A (en) * 2019-11-30 2023-01-24 华为技术有限公司 Control data packet sending method and system
CN111416771B (en) * 2020-03-20 2022-02-25 深圳市大数据研究院 Method for controlling routing action based on multi-agent reinforcement learning routing strategy
CN111479306B (en) * 2020-04-02 2023-08-04 中国科学院上海微系统与信息技术研究所 Q-learning-based flight ad hoc network QoS routing method
CN111770019B (en) * 2020-05-13 2021-06-15 西安电子科技大学 Q-learning optical network-on-chip self-adaptive route planning method based on Dijkstra algorithm
CN112087489B (en) * 2020-08-05 2023-06-30 北京工联科技有限公司 Relay forwarding selection method and system for online mobile phone game network transmission
CN112039767B (en) * 2020-08-11 2021-08-31 山东大学 Multi-data center energy-saving routing method and system based on reinforcement learning
CN111953603A (en) * 2020-08-20 2020-11-17 福建师范大学 Method for defining Internet of things security routing protocol based on deep reinforcement learning software
CN112260953A (en) * 2020-10-21 2021-01-22 中电积至(海南)信息技术有限公司 Multi-channel data forwarding decision method based on reinforcement learning
CN112365077B (en) * 2020-11-20 2022-06-21 贵州电网有限责任公司 Construction method of intelligent storage scheduling system for power grid defective materials
CN112822109B (en) * 2020-12-31 2023-04-07 上海缔安科技股份有限公司 SDN core network QoS route optimization method based on reinforcement learning
CN113347102B (en) * 2021-05-20 2022-08-16 中国电子科技集团公司第七研究所 SDN link surviving method, storage medium and system based on Q-learning
CN113556287B (en) * 2021-06-15 2022-10-14 南京理工大学 Software defined network routing method based on multi-agent reinforcement learning
CN113610271B (en) * 2021-07-01 2023-05-02 四川大学 Multi-Agent airport scene sliding path planning method based on historical data analysis
CN113507412B (en) * 2021-07-08 2022-04-19 中国人民解放军国防科技大学 SRv6 router progressive deployment method, system and storage medium in network interconnection
CN117033005B (en) * 2023-10-07 2024-01-26 之江实验室 Deadlock-free routing method and device, storage medium and electronic equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571570A (en) * 2011-12-27 2012-07-11 广东电网公司电力科学研究院 Network flow load balancing control method based on reinforcement learning
CN104967533A (en) * 2015-05-26 2015-10-07 国网智能电网研究院 Method and apparatus of adding IEC 61850 configuration interface to SDN controller
CN105007224A (en) * 2015-07-28 2015-10-28 清华大学 System and method for intercommunication between SDN (Software Defined Networking) network and IP (Internet Protocol) network
WO2015167372A1 (en) * 2014-04-29 2015-11-05 Telefonaktiebolaget L M Ericsson (Publ) Identification of suitable network service points
US9197568B2 (en) * 2012-10-22 2015-11-24 Electronics And Telecommunications Research Institute Method for providing quality of service in software-defined networking based network and apparatus using the same
US9225635B2 (en) * 2012-04-10 2015-12-29 International Business Machines Corporation Switch routing table utilizing software defined network (SDN) controller programmed route segregation and prioritization
CN105681191A (en) * 2016-02-25 2016-06-15 武汉烽火网络有限责任公司 SDN (Software Defined Network) platform based on router virtualization and implementation method
CN107547379A (en) * 2016-06-23 2018-01-05 华为技术有限公司 The method and relevant device of route test action are generated in software defined network
CN108401015A (en) * 2018-02-02 2018-08-14 广州大学 A kind of data center network method for routing based on deeply study
CN108540384A (en) * 2018-04-13 2018-09-14 西安交通大学 Intelligent heavy route method and device based on congestion aware in software defined network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10110465B2 (en) * 2016-07-27 2018-10-23 Cisco Technology, Inc. Distributed HSRP gateway in VxLAN flood and learn environment with faster convergence


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SDN下基于强化学习的路由规划算法研究 (Research on route planning algorithms based on reinforcement learning in SDN); 程成; 《中知网》 (CNKI); 2018-04-30; chapters 3-4 *

Also Published As

Publication number Publication date
CN109361601A (en) 2019-02-19

Similar Documents

Publication Publication Date Title
CN109361601B (en) SDN route planning method based on reinforcement learning
CN110986979B (en) SDN multi-path routing planning method based on reinforcement learning
CN112822109B (en) SDN core network QoS route optimization method based on reinforcement learning
CN105471764B (en) A kind of method of end-to-end QoS guarantee in SDN network
CN102651710B (en) Method and system for routing information in a network
WO2017078922A1 (en) Apparatus and method for network flow scheduling
CN109951335B (en) Satellite network delay and rate combined guarantee routing method based on time aggregation graph
CN114422423B (en) Satellite network multi-constraint routing method based on SDN and NDN
CN109413707B (en) Intelligent routing method based on deep reinforcement learning technology in wireless network environment
CN104883304B (en) For part entangled quantum to the method for routing of bridge communications network
CN101155118A (en) BGP routing processing method and device
CN105490962A (en) QoS management method based on OpenFlow network
CN112600759A (en) Multipath traffic scheduling method and system based on deep reinforcement learning under Overlay network
CN101986628B (en) Method for realizing multisource multicast traffic balance based on ant colony algorithm
Oužecki et al. Reinforcement learning as adaptive network routing of mobile agents
Zheng et al. ONU placement in fiber-wireless (FiWi) networks considering peer-to-peer communications
CN105743804A (en) Data flow control method and system
CN109547505B (en) Multipath TCP transmission scheduling method based on reinforcement learning
CN103905318B (en) Send, method, controller and the forward node of loading forwarding-table item
CN106656795A (en) Wireless sensor and actor networks clustering routing method
CN111865789B (en) SR path constraint method based on segment routing
JP2008219067A (en) Route calculation apparatus, method and program
Devarajan et al. An enhanced cluster gateway switch routing protocol (ECGSR) for congestion control using AODV algorithm in MANET
CN113098771A (en) Distributed self-adaptive QoS routing method based on Q learning
CN104053208A (en) Route method and device based on channel allocation in wireless ad hoc network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant