CN111988225A - Multi-path routing method based on reinforcement learning and transfer learning - Google Patents

Multi-path routing method based on reinforcement learning and transfer learning Download PDF

Info

Publication number
CN111988225A
CN111988225A (application number CN202010840208.XA)
Authority
CN
China
Prior art keywords
network
neural network
path
learning
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010840208.XA
Other languages
Chinese (zh)
Other versions
CN111988225B (en)
Inventor
魏雯婷
张瑞卿
伏丽莹
顾华玺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202010840208.XA priority Critical patent/CN111988225B/en
Publication of CN111988225A publication Critical patent/CN111988225A/en
Application granted granted Critical
Publication of CN111988225B publication Critical patent/CN111988225B/en
Active legal status
Anticipated expiration legal status

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/02Topology update or discovery
    • H04L45/08Learning-based routing, e.g. using neural networks or artificial intelligence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/12Shortest path evaluation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/12Avoiding congestion; Recovering from congestion
    • H04L47/125Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering

Abstract

The invention provides a multi-path routing method based on reinforcement learning and transfer learning, which addresses the technical problem of poor load balance among equivalent paths in network environments with scarce traffic data. The implementation steps are: construct a real network Z and an experimental network G whose topology is identical to that of Z; establish a two-dimensional array H; construct a multi-path routing model based on reinforcement learning; initialize a traffic matrix DM and an equivalent-path traffic proportion matrix PM; iteratively train the reinforcement-learning-based multi-path routing model in the experimental network G; migrate the global neural network weight parameters of the routing decision model trained in G to the real network Z using transfer learning; and adaptively train the initialized global neural network in the real network Z to obtain a multi-path routing result that matches the characteristics of the real network environment. The method can be used in data center networks and similar scenarios.

Description

Multi-path routing method based on reinforcement learning and transfer learning
Technical Field
The invention belongs to the technical field of computer networks, and relates to a multipath routing method based on reinforcement learning and transfer learning, which can be used in the fields of data center networks and the like.
Background
In a network, a routing decision specifies how data traffic travels from a designated node to another node, and by scheduling traffic it determines the load balance of the different transmission paths in the network. The metric used to measure network load balance is the difference in bandwidth utilization among all equivalent paths between a pair of communicating nodes: the smaller the difference, the better the equivalent-path load balance. Routing decision methods can be divided into traditional methods and reinforcement-learning-based methods. Traditional routing algorithms design routing rules manually in advance but lack perception of the network state, so some equivalent paths easily become heavily loaded and traffic on high-load paths cannot be shifted to low-load paths, causing load imbalance. Reinforcement learning is a branch of machine learning; a routing algorithm based on reinforcement learning has stronger perception of the network traffic state and can dynamically adjust the amount of data sent on different transmission paths as the network traffic changes. When the load on an equivalent path changes, the algorithm can quickly sense the change, adjust its policy, and move data from high-load paths to low-load paths. However, reinforcement-learning routing based on the Q-learning algorithm cannot be applied to complex network environments, and reinforcement-learning routing based on the DDPG algorithm converges slowly and cannot react effectively to different network scenarios when network traffic data is scarce, so the network load remains unbalanced. Although reinforcement-learning-based routing effectively addresses the difficulty of designing routing algorithms for large-scale complex networks, its training process depends heavily on data, the trained routing decision model generalizes poorly, and the model must be retrained whenever the system changes slightly. Transfer learning is another important branch of machine learning. Unlike traditional machine learning algorithms, in which a machine automatically extracts information from a large amount of data, the core idea of transfer learning is to find similarities between problems, apply past experience and knowledge to similar problems, and use data from other datasets to achieve the training objective.
For example, the patent application with publication number CN109361601A, entitled "SDN route planning method based on reinforcement learning", builds a reinforcement-learning routing decision model using the Q-learning algorithm; network topology information, the traffic matrix, and the QoS levels of flows are used as inputs of the model, which outputs the shortest path that satisfies the requirements. The designed reward function contains the QoS level of the traffic, link bandwidth utilization information, and so on. The reinforcement-learning agent interacts continuously with the network model to try and adjust routing decisions. The method finds the shortest forwarding path for each flow, improves link bandwidth utilization in the network, and reduces congestion. Its shortcoming is that each flow is forwarded only along its selected fixed shortest path, which easily leads to higher load on some paths, larger differences in equivalent-path bandwidth utilization, and unbalanced equivalent-path load.
As another example, the patent application with publication number CN110611619A, entitled "Intelligent routing decision method based on DDPG reinforcement learning", builds a reinforcement-learning routing decision model with the DDPG algorithm. Network traffic matrix information is the input of the algorithm, the optimization target is minimizing the absolute difference between the maximum and minimum bandwidth utilization among the equivalent paths of the network, and load balancing is achieved by dynamically adjusting the amount of data sent on different transmission paths. The method makes full use of the bandwidth resources of different transmission paths and balances the network load, but it judges the quality of the load-balancing state using only the maximum and minimum bandwidth utilization within a group of equivalent paths; the values of the other transmission paths in the group are not used, so the bandwidth of those paths cannot be adjusted effectively and their load remains unbalanced. In addition, traffic data in the network is difficult to collect and the resulting dataset is small, which cannot meet the training requirements of the reinforcement-learning algorithm, so the model performs well only in a small number of network scenarios, and in many cases the load distribution in the network remains unbalanced.
Disclosure of Invention
The present invention aims to overcome the defects of the prior art by providing a multi-path routing method based on reinforcement learning and transfer learning, which solves the technical problem of poor equivalent-path load balance in network environments with scarce traffic data.
The technical idea of the invention is as follows: first, an experimental network with the same topology as the real network is constructed, traffic information in the experimental network is collected and fed into a reinforcement-learning algorithm to compute a routing strategy, and an algorithm model that meets the requirements is obtained through repeated training. Then, the model trained in the experimental network is transferred to the real network, and a network model that matches the characteristics of the real network environment is trained. The specific steps are as follows:
(1) constructing a real network Z and an experimental network G whose topology is identical to that of Z:
constructing a real network Z comprising a server nodes and m switch nodes, and an experimental network G whose topology is identical to that of Z, where each server node is a source node and a destination node for the other server nodes; the n equivalent paths formed by connecting each source node to every other destination node through one or more switch nodes are numbered from 1 to n, where a ≥ 16 and m ≥ 16;
(2) establishing a two-dimensional array H:
establishing a two-dimensional array H with the a source nodes as the abscissa and the a destination nodes as the ordinate, and storing the equivalent-path numbers between each source-destination node pair at the corresponding position of H;
(3) constructing a multi-path routing model based on reinforcement learning:
constructing a multi-path routing model based on the reinforcement-learning algorithm A3C, comprising a global neural network and num_a independent local agents; the global neural network and the local agents all adopt an Actor-Critic neural network structure with L fully connected layers; the weight parameter sets of the Actor and Critic neural networks in the global neural network are θ_g and ω_g respectively, and the weight parameter sets of the Actor and Critic neural networks in each local agent are θ and ω respectively, where num_a ≥ 10 and L ≥ 15;
(4) Initializing traffic matrix DM and equivalent path traffic proportion matrix PM:
initializing a traffic matrix DM and an equivalent-path traffic proportion matrix PM, both of size a × a; each element DM_ij of DM is randomly assigned a value with DM_ij ≥ 0, and every element of PM is assigned an equal value in (0, 1);
(5) performing iterative training on a multi-path routing model based on reinforcement learning in an experimental network G:
(5a) initializing the weight parameter sets θ and ω of the local agents and the weight parameter sets θ_g and ω_g of the global neural network according to the standard normal distribution; initializing an experience replay set D of length N, N > 0; initializing the iteration counter k, the maximum number of iterations K with K ≥ 10^6, and the initial sampling state S_0 of the network environment, with k = 0 and S_0 = 0;
(5b) synchronizing the global neural network weight parameter sets to the num_a local agents, i.e. θ = θ_g, ω = ω_g;
(5c) sending traffic of size DM_ij × PM_ij onto the equivalent paths of G according to the numbers in H, measuring, via the SDN controller, the bandwidth utilization of the bottleneck link of the equivalent path corresponding to each number in H, taking the bottleneck-link bandwidth utilization as the bandwidth utilization of that equivalent path, and taking the bandwidth utilizations S_t of the n equivalent paths as the current sampling state of G;
(5d) using a state-gain algorithm, computing a state-gain vector Φ(ΔS) from the difference between S_t and S_(t-1), while converting S_t into a feature vector Φ(S_t); then taking Φ(S_t) and Φ(ΔS) as inputs of the Actor neural networks of the num_a local agents and computing the routing decision behavior vector A_t;
(5e) obtaining, by the method of step (5c), the bandwidth utilizations S_(t+1) of the n equivalent paths after G executes A_t, taking S_(t+1) as the sampling state of G after the state transition, and computing the reward value R_t of the network environment from S_(t+1);
(5f) combining S_t, A_t, R_t and S_(t+1) into the experience tuple {S_t, A_t, R_t, S_(t+1)} and storing it in the experience replay set D, realizing the state transition of G;
(5g) randomly sampling M samples from D, where {S_k, A_k, R_k, S_(k+1)} denotes the k-th sample, computing the parameter update gradient dω of ω and the parameter update gradient dθ of θ, and updating ω with dω and θ with dθ;
(5h) updating the global neural network weight parameters using the updated ω and θ;
(5i) updating the equivalent-path traffic proportion matrix PM with the behavior value corresponding to each path in the routing decision vector A_t, and judging whether k = K; if so, the routing decision model trained in the experimental network G is obtained; otherwise, let k = k + 1 and return to step (5b);
(6) migrating the global neural network weight parameters in the routing decision model obtained by training in the experimental network G to a real network Z based on a migration learning method:
keeping the weight parameters of the first l layers (l < L) of the global neural network in the routing decision model unchanged, and randomly re-initializing the weight parameters of the remaining L − l layers to serve as the global neural network initialized in the real network Z, thereby completing the migration;
(7) carrying out adaptive training on the global neural network initialized in the real network Z:
initializing the maximum number of iterations K_T with K_T ≥ 10^4, and adaptively training the global neural network initialized in Z within the real network Z according to the method of steps (5b)-(5h), obtaining a routing decision model that matches the characteristics of Z.
Compared with the prior art, the invention has the following advantages:
1. because the method calculates the reinforcement learning reward function by using the variance of the bandwidth utilization rate of the equivalent transmission paths in each group, the algorithm finally aims at minimizing the bandwidth utilization rate difference of all equivalent transmission paths in each group, and continuously adjusts the parameters of the algorithm, thereby obtaining a final routing decision model, being capable of accurately routing data on paths with higher load to paths with lower load.
2. Because the invention adopts the A3C reinforcement-learning algorithm, multiple local agents are trained simultaneously, which breaks the correlation of the traffic data, improves the convergence of the routing decision model, and allows the obtained model to adjust the traffic proportion of each equivalent path more accurately.
3. The invention uses transfer learning so that the reinforcement-learning-based multi-path routing algorithm can be applied more effectively in network environments with scarce traffic data, which increases the practical value of the reinforcement-learning routing algorithm and prevents the routing decision model from over-fitting a small number of network states while performing poorly in others. Compared with the prior art, transfer learning gives the model stronger generalization ability, improving the performance of the routing decision model in different network scenarios and ensuring load balance of the network; compared with routing algorithms that use reinforcement learning alone, the invention improves the adaptability of the routing decision model to the network environment and ensures load balance in different network scenarios.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
FIG. 2 is a flow chart of the iterative training of a single agent in the routing method based on reinforcement learning and transfer learning.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments.
Referring to fig. 1, the implementation steps of the invention are as follows:
step 1), constructing a real network Z and an experimental network G whose topology is identical to that of Z:
constructing a real network Z comprising a server nodes and m switch nodes, and an experimental network G whose topology is identical to that of Z, where each server node is a source node and a destination node for the other server nodes; the n equivalent paths formed by connecting each source node to every other destination node through one or more switch nodes are numbered from 1 to n, where a ≥ 16 and m ≥ 16. In this example a fat-tree topology comprising 16 server nodes is selected, in which a = 16 and m = 20;
step 2), establishing a two-dimensional array H:
establishing a two-dimensional array H with the a source nodes as the abscissa and the a destination nodes as the ordinate, and storing the equivalent-path numbers between each source-destination node pair at the corresponding position of H;
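For illustration, a minimal Python sketch of this step is given below; the helper structure `equal_cost_paths` (a mapping from a source-destination pair to its equivalent-path numbers) is an assumption, since how the paths are enumerated depends on the concrete topology:

```python
import numpy as np

def build_path_table(a, equal_cost_paths):
    """Build the two-dimensional array H described in step 2.

    `equal_cost_paths` is assumed to map a (source, destination) server pair
    to the list of path numbers (1..n) of its equivalent paths.
    """
    # H[src, dst] holds the equivalent-path numbers for that source-destination pair.
    H = np.empty((a, a), dtype=object)
    for src in range(a):
        for dst in range(a):
            H[src, dst] = [] if src == dst else list(equal_cost_paths[(src, dst)])
    return H

# Toy usage with 4 servers and two equivalent paths per pair (path numbers illustrative only):
toy_paths = {(s, d): [1, 2] for s in range(4) for d in range(4) if s != d}
H = build_path_table(4, toy_paths)
print(H[0, 1])  # -> [1, 2]
```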
step 3), constructing a multi-path routing model based on reinforcement learning:
constructing a multi-path routing model based on the reinforcement-learning algorithm A3C, comprising a global neural network and num_a independent local agents; the global neural network and the local agents all adopt an Actor-Critic neural network structure with L fully connected layers; the weight parameter sets of the Actor and Critic neural networks in the global neural network are θ_g and ω_g respectively, and the weight parameter sets of the Actor and Critic neural networks in each local agent are θ and ω respectively, where num_a ≥ 10 and L ≥ 15. The A3C algorithm uses a distributed scheme to improve the convergence of the Actor-Critic neural networks and has low complexity. In this example num_a = 10 agents perform the training and L = 15. The Actor neural network computes the routing behavior, and the Critic neural network evaluates the routing result computed by the Actor neural network from the difference between two adjacent network-state value estimates; this evaluation increases the probability that a good decision of the Actor neural network is selected;
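A minimal PyTorch sketch of such an Actor-Critic pair with L fully connected layers is shown below; the hidden width, the softmax output of the Actor (yielding per-path traffic proportions), and the scalar Critic output are assumptions not fixed by the patent:

```python
import torch
import torch.nn as nn

class ActorNet(nn.Module):
    """Policy network: maps the state features to a routing behavior vector."""
    def __init__(self, state_dim, n_paths, n_layers=15, hidden=128):
        super().__init__()
        dims = [state_dim] + [hidden] * (n_layers - 1)
        self.body = nn.Sequential(*[
            layer for i in range(n_layers - 1)
            for layer in (nn.Linear(dims[i], dims[i + 1]), nn.ReLU())
        ])
        self.head = nn.Linear(hidden, n_paths)

    def forward(self, x):
        # Softmax keeps the per-path traffic proportions positive and normalized.
        return torch.softmax(self.head(self.body(x)), dim=-1)

class CriticNet(nn.Module):
    """Value network: estimates the value of the current network state."""
    def __init__(self, state_dim, n_layers=15, hidden=128):
        super().__init__()
        dims = [state_dim] + [hidden] * (n_layers - 1)
        self.body = nn.Sequential(*[
            layer for i in range(n_layers - 1)
            for layer in (nn.Linear(dims[i], dims[i + 1]), nn.ReLU())
        ])
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        return self.head(self.body(x))
```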
step 4), initializing a flow matrix DM and an equivalent path flow ratio matrix PM:
initializing a traffic matrix DM and an equivalent-path traffic proportion matrix PM, both of size a × a; each element DM_ij of DM is randomly assigned a value with DM_ij ≥ 0, and every element of PM is assigned an equal value in (0, 1);
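A NumPy sketch of this initialization follows; the uniform demand range and the choice of the equal initial share 1/n_paths_per_pair are assumptions, since the patent only requires DM_ij ≥ 0 and an equal value in (0, 1):

```python
import numpy as np

def init_demand_and_proportions(a, n_paths_per_pair, max_demand=100.0, seed=0):
    """Initialize the traffic matrix DM and the traffic-proportion matrix PM (step 4)."""
    rng = np.random.default_rng(seed)
    DM = rng.uniform(0.0, max_demand, size=(a, a))   # random non-negative demands
    np.fill_diagonal(DM, 0.0)                        # no traffic from a node to itself
    PM = np.full((a, a), 1.0 / n_paths_per_pair)     # equal split over equivalent paths
    return DM, PM

DM, PM = init_demand_and_proportions(a=16, n_paths_per_pair=4)
```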
step 5), performing iterative training on the multi-path routing model based on reinforcement learning in the experimental network G:
(5a) initializing the weight parameter sets θ and ω of the local agents and the weight parameter sets θ_g and ω_g of the global neural network according to the standard normal distribution, completing the random initialization of the neural networks; initializing an experience replay set D of length N, N ≥ 10^4; initializing the iteration counter k, the maximum number of iterations K with K ≥ 10^6, and the initial sampling state S_0 of the network environment, with k = 0 and S_0 = 0. The experience replay set in this example stores the network states and change information collected over a period of time, with N = 10^4.
Referring to FIG. 2, the specific steps for training the routing decision model of a single agent are described in further detail:
(5b) synchronizing the global neural network weight parameter sets to the num_a local agents, i.e. θ = θ_g, ω = ω_g;
(5c) sending traffic of size DM_ij × PM_ij onto the equivalent paths of G according to the numbers in H, measuring, via the SDN controller, the bandwidth utilization of the bottleneck link of the equivalent path corresponding to each number in H, taking the bottleneck-link bandwidth utilization as the bandwidth utilization of that equivalent path, and taking the bandwidth utilizations S_t of the n equivalent paths as the current sampling state of G. Here the bottleneck link is the link with the maximum bandwidth utilization within an equivalent path whose source node and destination node are the same; each element of the traffic demand matrix DM represents the amount of traffic to be sent from one source node to a destination node, and each entry of the equivalent-path traffic proportion matrix PM represents the share of that traffic to be carried by each path of the corresponding group of equivalent paths;
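A small sketch of how the per-path state S_t might be derived from per-link measurements is given below; the data structures `path_links` and `link_utilization` are assumptions standing in for the information reported by the SDN controller:

```python
import numpy as np

def path_utilizations(path_links, link_utilization):
    """Step (5c): derive per-path bandwidth utilization from per-link measurements.

    `path_links` maps each equivalent-path number (1..n) to the list of link ids
    it traverses; `link_utilization` maps a link id to its measured utilization.
    A path's utilization is that of its bottleneck link, i.e. the maximum over
    the links it traverses.
    """
    ordered_paths = sorted(path_links)  # path numbers 1..n in order
    return np.array([max(link_utilization[l] for l in path_links[p])
                     for p in ordered_paths])
```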
(5d) using a state-gain algorithm, computing a state-gain vector Φ(ΔS) from the difference between S_t and S_(t-1), while converting S_t into a feature vector Φ(S_t); then taking Φ(S_t) and Φ(ΔS) as inputs of the Actor neural networks of the num_a local agents and computing the routing decision behavior vector A_t according to:
A_t = π((Φ(S_t) + Φ(ΔS)), θ)
where π denotes the behavior decision algorithm, whose input is the sum of the two feature vectors Φ(S_t) and Φ(ΔS);
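For illustration, the action computation can be sketched with the ActorNet assumed earlier; treating the feature maps Φ as the identity on the utilization vectors is an assumption, since the patent leaves them unspecified:

```python
import torch

def select_action(actor, S_t, S_prev):
    """Step (5d): a minimal sketch of A_t = pi(Phi(S_t) + Phi(dS), theta)."""
    phi_S = torch.as_tensor(S_t, dtype=torch.float32)
    phi_dS = torch.as_tensor(S_t - S_prev, dtype=torch.float32)  # state-gain vector
    with torch.no_grad():
        A_t = actor(phi_S + phi_dS)  # routing behavior vector over the n paths
    return A_t
```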
(5e) obtaining, by the method of step (5c), the bandwidth utilizations S_(t+1) of the n equivalent paths after G executes A_t, taking S_(t+1) as the sampling state of G after the state transition, and computing the reward value R_t of the network environment from S_(t+1). The calculation groups together the equivalent paths that share the same source node and destination node, computes the variance of the bandwidth utilization within each group, and sums the variances of all groups to obtain the reward value R_t of the reinforcement-learning algorithm. Because the degree of load balance and the network throughput are unified, in a multi-path network structure balancing the load of the equivalent paths improves the overall throughput of the network, so minimizing the difference in bandwidth utilization among the equivalent paths is taken as the optimization target;
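A sketch of this reward calculation follows; `path_groups` is an assumed mapping from a source-destination pair to the indices of its equivalent paths, and whether the sum of variances is used directly or negated so that better balance yields a larger reward is left as an implementation choice:

```python
import numpy as np

def reward(S_next, path_groups):
    """Step (5e): sum over groups of the within-group variance of bandwidth utilization."""
    total_variance = 0.0
    for indices in path_groups.values():
        group = np.asarray([S_next[i] for i in indices])
        total_variance += group.var()
    return total_variance

# Toy usage: two groups of two equivalent paths each.
S_next = np.array([0.30, 0.50, 0.40, 0.40])
groups = {("h0", "h1"): [0, 1], ("h0", "h2"): [2, 3]}
print(reward(S_next, groups))  # 0.01 + 0.0 = 0.01
```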
(5f) combining S_t, A_t, R_t and S_(t+1) into the experience tuple {S_t, A_t, R_t, S_(t+1)} and storing it in the experience replay set D, realizing the state transition of G;
(5g) randomly sampling M samples from D, M ≥ 128, where {S_k, A_k, R_k, S_(k+1)} denotes the k-th sample; computing the parameter update gradient dω of ω and updating the local Critic neural network weight parameters ω, and computing the parameter update gradient dθ of θ and updating the local Actor neural network weight parameters θ. In this example M = 256, and the calculation formulas are:
dθ ← dθ + ∇_θ log π(A_k | S_k; θ)(R − V(S_k; ω))
dω ← dω + ∇_ω (R − V(S_k; ω))^2
θ = θ − α·dθ
ω = ω − β·dω
where V denotes the behavior value algorithm and π denotes the behavior decision algorithm. Random noise values are added to θ and ω; adding noise to the neural network parameters instead of the action space is more reasonable. The noise values are random numbers in the range 0 to 0.3, and their amplitude decreases as the iteration count k increases, so that the influence of the noise on the agent's decisions gradually diminishes and the probability that the agent's Actor neural network outputs the optimal routing result increases. α and β are the learning rates of the Actor and Critic neural networks in the local agent, respectively, both set to the constant 0.01;
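A PyTorch sketch of the local update follows, assuming the ActorNet/CriticNet sketched earlier and a batch of sampled transitions (S_k, A_k, R_k, S_(k+1)); taking the return R to be the immediate reward R_k and approximating log π by the mean log of the output proportions are simplifications of this sketch, not the patent's definition:

```python
import torch

def local_update(actor, critic, batch, alpha=0.01, beta=0.01):
    """Step (5g): accumulate dtheta/domega from a sampled batch and apply them."""
    S = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _, _, _ in batch])
    R = torch.tensor([r for _, _, r, _ in batch], dtype=torch.float32).unsqueeze(-1)

    V = critic(S)                               # V(S_k; omega)
    advantage = (R - V).detach()                # (R - V), held constant for the actor

    # log pi(A_k | S_k; theta), approximated by the mean log output proportion.
    log_pi = torch.log(actor(S) + 1e-8).mean(dim=-1, keepdim=True)

    actor_loss = -(log_pi * advantage).mean()   # dtheta ~ grad of log pi * (R - V)
    critic_loss = ((R - V) ** 2).mean()         # domega ~ grad of (R - V)^2

    actor.zero_grad(); critic.zero_grad()
    (actor_loss + critic_loss).backward()

    with torch.no_grad():                       # theta -= alpha*dtheta, omega -= beta*domega
        for p in actor.parameters():
            p -= alpha * p.grad
        for p in critic.parameters():
            p -= beta * p.grad
```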
(5h) updating the global neural network weight parameters using the updated ω and θ, as follows:
θg←τθg+(1-τ)θ
ωg←τωg+(1-τ)ω
where τ is the learning efficiency, with τ = 0.8;
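This soft update can be sketched as below for the PyTorch networks assumed earlier; the same routine would be applied to the Actor pair (θ_g, θ) and the Critic pair (ω_g, ω):

```python
import torch

def update_global(global_net, local_net, tau=0.8):
    """Step (5h): theta_g <- tau*theta_g + (1-tau)*theta (and likewise omega_g)."""
    with torch.no_grad():
        for g_param, l_param in zip(global_net.parameters(), local_net.parameters()):
            g_param.mul_(tau).add_((1.0 - tau) * l_param)

# update_global(global_actor, local_actor); update_global(global_critic, local_critic)
```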
(5i) updating the equivalent-path traffic proportion matrix PM with the behavior value corresponding to each path in the routing decision vector A_t, and judging whether k = K; if so, the trained routing decision model is obtained; otherwise let k = k + 1 and return to step (5b); in this example K = 10^6;
Step 6), migrating the global neural network weight parameters in the routing decision model trained in the experimental network G to a real network Z based on a migration learning method:
keeping the weight parameters of the first l layers of the global neural network in the routing decision model trained in the experimental network G unchanged, and randomly re-initializing the weight parameters of the remaining L − l layers, with the result serving as the initialized global neural network in the real network Z, which completes the migration. Because a local change of the network environment has little influence on the state distribution of the network traffic data, the experience and knowledge of the routing decision model in the original network can be retained, which improves the convergence of the model in the real network environment and accelerates its training; the lower-layer neural network parameters are mainly used to extract and perceive the general characteristics of the communication network, so the weight parameters of the first l layers are kept unchanged. In this example L = 15 and l = 10;
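A sketch of this layer-wise transfer is given below, assuming the networks are built from nn.Linear layers as in the earlier sketch; the first `keep_layers` linear layers are copied from the model trained in G, while the remaining layers keep their fresh random initialization for adaptation in Z:

```python
import torch

def transfer_weights(trained_net, fresh_net, keep_layers=10):
    """Step 6: copy the first l layers from the trained model, keep the rest random."""
    trained_linears = [m for m in trained_net.modules() if isinstance(m, torch.nn.Linear)]
    fresh_linears = [m for m in fresh_net.modules() if isinstance(m, torch.nn.Linear)]
    with torch.no_grad():
        for t_layer, f_layer in zip(trained_linears[:keep_layers], fresh_linears[:keep_layers]):
            f_layer.weight.copy_(t_layer.weight)
            f_layer.bias.copy_(t_layer.bias)
    return fresh_net
```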
step 7), performing adaptive training on the global neural network initialized in the real network Z:
initializing the maximum number of iterations K_T with K_T ≥ 10^4, and adaptively training the global neural network initialized in Z within the real network Z according to the method of steps (5b)-(5h), obtaining a routing decision model that matches the characteristics of Z. In this example K_T = 10^4, far fewer iterations than in the experimental network, which shows that the training of the model is accelerated.

Claims (6)

1. A multipath routing method based on reinforcement learning and transfer learning is characterized by comprising the following steps:
(1) constructing a real network Z and an experimental network G whose topology is identical to that of Z:
constructing a real network Z comprising a server nodes and m switch nodes, and an experimental network G whose topology is identical to that of Z, where each server node is a source node and a destination node for the other server nodes; the n equivalent paths formed by connecting each source node to every other destination node through one or more switch nodes are numbered from 1 to n, where a ≥ 16 and m ≥ 16;
(2) establishing a two-dimensional array H:
establishing a two-dimensional array H with the a source nodes as the abscissa and the a destination nodes as the ordinate, and storing the equivalent-path numbers between each source-destination node pair at the corresponding position of H;
(3) constructing a multi-path routing model based on reinforcement learning:
constructing a multi-path routing model based on the reinforcement-learning algorithm A3C, comprising a global neural network and num_a independent local agents; the global neural network and the local agents all adopt an Actor-Critic neural network structure with L fully connected layers; the weight parameter sets of the Actor and Critic neural networks in the global neural network are θ_g and ω_g respectively, and the weight parameter sets of the Actor and Critic neural networks in each local agent are θ and ω respectively, where num_a ≥ 10 and L ≥ 15;
(4) Initializing traffic matrix DM and equivalent path traffic proportion matrix PM:
initializing a traffic matrix DM and an equivalent-path traffic proportion matrix PM, both of size a × a; each element DM_ij of DM is randomly assigned a value with DM_ij ≥ 0, and every element of PM is assigned an equal value in (0, 1);
(5) performing iterative training on a multi-path routing model based on reinforcement learning in an experimental network G:
(5a) initializing the weight parameter sets θ and ω of the local agents and the weight parameter sets θ_g and ω_g of the global neural network according to the standard normal distribution; initializing an experience replay set D of length N, N ≥ 10^4; initializing the iteration counter k, the maximum number of iterations K with K ≥ 10^6, and the initial sampling state S_0 of the network environment, with k = 0 and S_0 = 0;
(5b) synchronizing the global neural network weight parameters θ_g and ω_g to the num_a local agents, i.e. θ = θ_g, ω = ω_g;
(5c) sending traffic of size DM_ij × PM_ij onto the equivalent paths of G according to the numbers in H, measuring, via the SDN controller, the bandwidth utilization of the bottleneck link of the equivalent path corresponding to each number in H, taking the bottleneck-link bandwidth utilization as the bandwidth utilization of that equivalent path, and taking the bandwidth utilizations S_t of the n equivalent paths as the current sampling state of G;
(5d) using a state-gain algorithm, computing a state-gain vector Φ(ΔS) from the difference between S_t and S_(t-1), while converting S_t into a feature vector Φ(S_t); then taking Φ(S_t) and Φ(ΔS) as inputs of the Actor neural networks of the num_a local agents and computing the routing decision behavior vector A_t;
(5e) obtaining, by the method of step (5c), the bandwidth utilizations S_(t+1) of the n equivalent paths after G executes A_t, taking S_(t+1) as the sampling state of G after the state transition, and computing the reward value R_t of the network environment from S_(t+1);
(5f) combining S_t, A_t, R_t and S_(t+1) into the experience tuple {S_t, A_t, R_t, S_(t+1)} and storing it in the experience replay set D, realizing the state transition of G;
(5g) randomly sampling M samples from D, where M ≥ 128 and {S_k, A_k, R_k, S_(k+1)} denotes the k-th sample, computing the parameter update gradient dω of ω and the parameter update gradient dθ of θ, and updating ω with dω and θ with dθ;
(5h) updating the global neural network weight parameters using the updated ω and θ;
(5i) updating the equivalent-path traffic proportion matrix PM with the behavior value corresponding to each path in the routing decision vector A_t, and judging whether k = K; if so, the routing decision model trained in the experimental network G is obtained; otherwise, let k = k + 1 and return to step (5b);
(6) migrating global neural network weight parameters in a routing decision model trained in an experimental network G to a real network Z based on a migration learning method:
keeping the weight parameters of the first l layers (l < L) of the global neural network in the routing decision model trained in the experimental network G unchanged, and randomly re-initializing the weight parameters of the remaining L − l layers to serve as the global neural network initialized in the real network Z, thereby completing the migration;
(7) carrying out adaptive training on the global neural network initialized in the real network Z:
initializing the maximum number of iterations K_T with K_T ≥ 10^4, and adaptively training the global neural network initialized in Z within the real network Z according to the method of steps (5b)-(5h), obtaining a routing decision model that matches the characteristics of Z.
2. The multi-path routing method based on reinforcement learning and transfer learning of claim 1, wherein the bottleneck link in step (5c) is the link with the maximum bandwidth utilization within an equivalent path whose source node and destination node are the same.
3. The multi-path routing method based on reinforcement learning and transfer learning of claim 1, wherein the routing decision behavior vector A_t in step (5d) is calculated by the formula:
A_t = π((Φ(S_t) + Φ(ΔS)), θ)
where π denotes the behavior decision algorithm.
4. The multi-path routing method based on reinforcement learning and transfer learning of claim 1, wherein the reward value R_t in step (5e) is calculated by grouping together the equivalent paths that share the same source node and destination node, computing the variance of the bandwidth utilization within each group, and summing the variances of all groups to obtain the reward value R_t of the reinforcement-learning algorithm.
5. The multi-path routing method based on reinforcement learning and transfer learning of claim 1, wherein in step (5g) the parameter update gradient dω of ω and the parameter update gradient dθ of θ are calculated, ω is updated with dω and θ is updated with dθ, according to:
dθ ← dθ + ∇_θ log π(A_k | S_k; θ)(R − V(S_k; ω))
dω ← dω + ∇_ω (R − V(S_k; ω))^2
θ = θ − α·dθ
ω = ω − β·dω
where V denotes the behavior value algorithm, π denotes the behavior decision algorithm, the random noise values added to θ and ω are random numbers in the range 0 to 0.3, and α and β are the learning rates of the Actor and Critic neural networks in the local agent, respectively, both set to the constant 0.01.
6. The multi-path routing method based on reinforcement learning and transfer learning of claim 1, wherein in step (5h) the global neural network weight parameters are updated using the updated ω and θ, as follows:
θg←τθg+(1-τ)θ
ωg←τωg+(1-τ)ω
where τ is the learning efficiency, with τ = 0.8.
CN202010840208.XA 2020-08-19 2020-08-19 Multi-path routing method based on reinforcement learning and transfer learning Active CN111988225B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010840208.XA CN111988225B (en) 2020-08-19 2020-08-19 Multi-path routing method based on reinforcement learning and transfer learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010840208.XA CN111988225B (en) 2020-08-19 2020-08-19 Multi-path routing method based on reinforcement learning and transfer learning

Publications (2)

Publication Number Publication Date
CN111988225A (en) 2020-11-24
CN111988225B (en) 2022-03-04

Family

ID=73435767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010840208.XA Active CN111988225B (en) 2020-08-19 2020-08-19 Multi-path routing method based on reinforcement learning and transfer learning

Country Status (1)

Country Link
CN (1) CN111988225B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112783199A (en) * 2020-12-25 2021-05-11 北京航空航天大学 Unmanned aerial vehicle autonomous navigation method based on transfer learning
CN112822109A (en) * 2020-12-31 2021-05-18 上海缔安科技股份有限公司 SDN core network QoS route optimization algorithm based on reinforcement learning
CN112866015A (en) * 2021-01-07 2021-05-28 华东师范大学 Intelligent energy-saving control method based on data center network flow prediction and learning
CN113518039A (en) * 2021-03-03 2021-10-19 山东大学 Deep reinforcement learning-based resource optimization method and system under SDN architecture
CN114221691A (en) * 2021-12-17 2022-03-22 南京工业大学 Software-defined air-space-ground integrated network route optimization method based on deep reinforcement learning
CN114866494A (en) * 2022-07-05 2022-08-05 之江实验室 Reinforced learning intelligent agent training method, modal bandwidth resource scheduling method and device
CN115022231A (en) * 2022-06-30 2022-09-06 武汉烽火技术服务有限公司 Optimal path planning method and system based on deep reinforcement learning
CN117202239A (en) * 2023-11-06 2023-12-08 深圳市四海伽蓝电子科技有限公司 Method and system for unified management of wireless network bridge network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101835239A (en) * 2010-03-09 2010-09-15 西安电子科技大学 Multi-path delay sensing optimal route selecting method for cognitive network
CN110611619A (en) * 2019-09-12 2019-12-24 西安电子科技大学 Intelligent routing decision method based on DDPG reinforcement learning algorithm
CN110703766A (en) * 2019-11-07 2020-01-17 南京航空航天大学 Unmanned aerial vehicle path planning method based on transfer learning strategy deep Q network
US20200195577A1 (en) * 2018-12-17 2020-06-18 Electronics And Telecommunications Research Institute System and method for selecting optimal path in multi-media multi-path network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101835239A (en) * 2010-03-09 2010-09-15 西安电子科技大学 Multi-path delay sensing optimal route selecting method for cognitive network
US20200195577A1 (en) * 2018-12-17 2020-06-18 Electronics And Telecommunications Research Institute System and method for selecting optimal path in multi-media multi-path network
CN110611619A (en) * 2019-09-12 2019-12-24 西安电子科技大学 Intelligent routing decision method based on DDPG reinforcement learning algorithm
CN110703766A (en) * 2019-11-07 2020-01-17 南京航空航天大学 Unmanned aerial vehicle path planning method based on transfer learning strategy deep Q network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
XINGYI CHENG, ET AL.: "DeepTransport: Learning Spatial-Temporal Dependency for Traffic Condition Forecasting", IEEE *
LIU BO ET AL.: "Knowledge transfer reinforcement learning between heterogeneous agents", Sciencepaper Online *
WANG GUIZHI ET AL.: "A survey on the application of machine learning in SDN routing optimization", Journal of Computer Research and Development *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112783199B (en) * 2020-12-25 2022-05-13 北京航空航天大学 Unmanned aerial vehicle autonomous navigation method based on transfer learning
CN112783199A (en) * 2020-12-25 2021-05-11 北京航空航天大学 Unmanned aerial vehicle autonomous navigation method based on transfer learning
CN112822109A (en) * 2020-12-31 2021-05-18 上海缔安科技股份有限公司 SDN core network QoS route optimization algorithm based on reinforcement learning
CN112822109B (en) * 2020-12-31 2023-04-07 上海缔安科技股份有限公司 SDN core network QoS route optimization method based on reinforcement learning
CN112866015A (en) * 2021-01-07 2021-05-28 华东师范大学 Intelligent energy-saving control method based on data center network flow prediction and learning
CN113518039B (en) * 2021-03-03 2023-03-24 山东大学 Deep reinforcement learning-based resource optimization method and system under SDN architecture
CN113518039A (en) * 2021-03-03 2021-10-19 山东大学 Deep reinforcement learning-based resource optimization method and system under SDN architecture
CN114221691A (en) * 2021-12-17 2022-03-22 南京工业大学 Software-defined air-space-ground integrated network route optimization method based on deep reinforcement learning
CN115022231A (en) * 2022-06-30 2022-09-06 武汉烽火技术服务有限公司 Optimal path planning method and system based on deep reinforcement learning
CN115022231B (en) * 2022-06-30 2023-11-03 武汉烽火技术服务有限公司 Optimal path planning method and system based on deep reinforcement learning
CN114866494A (en) * 2022-07-05 2022-08-05 之江实验室 Reinforced learning intelligent agent training method, modal bandwidth resource scheduling method and device
CN114866494B (en) * 2022-07-05 2022-09-20 之江实验室 Reinforced learning intelligent agent training method, modal bandwidth resource scheduling method and device
CN117202239A (en) * 2023-11-06 2023-12-08 深圳市四海伽蓝电子科技有限公司 Method and system for unified management of wireless network bridge network
CN117202239B (en) * 2023-11-06 2024-02-20 深圳市四海伽蓝电子科技有限公司 Method and system for unified management of wireless network bridge network

Also Published As

Publication number Publication date
CN111988225B (en) 2022-03-04

Similar Documents

Publication Publication Date Title
CN111988225B (en) Multi-path routing method based on reinforcement learning and transfer learning
CN110611619B (en) Intelligent routing decision method based on DDPG reinforcement learning algorithm
CN109818865B (en) SDN enhanced path boxing device and method
CN112437020B (en) Data center network load balancing method based on deep reinforcement learning
CN113328938B (en) Network autonomous intelligent management and control method based on deep reinforcement learning
CN109039942B (en) Network load balancing system and balancing method based on deep reinforcement learning
CN112486690B (en) Edge computing resource allocation method suitable for industrial Internet of things
CN108566659B (en) 5G network slice online mapping method based on reliability
CN110784366B (en) Switch migration method based on IMMAC algorithm in SDN
CN105515987B (en) A kind of mapping method based on SDN framework Virtual optical-fiber networks
CN114567598B (en) Load balancing method and device based on deep learning and cross-domain cooperation
CN110891019B (en) Data center flow scheduling method based on load balancing
CN111917642B (en) SDN intelligent routing data transmission method for distributed deep reinforcement learning
CN116527567B (en) Intelligent network path optimization method and system based on deep reinforcement learning
CN114143264B (en) Flow scheduling method based on reinforcement learning under SRv network
CN108964746B (en) Time-varying satellite network multi-topology searching shortest routing method
CN108400940B (en) A kind of multicast virtual network function dispositions method based on Estimation of Distribution Algorithm
CN114707575B (en) SDN multi-controller deployment method based on AP clustering
Hu et al. EARS: Intelligence-driven experiential network architecture for automatic routing in software-defined networking
CN108111335A (en) A kind of method and system dispatched and link virtual network function
CN113612692B (en) Centralized optical on-chip network self-adaptive route planning method based on DQN algorithm
CN111885493B (en) Micro-cloud deployment method based on improved cuckoo search algorithm
CN115225561A (en) Route optimization method and system based on graph structure characteristics
CN114828146A (en) Routing method for geographical position of unmanned cluster based on neural network and iterative learning
CN115225512B (en) Multi-domain service chain active reconfiguration mechanism based on node load prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant