CN111917642A - SDN intelligent routing data transmission method for distributed deep reinforcement learning - Google Patents
SDN intelligent routing data transmission method for distributed deep reinforcement learning
- Publication number
- CN111917642A (application CN202010673851.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/12—Shortest path evaluation
- H04L45/124—Shortest path evaluation using a combination of metrics
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/12—Shortest path evaluation
- H04L45/121—Shortest path evaluation by minimising delays
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/12—Shortest path evaluation
- H04L45/125—Shortest path evaluation based on throughput or bandwidth
Abstract
The invention discloses an SDN intelligent routing data transmission method based on distributed deep reinforcement learning, which computes routing paths quickly, maximizes throughput while guaranteeing delay, and overcomes the low speed and low throughput of traditional algorithms. The reinforcement learning algorithm reduces route calculation to a simple input-output mapping and avoids the repeated iterations of conventional computation, so routing paths are computed quickly; the routing algorithm is faster, forwarding delay is reduced, packets that would previously have been dropped when their TTL expired are more likely to survive and be forwarded successfully, and network throughput increases. The invention has an off-line training stage and an on-line training stage, and updates its parameters in a dynamic environment to select the optimal path, giving it topology adaptivity.
Description
Technical Field
The invention belongs to the field of data transmission, and particularly relates to an SDN intelligent routing data transmission method for distributed deep reinforcement learning.
Background
Information technology is now in a mature stage. In an SDN (Software Defined Network) architecture, data flows are flexible and controllable: the controller has a full network view and can sense network state changes (such as traffic distribution, congestion and link utilization) in real time. In practice, however, the routing problem is usually solved with a shortest path algorithm, taking simple network parameters (such as path hop count or delay) as the optimization index, so that finding the path with the fewest hops or the lowest delay becomes the algorithm's final goal. Such a single metric and optimization target easily congests some critical links and thus unbalances the network load. Although a shortest-path routing algorithm based on Lagrangian relaxation can find an optimal path under multiple compound constraints when distributing paths for multiple services, such a heuristic routing algorithm reaches the optimal path only after many iterations, and therefore converges slowly, responds poorly in real time, and achieves low throughput.
Disclosure of Invention
To address the defects of the prior art, the present invention provides an SDN intelligent routing data transmission method for distributed deep reinforcement learning that solves the problems described above.
To achieve the purpose of the invention, the invention adopts the following technical scheme: an SDN intelligent routing data transmission method for distributed deep reinforcement learning, comprising the following steps:
S1, construct a reward function and a deep reinforcement learning model comprising an actor network and an evaluator network, and deploy the deep reinforcement learning model in the application layer of the SDN network;
S2, randomly initialize the actor network parameter θ_a and the evaluator network parameter θ_c of the deep reinforcement learning model;
S3, randomly initialize the local actor parameter θ'_a of the actor network and the local evaluator parameter θ'_c of the evaluator network on the i-th local GPU_i in the control layer of the SDN network;
S4, according to the reward function, the actor network parameter θ_a, the evaluator network parameter θ_c, the local actor parameter θ'_a and the local evaluator parameter θ'_c, train the deep reinforcement learning model on the i-th local GPU_i off-line with the A3C algorithm, and update the actor network parameter θ_a and the evaluator network parameter θ_c;
S5, apply the updated actor network parameter θ_a and the updated evaluator network parameter θ_c to the whole SDN network, and transmit data with the SDN network after the parameter update;
S6, periodically detect whether the topology of the SDN network has changed; if so, go to step S7, otherwise repeat step S6;
S7, train the deep reinforcement learning model on-line, update the actor network parameter θ_a and the evaluator network parameter θ_c with an adaptive optimization algorithm, apply the updated parameters to the whole SDN network, and transmit data with the SDN network after the parameter update;
where i = 1, 2, …, L, and L denotes the total number of local GPUs.
Further, the actor network in step S1 is a fully connected neural network, and the evaluator network in step S1 is a combination of a fully connected neural network and a CNN convolutional neural network. The inputs of both the actor network and the evaluator network comprise the network state of the SDN network, the network state comprising current node information, destination node information, bandwidth requirement and delay requirement; the input of the evaluator network further comprises network features of the SDN network processed by the CNN convolutional neural network. The CNN convolutional neural network comprises an input layer, a convolutional layer, a pooling layer, a fully connected layer and an output layer connected in sequence.
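As a rough illustration of this two-network architecture, the following Python/NumPy sketch (layer sizes and the feature-vector width are illustrative assumptions, not specified by the patent) shows an actor as a fully connected network emitting a probability distribution over next-hop actions, and an evaluator that appends a CNN-extracted topology feature vector to the same four-field state and emits a single value:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ActorNetwork:
    """Fully connected actor: maps the four-field network state
    (current node, destination node, bandwidth requirement, delay
    requirement) to a probability distribution over next-hop actions."""
    def __init__(self, state_dim=4, hidden=16, n_actions=8):
        self.w1 = rng.normal(0.0, 0.1, (state_dim, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, n_actions))

    def policy(self, state):
        h = np.tanh(state @ self.w1)      # weighted sum + activation
        return softmax(h @ self.w2)       # pi(a | s; theta_a)

class EvaluatorNetwork:
    """Evaluator: the same four-field state concatenated with a
    CNN-extracted topology feature vector; single scalar output
    V(s; theta_c)."""
    def __init__(self, state_dim=4, feat_dim=6, hidden=16):
        self.w1 = rng.normal(0.0, 0.1, (state_dim + feat_dim, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, 1))

    def value(self, state, topo_features):
        x = np.concatenate([state, topo_features])
        return (np.tanh(x @ self.w1) @ self.w2).item()

state = np.array([1.0, 5.0, 0.3, 0.2])   # current node, dest, bw, delay
probs = ActorNetwork().policy(state)      # distribution over next hops
v = EvaluatorNetwork().value(state, np.zeros(6))
```

The actor is multi-output (one probability per candidate next hop) while the evaluator is single-output, matching the description of the two networks below.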
Further, the reward function in step S1 yields a reward value r_n^m(s_n, a_n), which denotes the reward obtained after the n-th routing node in the SDN network, in state s_n, takes action a_n toward the m-th routing node. Here g denotes the action penalty, a_1 the first weight, a_2 the second weight, c(n) the remaining capacity of the n-th routing node, c(m) the remaining capacity of the m-th routing node, c(l) the remaining capacity of the l-th link in the SDN network, d(n) the traffic-load difference between the n-th routing node and its neighbouring nodes, and d(m) the traffic-load difference between the m-th routing node and its neighbouring nodes. The state s_n comprises: the node where the data packet currently resides (the n-th routing node), the final destination node of the data packet, the forwarding bandwidth requirement of the data packet, and the delay requirement of the data packet. The action a_n denotes all forwarding operations that may be taken in state s_n.
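The reward formula itself appears as an image in the original publication and is not recoverable here. The sketch below therefore only illustrates one plausible weighting consistent with the surrounding description: reward rising with the remaining capacities c(n), c(m), c(l), falling with the load-difference terms d(n), d(m), minus the action penalty g. The linear combination and default weights are assumptions, not the patent's exact formula.

```python
def reward(c_n, c_m, c_l, d_n, d_m, a1=0.5, a2=0.5, g=1.0):
    """Assumed shape: reward rises with remaining capacities c(n), c(m),
    c(l), falls with load-difference terms d(n), d(m), minus the fixed
    action penalty g. Weights a1, a2 and the linear combination are
    illustrative assumptions, not the patent's published formula."""
    return a1 * (c_n + c_m + c_l) - a2 * (d_n + d_m) - g
```

Under this shape, forwarding toward well-provisioned, evenly loaded nodes is rewarded, while every extra hop costs the penalty g, which is what the description's constraint on node/link load and hop count requires.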
Further, step S4 comprises the following sub-steps:
S41, set a first counter t = 0 and a second counter T = 0, and set the maximum iteration count T_max and the routing hop-count limit t_max;
S42, let dθ_a = 0 and dθ_c = 0, and synchronize the local parameters with the global parameters: set the local actor parameter θ'_a to the value of the actor network parameter θ_a, and the local evaluator parameter θ'_c to the value of the evaluator network parameter θ_c;
S43, let the first intermediate count value t_start = t, and read the state s_t at the current moment through the local GPU_i;
S44, obtain the policy π(a_t|s_t; θ'_a) through the actor network and execute action a_t according to the policy π(a_t|s_t; θ'_a), where π(a_t|s_t; θ'_a) denotes that a_t is the action to perform in state s_t under the local actor parameter θ'_a on the local GPU_i;
S45, obtain the reward value r_t and the new state s_{t+1} after executing action a_t, and increment the first counter t by one;
S46, judge whether the new state s_t satisfies the condition defined by the final state; if so, set the updated reward value R = 0 and go to step S48, otherwise go to step S47;
S47, judge whether t − t_start is greater than the routing hop-count limit t_max; if so, set the updated reward value R = V(s_t, θ'_c) and go to step S48, otherwise return to step S44, where V(s_t, θ'_c) denotes the evaluator network's evaluation of the routing policy for reaching state s_t under the local evaluator parameter θ'_c;
S48, set a third counter z = t − 1 and the gradient-update reward value R_update = r_z + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;
S49, according to the gradient-update reward value R_update, the local actor parameter θ'_a and the local evaluator parameter θ'_c, obtain the update values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c:
Δθ_a_update = Δθ_a + ∇_θ'_a log π(a_z|s_z; θ'_a) · (R_update − V(s_z; θ'_c))
Δθ_c_update = Δθ_c + ∂(R_update − V(s_z; θ'_c))² / ∂θ'_c
where Δθ_a_update denotes the update value of the gradient Δθ_a, ∇_θ'_a denotes the derivative with respect to the local actor parameter θ'_a, log π(a_z|s_z; θ'_a) denotes the logarithm of the probability of executing action a_z under parameter θ'_a in state s_z, r_z denotes the reward value after executing action a_z, γ denotes the reward discount rate, V(s_z; θ'_c) denotes the evaluator network's evaluation of the routing policy for reaching state s_z under the local evaluator parameter θ'_c, Δθ_c_update denotes the update value of the gradient Δθ_c, and ∂(R_update − V(s_z; θ'_c))²/∂θ'_c denotes the partial derivative of (R_update − V(s_z; θ'_c))² with respect to θ'_c;
S410, let Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and judge whether the third counter z equals the first intermediate count value t_start; if so, go to step S411, otherwise decrement the third counter z by one, update the gradient-update reward value R_update to r_z + γR, and return to step S49;
S411, judge whether the second counter T is greater than or equal to the maximum iteration count T_max; if so, update the actor network parameter θ_a and the evaluator network parameter θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively and end the update process; otherwise increment the second counter T by one and return to step S42.
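The reverse sweep of steps S48–S410 can be sketched as follows. The function starts from R = 0 at a terminal state (or R = V(s_t; θ'_c) when the hop limit was hit), walks the trajectory backwards with R ← r_z + γR, and records the advantage term R − V(s_z; θ'_c) that weights both gradient updates; the rewards and value estimates are supplied by the caller:

```python
def accumulate_advantages(rewards, values, gamma=0.9, bootstrap=0.0):
    """Reverse sweep of steps S48-S410: start from R = bootstrap (0 at a
    terminal state, V(s_t; theta'_c) when the hop limit was hit), walk the
    trajectory backwards with R <- r_z + gamma * R, and record the
    advantage R - V(s_z; theta'_c) for each step (latest step first)."""
    R = bootstrap
    advantages = []
    for r_z, v_z in zip(reversed(rewards), reversed(values)):
        R = r_z + gamma * R
        advantages.append(R - v_z)
    return advantages
```

Each returned advantage multiplies ∇ log π for the actor gradient and enters the squared value loss for the evaluator gradient, as in the S49 formulas.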
Further, the formula used in step S411 to update the actor network parameter θ_a and the evaluator network parameter θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c is:
θ_a_update = θ_a + βΔθ_a
θ_c_update = θ_c + βΔθ_c
where θ_a_update denotes the updated actor network parameter θ_a, θ_c_update denotes the updated evaluator network parameter θ_c, and β denotes the weight of the local GPU_i in the SDN network.
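A minimal sketch of this global update (β = 0.1 is an illustrative value; the patent leaves each GPU's weight unspecified):

```python
import numpy as np

def apply_local_gradient(theta, delta_theta, beta=0.1):
    """Step S411 global update: theta <- theta + beta * delta_theta,
    where beta weights the contribution of this local GPU_i (0.1 is an
    illustrative value)."""
    return theta + beta * delta_theta
```

Each local GPU_i contributes its accumulated gradient scaled by its own β, which is how the A3C workers' asynchronous updates combine into the global parameters.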
Further, step S7 comprises the following sub-steps:
S71, set a fourth counter j = 1 and collect a routing request task f;
S72, assign the routing request task f to an idle GPU in the SDN network, denoted GPU_idle;
S73, set dθ_a = 0 and dθ_c = 0, and synchronize the local actor parameter θ'_a of GPU_idle to the value of the actor network parameter θ_a and the local evaluator parameter θ'_c to the value of the evaluator network parameter θ_c;
S74, let the second intermediate count value j_start = j and read the initial state s_j at the current moment;
S75, obtain through the actor network the policy π(a_j|s_j; θ'_a) for executing action a_j in state s_j under the local actor parameter θ'_a, and execute the policy π(a_j|s_j; θ'_a);
S76, obtain the reward value r_j and the new state s_{j+1} after executing action a_j, increment the fourth counter j by one, and add action a_j to the action set A;
S77, judge whether the new state s_j reaches the condition defined by the final state of the routing request task f; if so, go to step S78, otherwise return to step S75;
S78, obtain the routing path p from the action set A and judge whether the routing request task f matches the routing path p; if so, set the updated reward value R = 0 and go to step S79, otherwise set the updated reward value R = V(s_j, θ'_c) and go to step S79;
S79, set a fifth counter k = j − 1 and the gradient-update reward value R_update = r_k + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;
S710, according to the gradient-update reward value R_update, the local actor parameter θ'_a and the local evaluator parameter θ'_c, obtain the update values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c:
Δθ_a_update = Δθ_a + ∇_θ'_a log π(a_k|s_k; θ'_a) · (R_update − V(s_k; θ'_c))
Δθ_c_update = Δθ_c + ∂(R_update − V(s_k; θ'_c))² / ∂θ'_c
where Δθ_a_update denotes the update value of the gradient Δθ_a, ∇_θ'_a denotes the derivative with respect to the local actor parameter θ'_a, log π(a_k|s_k; θ'_a) denotes the logarithm of the probability of executing action a_k under parameter θ'_a in state s_k, r_k denotes the reward value after executing action a_k, γ denotes the reward discount rate, V(s_k; θ'_c) denotes the evaluator network's evaluation of the routing policy for reaching state s_k under the local evaluator parameter θ'_c, Δθ_c_update denotes the update value of the gradient Δθ_c, and ∂(R_update − V(s_k; θ'_c))²/∂θ'_c denotes the partial derivative of (R_update − V(s_k; θ'_c))² with respect to θ'_c;
S711, let Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and judge whether the fifth counter k equals the second intermediate count value j_start; if so, go to step S712, otherwise decrement the fifth counter k by one, update the gradient-update reward value R_update to r_k + γR, and return to step S710;
S712, update the actor network parameter θ_a and the evaluator network parameter θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, apply the actor network parameter θ_a and the evaluator network parameter θ_c to the whole SDN network, and transmit data with the SDN network after the parameter update.
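Steps S76–S78 assemble the routing path p from the action set A and check it against the request f. A minimal sketch, assuming each action names the next-hop node (an encoding the patent does not fix) and that matching a request means starting at its source and ending at its destination:

```python
def path_from_actions(start_node, action_set):
    """Step S78 sketch: replay the forwarding actions collected in the
    action set A to obtain the routing path p. Each action is assumed
    to name the next-hop node (an illustrative encoding)."""
    path = [start_node]
    for next_hop in action_set:
        path.append(next_hop)
    return path

def matches_request(path, src, dst):
    """Accept path p for request f only if it starts at the request's
    source node and ends at its destination node (assumed criterion)."""
    return path[0] == src and path[-1] == dst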
The beneficial effects of the invention are:
(1) The invention computes routing paths quickly, maximizes throughput while guaranteeing delay, and overcomes the low speed and low throughput of traditional algorithms.
(2) The reinforcement learning algorithm reduces route calculation to a simple input-output mapping and avoids repeated iterations, so routing paths are computed quickly; the routing algorithm is faster, forwarding delay is reduced, packets that would previously have been dropped when their TTL expired are more likely to survive and be forwarded successfully, and network throughput increases.
(3) The invention has an off-line training stage and an on-line training stage, and updates its parameters in a dynamic environment to select the optimal path, giving it topology adaptivity.
(4) The invention sets a reward function so that node and link load, routing requirements and network topology information better constrain the reinforcement learning training process, allowing the trained deep reinforcement learning model to execute routing tasks more accurately.
Drawings
Fig. 1 is a flowchart of a distributed deep reinforcement learning SDN network intelligent routing data transmission method according to the present invention;
FIG. 2 is a schematic diagram of a CNN convolutional neural network according to the present invention;
FIG. 3 is a diagram of a deep reinforcement learning model according to the present invention.
Detailed Description
The following description of specific embodiments of the present invention is provided to help those skilled in the art understand the invention, but it should be clear that the invention is not limited to the scope of these specific embodiments. To those of ordinary skill in the art, various changes are apparent so long as they remain within the spirit and scope of the invention as defined and determined by the appended claims, and all inventions and creations making use of the inventive concept are protected.
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, a distributed deep reinforcement learning SDN network intelligent routing data transmission method comprises the following steps:
S1, construct a reward function and a deep reinforcement learning model comprising an actor network and an evaluator network, and deploy the deep reinforcement learning model in the application layer of the SDN network;
S2, randomly initialize the actor network parameter θ_a and the evaluator network parameter θ_c of the deep reinforcement learning model;
S3, randomly initialize the local actor parameter θ'_a of the actor network and the local evaluator parameter θ'_c of the evaluator network on the i-th local GPU_i in the control layer of the SDN network;
S4, according to the reward function, the actor network parameter θ_a, the evaluator network parameter θ_c, the local actor parameter θ'_a and the local evaluator parameter θ'_c, train the deep reinforcement learning model on the i-th local GPU_i off-line with the A3C algorithm, and update the actor network parameter θ_a and the evaluator network parameter θ_c;
S5, apply the updated actor network parameter θ_a and the updated evaluator network parameter θ_c to the whole SDN network, and transmit data with the SDN network after the parameter update;
S6, periodically detect whether the topology of the SDN network has changed; if so, go to step S7, otherwise repeat step S6;
S7, train the deep reinforcement learning model on-line, update the actor network parameter θ_a and the evaluator network parameter θ_c with an adaptive optimization algorithm, apply the updated parameters to the whole SDN network, and transmit data with the SDN network after the parameter update;
where i = 1, 2, …, L, and L denotes the total number of local GPUs.
The actor network in step S1 is a fully connected neural network, and the evaluator network in step S1 is a combination of a fully connected neural network and a CNN convolutional neural network. The inputs of both the actor network and the evaluator network comprise the network state of the SDN network, the network state comprising current node information, destination node information, bandwidth requirement and delay requirement; the input of the evaluator network further comprises network features of the SDN network processed by the CNN convolutional neural network.
As shown in fig. 2, the CNN convolutional neural network comprises an input layer, a convolutional layer, a pooling layer, a fully connected layer and an output layer connected in sequence.
In step S1, the reward function yields a reward value r_n^m(s_n, a_n), which denotes the reward obtained after the n-th routing node in the SDN network, in state s_n, takes action a_n toward the m-th routing node. Here g denotes the action penalty, a_1 the first weight, a_2 the second weight, c(n) the remaining capacity of the n-th routing node, c(m) the remaining capacity of the m-th routing node, c(l) the remaining capacity of the l-th link in the SDN network, d(n) the traffic-load difference between the n-th routing node and its neighbouring nodes, and d(m) the traffic-load difference between the m-th routing node and its neighbouring nodes. The state s_n comprises: the node where the data packet currently resides (the n-th routing node), the final destination node of the data packet, the forwarding bandwidth requirement of the data packet, and the delay requirement of the data packet. The action a_n denotes all forwarding operations that may be taken in state s_n.
Step S4 comprises the following sub-steps:
S41, set a first counter t = 0 and a second counter T = 0, and set the maximum iteration count T_max and the routing hop-count limit t_max;
S42, let dθ_a = 0 and dθ_c = 0, and synchronize the local parameters with the global parameters: set the local actor parameter θ'_a to the value of the actor network parameter θ_a, and the local evaluator parameter θ'_c to the value of the evaluator network parameter θ_c;
S43, let the first intermediate count value t_start = t, and read the state s_t at the current moment through the local GPU_i;
S44, obtain the policy π(a_t|s_t; θ'_a) through the actor network and execute action a_t according to the policy π(a_t|s_t; θ'_a), where π(a_t|s_t; θ'_a) denotes that a_t is the action to perform in state s_t under the local actor parameter θ'_a on the local GPU_i;
S45, obtain the reward value r_t and the new state s_{t+1} after executing action a_t, and increment the first counter t by one;
S46, judge whether the new state s_t satisfies the condition defined by the final state; if so, set the updated reward value R = 0 and go to step S48, otherwise go to step S47;
S47, judge whether t − t_start is greater than the routing hop-count limit t_max; if so, set the updated reward value R = V(s_t, θ'_c) and go to step S48, otherwise return to step S44, where V(s_t, θ'_c) denotes the evaluator network's evaluation of the routing policy for reaching state s_t under the local evaluator parameter θ'_c;
S48, set a third counter z = t − 1 and the gradient-update reward value R_update = r_z + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;
S49, according to the gradient-update reward value R_update, the local actor parameter θ'_a and the local evaluator parameter θ'_c, obtain the update values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c:
Δθ_a_update = Δθ_a + ∇_θ'_a log π(a_z|s_z; θ'_a) · (R_update − V(s_z; θ'_c))
Δθ_c_update = Δθ_c + ∂(R_update − V(s_z; θ'_c))² / ∂θ'_c
where Δθ_a_update denotes the update value of the gradient Δθ_a, ∇_θ'_a denotes the derivative with respect to the local actor parameter θ'_a, log π(a_z|s_z; θ'_a) denotes the logarithm of the probability of executing action a_z under parameter θ'_a in state s_z, r_z denotes the reward value after executing action a_z, γ denotes the reward discount rate, V(s_z; θ'_c) denotes the evaluator network's evaluation of the routing policy for reaching state s_z under the local evaluator parameter θ'_c, Δθ_c_update denotes the update value of the gradient Δθ_c, and ∂(R_update − V(s_z; θ'_c))²/∂θ'_c denotes the partial derivative of (R_update − V(s_z; θ'_c))² with respect to θ'_c;
S410, let Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and judge whether the third counter z equals the first intermediate count value t_start; if so, go to step S411, otherwise decrement the third counter z by one, update the gradient-update reward value R_update to r_z + γR, and return to step S49;
S411, judge whether the second counter T is greater than or equal to the maximum iteration count T_max; if so, update the actor network parameter θ_a and the evaluator network parameter θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively and end the update process; otherwise increment the second counter T by one and return to step S42.
The formula used in step S411 to update the actor network parameter θ_a and the evaluator network parameter θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c is:
θ_a_update = θ_a + βΔθ_a
θ_c_update = θ_c + βΔθ_c
where θ_a_update denotes the updated actor network parameter θ_a, θ_c_update denotes the updated evaluator network parameter θ_c, and β denotes the weight of the local GPU_i in the SDN network.
Step S7 comprises the following sub-steps:
S71, set a fourth counter j = 1 and collect a routing request task f;
S72, assign the routing request task f to an idle GPU in the SDN network, denoted GPU_idle;
S73, set dθ_a = 0 and dθ_c = 0, and synchronize the local actor parameter θ'_a of GPU_idle to the value of the actor network parameter θ_a and the local evaluator parameter θ'_c to the value of the evaluator network parameter θ_c;
S74, let the second intermediate count value j_start = j and read the initial state s_j at the current moment;
S75, obtain through the actor network the policy π(a_j|s_j; θ'_a) for executing action a_j in state s_j under the local actor parameter θ'_a, and execute the policy π(a_j|s_j; θ'_a);
S76, obtain the reward value r_j and the new state s_{j+1} after executing action a_j, increment the fourth counter j by one, and add action a_j to the action set A;
S77, judge whether the new state s_j reaches the condition defined by the final state of the routing request task f; if so, go to step S78, otherwise return to step S75;
S78, obtain the routing path p from the action set A and judge whether the routing request task f matches the routing path p; if so, set the updated reward value R = 0 and go to step S79, otherwise set the updated reward value R = V(s_j, θ'_c) and go to step S79;
S79, set a fifth counter k = j − 1 and the gradient-update reward value R_update = r_k + γR, and initialize the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;
S710, according to the gradient-update reward value R_update, the local actor parameter θ'_a and the local evaluator parameter θ'_c, obtain the update values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c:
Δθ_a_update = Δθ_a + ∇_θ'_a log π(a_k|s_k; θ'_a) · (R_update − V(s_k; θ'_c))
Δθ_c_update = Δθ_c + ∂(R_update − V(s_k; θ'_c))² / ∂θ'_c
where Δθ_a_update denotes the update value of the gradient Δθ_a, ∇_θ'_a denotes the derivative with respect to the local actor parameter θ'_a, log π(a_k|s_k; θ'_a) denotes the logarithm of the probability of executing action a_k under parameter θ'_a in state s_k, r_k denotes the reward value after executing action a_k, γ denotes the reward discount rate, V(s_k; θ'_c) denotes the evaluator network's evaluation of the routing policy for reaching state s_k under the local evaluator parameter θ'_c, Δθ_c_update denotes the update value of the gradient Δθ_c, and ∂(R_update − V(s_k; θ'_c))²/∂θ'_c denotes the partial derivative of (R_update − V(s_k; θ'_c))² with respect to θ'_c;
S711, let Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update and R = R_update, and judge whether the fifth counter k equals the second intermediate count value j_start; if so, go to step S712, otherwise decrement the fifth counter k by one, update the gradient-update reward value R_update to r_k + γR, and return to step S710;
S712, update the actor network parameter θ_a and the evaluator network parameter θ_c with the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, apply the actor network parameter θ_a and the evaluator network parameter θ_c to the whole SDN network, and transmit data with the SDN network after the parameter update.
As shown in fig. 3, in this embodiment the deep reinforcement learning model comprises actor–reviewer (critic) pairs constructed with neural networks (NN). The actor network outputs, for a given state, a probability distribution over all actions — the routing policy — and is therefore a multi-output neural network. The reviewer network evaluates the actor's policy using the temporal-difference error and is a single-output neural network. The actor network is a fully connected neural network: after data such as the current node, destination node information, bandwidth requirement and delay requirement are input, each neural network node computes a weighted sum followed by an activation function, and multiple results are output — the probabilities of the candidate routing choices, from which the actor gives the next action for the current state. The evaluator network takes the same four items of network information plus an additional network-feature input, and its output is a single evaluation of the actor network's policy. The extra network-feature input carries the change information of the network, so the real-time network state change is taken into account when the actor network's policy is evaluated, making the intelligent routing self-adaptive.
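The reviewer's temporal-difference evaluation mentioned above can be sketched as follows (the standard TD-error form; the patent describes it in words without giving the formula explicitly):

```python
def td_error(r, v_s, v_s_next, gamma=0.9):
    """Temporal-difference error delta = r + gamma * V(s') - V(s),
    with which the reviewer scores the actor's action: a positive delta
    means the action outperformed the reviewer's expectation."""
    return r + gamma * v_s_next - v_s
```

A positive δ pushes the actor toward the chosen action; a negative δ pushes it away, which is how the single-output reviewer steers the multi-output actor.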
Claims (6)
1. An SDN intelligent routing data transmission method based on distributed deep reinforcement learning, characterized by comprising the following steps:
S1, constructing a reward function and a deep reinforcement learning model comprising an actor network and an evaluator network, and deploying the deep reinforcement learning model in the application layer of the SDN network;
S2, randomly initializing the actor network parameter θ_a and the evaluator network parameter θ_c of the deep reinforcement learning model;
S3, randomly initializing the local actor parameter θ′_a of the actor network and the local evaluator parameter θ′_c of the evaluator network on the i-th local GPU_i in the control layer of the SDN network;
S4, according to the reward function, the actor network parameter θ_a, the evaluator network parameter θ_c, the local actor parameter θ′_a, and the local evaluator parameter θ′_c, performing offline training of the deep reinforcement learning model on the i-th local GPU_i using the A3C algorithm, and updating the actor network parameter θ_a and the evaluator network parameter θ_c;
S5, applying the updated actor network parameter θ_a and the updated evaluator network parameter θ_c to the whole SDN network, and transmitting data using the SDN network with the updated parameters;
s6, regularly detecting whether the topological structure of the SDN network changes, if so, entering the step S7, otherwise, repeating the step S6;
S7, performing online training of the deep reinforcement learning model, updating the actor network parameter θ_a and the evaluator network parameter θ_c using the self-adaptive operation algorithm, applying the updated actor network parameter θ_a and evaluator network parameter θ_c to the whole SDN network, and transmitting data using the SDN network with the updated parameters;
wherein i = 1, 2, …, L, and L represents the total number of local GPUs.
2. The SDN network smart routing data transmission method of claim 1, wherein the actor network in step S1 is a fully-connected neural network, and the evaluator network in step S1 is a combination network of the fully-connected neural network and a CNN convolutional neural network; the input of the actor network and the evaluator network comprise network states of the SDN network, the network states comprise current node information, destination node information, bandwidth requirements and delay requirements, and the input of the evaluator network further comprises network characteristics of the SDN network processed by the CNN convolutional neural network; the CNN convolutional neural network comprises an input layer, a convolutional layer, a pooling layer, a full-connection layer and an output layer which are sequentially connected.
3. The SDN network smart routing data transmission method of distributed deep reinforcement learning according to claim 1, wherein the reward function in step S1 is:
wherein the reward value denotes the reward obtained after the n-th routing node in the SDN network takes action a_n toward the m-th routing node in state s_n; g denotes an action penalty, a_1 denotes a first weight, a_2 denotes a second weight, c(n) denotes the remaining capacity of the n-th routing node, c(m) denotes the remaining capacity of the m-th routing node, c(l) denotes the remaining capacity of the l-th link in the SDN network, d(n) denotes the degree of difference in traffic load between the n-th routing node and its neighboring nodes, and d(m) denotes the degree of difference in traffic load between the m-th routing node and its neighboring nodes; the state s_n comprises: the node where the data packet currently resides (the n-th routing node), the final destination node of the data packet, the forwarding bandwidth requirement of the data packet, and the delay requirement of the data packet; the action a_n denotes any of the forwarding operations that may be taken in state s_n.
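The reward formula itself is not reproduced in this text. The sketch below merely wires together the quantities claim 3 names (action penalty g, weights a_1 and a_2, remaining capacities c(n), c(m), c(l), load-difference degrees d(n), d(m)); the way they are combined here is an assumption for illustration, not the patent's formula.

```python
def reward(g, a1, a2, c_n, c_m, c_l, d_n, d_m):
    """Illustrative reward shape (assumed, not the patent's formula):
    favor spare node/link capacity, penalize traffic-load imbalance
    between a node and its neighbors, and charge a flat action penalty g."""
    capacity_term = a1 * min(c_n, c_m, c_l)   # bottleneck remaining capacity
    balance_term = a2 * (d_n + d_m)           # load-difference degrees
    return capacity_term - balance_term - g

r = reward(g=0.1, a1=1.0, a2=0.5, c_n=8.0, c_m=6.0, c_l=4.0, d_n=0.2, d_m=0.4)
```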
4. The SDN network smart routing data transmission method of distributed deep reinforcement learning according to claim 1, wherein the step S4 includes the following sub-steps:
S41, setting the first counter t = 0 and the second counter T = 0, and setting the maximum iteration number T_max and the routing hop-count limit t_max;
S42, letting dθ_a = 0 and dθ_c = 0, and synchronizing the local parameters with the global parameters: the local actor parameter θ′_a is synchronized to the value of the actor network parameter θ_a, and the local evaluator parameter θ′_c is synchronized to the value of the evaluator network parameter θ_c;
S43, letting the first intermediate count value t_start = t, and reading the state s_t at the current moment through the local GPU_i;
S44, obtaining the policy π(a_t|s_t; θ′_a) through the actor network and performing action a_t according to the policy π(a_t|s_t; θ′_a), where π(a_t|s_t; θ′_a) denotes the action a_t to be performed in state s_t under the local actor parameter θ′_a of the local GPU_i;
S45, acquiring the reward value r_t and the new state s_{t+1} after performing action a_t, and increasing the count value of the first counter t by one;
S46, judging whether the new state s_t satisfies the condition defined by the final state; if yes, setting the updated reward value R = 0 and proceeding to step S48; otherwise, proceeding to step S47;
S47, judging whether t − t_start is greater than the routing hop-count limit t_max; if yes, setting the updated reward value R = V(s_t, θ′_c) and proceeding to step S48; otherwise, returning to step S44, where V(s_t, θ′_c) denotes the routing-policy evaluation value of the evaluator network for reaching state s_t under the local evaluator parameter θ′_c;
S48, setting the third counter z = t − 1 and the gradient-update reward value R_update = r_z + γR, and initializing the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;
S49, obtaining the update values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c according to the gradient-update reward value R_update, the local actor parameter θ′_a, and the local evaluator parameter θ′_c:

Δθ_a_update = Δθ_a + ∇_θ′_a log π(a_z|s_z; θ′_a)·(R_update − V(s_z; θ′_c))

Δθ_c_update = Δθ_c + ∂(R_update − V(s_z; θ′_c))²/∂θ′_c
wherein Δθ_a_update denotes the update value of the gradient Δθ_a, ∇_θ′_a denotes the gradient with respect to the local actor parameter θ′_a, log π(a_z|s_z; θ′_a) denotes the logarithm of the probability of performing action a_z under parameter θ′_a in state s_z, r_z denotes the reward value after performing action a_z, γ denotes the discount rate of the reward, V(s_z; θ′_c) denotes the routing-policy evaluation value of the evaluator network for reaching state s_z under the local evaluator parameter θ′_c, Δθ_c_update denotes the update value of the gradient Δθ_c, and ∂(R_update − V(s_z; θ′_c))²/∂θ′_c denotes the partial derivative of (R_update − V(s_z; θ′_c))² with respect to θ′_c;
S410, letting Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update, and R = R_update, and determining whether the third counter z is equal to the first intermediate count value t_start; if yes, going to step S411; otherwise, decreasing the count value of the third counter z by one, updating the gradient-update reward value R_update to r_z + γR, and returning to step S49;
S411, judging whether the second counter T is greater than or equal to the maximum iteration number T_max; if yes, updating the actor network parameter θ_a and the evaluator network parameter θ_c using the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, and ending the updating process; otherwise, increasing the count value of the second counter T by one and returning to step S42.
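Steps S46 to S410 amount to walking the rollout backwards, bootstrapping R_update = r_z + γR, and accumulating the A3C actor and evaluator gradients. A sketch under the assumption that the per-step gradients of log π and of V with respect to the local parameters have already been computed elsewhere:

```python
import numpy as np

def accumulate_gradients(rewards, values, grad_logps, grad_vs, R, gamma):
    """Walk the rollout backwards (z = t-1 down to t_start), updating
    R <- r_z + gamma * R, and accumulate:
      d_theta_a += grad(log pi(a_z|s_z)) * (R - V(s_z))        # policy gradient
      d_theta_c += d/dtheta'_c (R - V(s_z))^2 = 2*(V(s_z)-R)*grad V(s_z)
    rewards/values are per-step scalars; grad_logps/grad_vs are per-step
    gradient vectors (assumed precomputed)."""
    d_theta_a = np.zeros_like(grad_logps[0])
    d_theta_c = np.zeros_like(grad_vs[0])
    for z in reversed(range(len(rewards))):
        R = rewards[z] + gamma * R                       # R_update
        advantage = R - values[z]
        d_theta_a += grad_logps[z] * advantage
        d_theta_c += 2.0 * (values[z] - R) * grad_vs[z]
    return d_theta_a, d_theta_c

# Terminal rollout (R starts at 0 per step S46), two steps:
dA, dC = accumulate_gradients(
    rewards=[1.0, 1.0], values=[0.0, 0.0],
    grad_logps=[np.ones(3), np.ones(3)], grad_vs=[np.ones(3), np.ones(3)],
    R=0.0, gamma=0.9)
```

The accumulated `dA` and `dC` play the roles of Δθ_a and Δθ_c, which step S411 applies to the global parameters once the iteration limit is reached.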
5. The SDN network smart routing data transmission method of claim 4, wherein in step S411 the actor network parameter θ_a and the evaluator network parameter θ_c are updated using the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively by the formulas:
θ_a_update = θ_a + β·Δθ_a
θ_c_update = θ_c + β·Δθ_c
wherein θ_a_update denotes the updated actor network parameter θ_a, θ_c_update denotes the updated evaluator network parameter θ_c, and β denotes the weight of the local GPU_i in the SDN network.
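The two update formulas in claim 5 are a plain weighted gradient step. A minimal sketch; the array shapes and the β value are illustrative:

```python
import numpy as np

def global_update(theta_a, theta_c, d_theta_a, d_theta_c, beta):
    """theta <- theta + beta * delta_theta, where beta is the weight of
    the contributing local GPU_i in the SDN network."""
    return theta_a + beta * d_theta_a, theta_c + beta * d_theta_c

theta_a, theta_c = np.zeros(4), np.zeros(2)
theta_a, theta_c = global_update(theta_a, theta_c,
                                 d_theta_a=np.full(4, 2.0),
                                 d_theta_c=np.full(2, -1.0),
                                 beta=0.5)
```

Weighting by β lets asynchronous local GPUs contribute to the shared parameters in proportion to their role in the network, in the spirit of A3C's asynchronous updates.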
6. The SDN network smart routing data transmission method of distributed deep reinforcement learning according to claim 4, wherein the step S7 includes the following sub-steps:
s71, setting a fourth counter j to be 1, and collecting a routing request task f;
S72, distributing the routing request task f to an idle GPU in the SDN network, the idle GPU being denoted GPU_idle;
S73, setting dθ_a = 0 and dθ_c = 0, synchronizing the local actor parameter θ′_a of GPU_idle to the value of the actor network parameter θ_a, and synchronizing the local evaluator parameter θ′_c to the value of the evaluator network parameter θ_c;
S74, letting the second intermediate count value j_start = j, and reading the initial state s_j at the current time;
S75, obtaining through the actor network the policy π(a_j|s_j; θ′_a) for performing action a_j in state s_j under the local actor parameter θ′_a, and implementing the policy π(a_j|s_j; θ′_a);
S76, acquiring the reward value r_j and the new state s_{j+1} after performing action a_j, increasing the count value of the fourth counter j by one, and adding action a_j to the action set A;
S77, judging whether the new state s_j reaches the condition defined by the final state of the routing request task f; if yes, proceeding to step S78; otherwise, returning to step S75;
S78, obtaining the routing path p according to the action set A, and judging whether the routing request task f matches the routing path p; if yes, setting the updated reward value R = 0 and proceeding to step S79; otherwise, setting the updated reward value R = V(s_j, θ′_c) and proceeding to step S79;
S79, setting the fifth counter k = j − 1 and the gradient-update reward value R_update = r_k + γR, and initializing the gradient Δθ_a of the actor network parameters and the gradient Δθ_c of the evaluator network parameters to 0;
S710, obtaining the update values of the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c according to the gradient-update reward value R_update, the local actor parameter θ′_a, and the local evaluator parameter θ′_c:

Δθ_a_update = Δθ_a + ∇_θ′_a log π(a_k|s_k; θ′_a)·(R_update − V(s_k; θ′_c))

Δθ_c_update = Δθ_c + ∂(R_update − V(s_k; θ′_c))²/∂θ′_c
wherein Δθ_a_update denotes the update value of the gradient Δθ_a, ∇_θ′_a denotes the gradient with respect to the local actor parameter θ′_a, log π(a_k|s_k; θ′_a) denotes the logarithm of the probability of performing action a_k under parameter θ′_a in state s_k, r_k denotes the reward value after performing action a_k, γ denotes the discount rate of the reward, V(s_k; θ′_c) denotes the routing-policy evaluation value of the evaluator network for reaching state s_k under the local evaluator parameter θ′_c, Δθ_c_update denotes the update value of the gradient Δθ_c, and ∂(R_update − V(s_k; θ′_c))²/∂θ′_c denotes the partial derivative of (R_update − V(s_k; θ′_c))² with respect to θ′_c;
S711, letting Δθ_a = Δθ_a_update, Δθ_c = Δθ_c_update, and R = R_update, and determining whether the fifth counter k is equal to the second intermediate count value j_start; if yes, going to step S712; otherwise, decreasing the count value of the fifth counter k by one, updating the gradient-update reward value R_update to r_k + γR, and returning to step S710;
S712, updating the actor network parameter θ_a and the evaluator network parameter θ_c using the local actor parameter gradient Δθ_a and the local evaluator parameter gradient Δθ_c respectively, applying the actor network parameter θ_a and the evaluator network parameter θ_c to the whole SDN network, and transmitting data using the SDN network with the updated parameters.
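In the online phase, steps S75 to S78 collect the actor's actions into the set A, assemble the routing path p from them, and check p against the request f. A sketch; the interpretation of each action as a next-hop node and the matching rule (the path must terminate at the request's destination) are assumptions, since claim 6 does not spell them out:

```python
def path_from_actions(start_node, actions):
    """Assemble routing path p from the rollout's action set A, assuming
    each action is the next-hop node chosen by the actor network."""
    path = [start_node]
    for next_hop in actions:
        path.append(next_hop)
    return path

def task_matches_path(task, path):
    """Assumed matching rule: the path must end at the task's destination
    node; a real check might also verify bandwidth and delay budgets."""
    return path[-1] == task["dst"]

p = path_from_actions(1, [4, 6, 9])
ok = task_matches_path({"src": 1, "dst": 9}, p)
```

When the match fails, step S78 bootstraps the return from the evaluator's estimate V(s_j, θ′_c) instead of zero, so the online update still has a learning signal.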
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010673851.8A CN111917642B (en) | 2020-07-14 | 2020-07-14 | SDN intelligent routing data transmission method for distributed deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010673851.8A CN111917642B (en) | 2020-07-14 | 2020-07-14 | SDN intelligent routing data transmission method for distributed deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111917642A true CN111917642A (en) | 2020-11-10 |
CN111917642B CN111917642B (en) | 2021-04-27 |
Family
ID=73280083
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010673851.8A Expired - Fee Related CN111917642B (en) | 2020-07-14 | 2020-07-14 | SDN intelligent routing data transmission method for distributed deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111917642B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818788A (en) * | 2021-01-25 | 2021-05-18 | 电子科技大学 | Distributed convolutional neural network hierarchical matching method based on unmanned aerial vehicle cluster |
CN113316216A (en) * | 2021-05-26 | 2021-08-27 | 电子科技大学 | Routing method for micro-nano satellite network |
CN113537628A (en) * | 2021-08-04 | 2021-10-22 | 郭宏亮 | General reliable shortest path algorithm based on distributed reinforcement learning |
CN114051272A (en) * | 2021-10-30 | 2022-02-15 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Intelligent routing method for dynamic topological network |
Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150269479A1 (en) * | 2014-03-24 | 2015-09-24 | Qualcomm Incorporated | Conversion of neuron types to hardware |
CN106873585A (en) * | 2017-01-18 | 2017-06-20 | 无锡辰星机器人科技有限公司 | A navigation path-searching method, robot, and system |
CN108600104A (en) * | 2018-04-28 | 2018-09-28 | 电子科技大学 | An SDN Internet-of-Things traffic aggregation method based on tree-shaped routing |
CN108803615A (en) * | 2018-07-03 | 2018-11-13 | 东南大学 | A virtual human navigation algorithm for unknown environments based on deep reinforcement learning |
US20190014488A1 (en) * | 2017-07-06 | 2019-01-10 | Futurewei Technologies, Inc. | System and method for deep learning and wireless network optimization using deep learning |
CN109343341A (en) * | 2018-11-21 | 2019-02-15 | 北京航天自动控制研究所 | An intelligent control method for vertical recovery of a carrier rocket based on deep reinforcement learning |
CN109803344A (en) * | 2018-12-28 | 2019-05-24 | 北京邮电大学 | A joint mapping method for unmanned aerial vehicle network topology and routing |
US10396919B1 (en) * | 2017-05-12 | 2019-08-27 | Virginia Tech Intellectual Properties, Inc. | Processing of communications signals using machine learning |
CN110472880A (en) * | 2019-08-20 | 2019-11-19 | 李峰 | Method, apparatus and storage medium for evaluating collaborative problem-solving ability |
CN110472738A (en) * | 2019-08-16 | 2019-11-19 | 北京理工大学 | A real-time obstacle avoidance algorithm for unmanned surface vehicles based on deep reinforcement learning |
US20190354859A1 (en) * | 2018-05-18 | 2019-11-21 | Deepmind Technologies Limited | Meta-gradient updates for training return functions for reinforcement learning systems |
CN110515303A (en) * | 2019-09-17 | 2019-11-29 | 余姚市浙江大学机器人研究中心 | An adaptive dynamic path planning method based on DDQN |
CN110611619A (en) * | 2019-09-12 | 2019-12-24 | 西安电子科技大学 | Intelligent routing decision method based on DDPG reinforcement learning algorithm |
CN110770761A (en) * | 2017-07-06 | 2020-02-07 | 华为技术有限公司 | Deep learning system and method and wireless network optimization using deep learning |
CN111010294A (en) * | 2019-11-28 | 2020-04-14 | 国网甘肃省电力公司电力科学研究院 | Electric power communication network routing method based on deep reinforcement learning |
US20200139973A1 (en) * | 2018-11-01 | 2020-05-07 | GM Global Technology Operations LLC | Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle |
CN111316295A (en) * | 2017-10-27 | 2020-06-19 | 渊慧科技有限公司 | Reinforcement learning using distributed prioritized replay |
-
2020
- 2020-07-14 CN CN202010673851.8A patent/CN111917642B/en not_active Expired - Fee Related
Patent Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150269479A1 (en) * | 2014-03-24 | 2015-09-24 | Qualcomm Incorporated | Conversion of neuron types to hardware |
CN106873585A (en) * | 2017-01-18 | 2017-06-20 | 无锡辰星机器人科技有限公司 | A navigation path-searching method, robot, and system |
US10396919B1 (en) * | 2017-05-12 | 2019-08-27 | Virginia Tech Intellectual Properties, Inc. | Processing of communications signals using machine learning |
US20190014488A1 (en) * | 2017-07-06 | 2019-01-10 | Futurewei Technologies, Inc. | System and method for deep learning and wireless network optimization using deep learning |
CN110770761A (en) * | 2017-07-06 | 2020-02-07 | 华为技术有限公司 | Deep learning system and method and wireless network optimization using deep learning |
CN111316295A (en) * | 2017-10-27 | 2020-06-19 | 渊慧科技有限公司 | Reinforcement learning using distributed prioritized replay |
CN108600104A (en) * | 2018-04-28 | 2018-09-28 | 电子科技大学 | An SDN Internet-of-Things traffic aggregation method based on tree-shaped routing |
US20190354859A1 (en) * | 2018-05-18 | 2019-11-21 | Deepmind Technologies Limited | Meta-gradient updates for training return functions for reinforcement learning systems |
CN108803615A (en) * | 2018-07-03 | 2018-11-13 | 东南大学 | A virtual human navigation algorithm for unknown environments based on deep reinforcement learning |
US20200139973A1 (en) * | 2018-11-01 | 2020-05-07 | GM Global Technology Operations LLC | Spatial and temporal attention-based deep reinforcement learning of hierarchical lane-change policies for controlling an autonomous vehicle |
CN109343341A (en) * | 2018-11-21 | 2019-02-15 | 北京航天自动控制研究所 | An intelligent control method for vertical recovery of a carrier rocket based on deep reinforcement learning |
CN109803344A (en) * | 2018-12-28 | 2019-05-24 | 北京邮电大学 | A joint mapping method for unmanned aerial vehicle network topology and routing |
CN110472738A (en) * | 2019-08-16 | 2019-11-19 | 北京理工大学 | A real-time obstacle avoidance algorithm for unmanned surface vehicles based on deep reinforcement learning |
CN110472880A (en) * | 2019-08-20 | 2019-11-19 | 李峰 | Method, apparatus and storage medium for evaluating collaborative problem-solving ability |
CN110611619A (en) * | 2019-09-12 | 2019-12-24 | 西安电子科技大学 | Intelligent routing decision method based on DDPG reinforcement learning algorithm |
CN110515303A (en) * | 2019-09-17 | 2019-11-29 | 余姚市浙江大学机器人研究中心 | An adaptive dynamic path planning method based on DDQN |
CN111010294A (en) * | 2019-11-28 | 2020-04-14 | 国网甘肃省电力公司电力科学研究院 | Electric power communication network routing method based on deep reinforcement learning |
Non-Patent Citations (3)
Title |
---|
LINGXIN ZHANG: "Multi-task Deep Reinforcement Learning for Scalable Parallel Task Scheduling", 《2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA)》 * |
LAN JULONG: "A routing optimization mechanism for software-defined networks based on deep reinforcement learning", 《Journal of Electronics &amp; Information Technology》 *
ZHANG XIAONING: "Research on a new two-layer mapping system in identifier/locator split networks", 《Journal of Electronics &amp; Information Technology》 *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818788A (en) * | 2021-01-25 | 2021-05-18 | 电子科技大学 | Distributed convolutional neural network hierarchical matching method based on unmanned aerial vehicle cluster |
CN112818788B (en) * | 2021-01-25 | 2022-05-03 | 电子科技大学 | Distributed convolutional neural network hierarchical matching method based on unmanned aerial vehicle cluster |
CN113316216A (en) * | 2021-05-26 | 2021-08-27 | 电子科技大学 | Routing method for micro-nano satellite network |
CN113316216B (en) * | 2021-05-26 | 2022-04-08 | 电子科技大学 | Routing method for micro-nano satellite network |
CN113537628A (en) * | 2021-08-04 | 2021-10-22 | 郭宏亮 | General reliable shortest path algorithm based on distributed reinforcement learning |
CN113537628B (en) * | 2021-08-04 | 2023-08-22 | 郭宏亮 | Universal reliable shortest path method based on distributed reinforcement learning |
CN114051272A (en) * | 2021-10-30 | 2022-02-15 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Intelligent routing method for dynamic topological network |
Also Published As
Publication number | Publication date |
---|---|
CN111917642B (en) | 2021-04-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111917642B (en) | SDN intelligent routing data transmission method for distributed deep reinforcement learning | |
CN110611619B (en) | Intelligent routing decision method based on DDPG reinforcement learning algorithm | |
CN112437020B (en) | Data center network load balancing method based on deep reinforcement learning | |
CN114697229B (en) | Construction method and application of distributed routing planning model | |
CN111988225A (en) | Multi-path routing method based on reinforcement learning and transfer learning | |
CN103971160B (en) | particle swarm optimization method based on complex network | |
WO2020172825A1 (en) | Method and apparatus for determining transmission policy | |
CN116527567B (en) | Intelligent network path optimization method and system based on deep reinforcement learning | |
CN113570039B (en) | Block chain system based on reinforcement learning optimization consensus | |
CN113395207B (en) | Deep reinforcement learning-based route optimization framework and method under SDN framework | |
CN112486690A (en) | Edge computing resource allocation method suitable for industrial Internet of things | |
CN113784410B (en) | Heterogeneous wireless network vertical switching method based on reinforcement learning TD3 algorithm | |
CN110328668B (en) | Mechanical arm path planning method based on speed smooth deterministic strategy gradient | |
CN114415735B (en) | Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method | |
CN108111335A (en) | A kind of method and system dispatched and link virtual network function | |
CN113938415B (en) | Network route forwarding method and system based on link state estimation | |
CN113821041A (en) | Multi-robot collaborative navigation and obstacle avoidance method | |
CN117041129A (en) | Low-orbit satellite network flow routing method based on multi-agent reinforcement learning | |
CN111340192B (en) | Network path allocation model training method, path allocation method and device | |
Fuji et al. | Deep multi-agent reinforcement learning using dnn-weight evolution to optimize supply chain performance | |
CN114205251B (en) | Switch link resource prediction method based on space-time characteristics | |
CN115714741A (en) | Routing decision method and system based on collaborative multi-agent reinforcement learning | |
CN115225561A (en) | Route optimization method and system based on graph structure characteristics | |
CN117014355A (en) | TSSDN dynamic route decision method based on DDPG deep reinforcement learning algorithm | |
CN115150335A (en) | Optimal flow segmentation method and system based on deep reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20210427 |