CN108900419B - Routing decision method and device based on deep reinforcement learning under SDN framework - Google Patents

Routing decision method and device based on deep reinforcement learning under SDN framework

Info

Publication number
CN108900419B
CN108900419B (application CN201810945527.XA)
Authority
CN
China
Prior art keywords
sample
flow
network
priority
dqn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810945527.XA
Other languages
Chinese (zh)
Other versions
CN108900419A (en
Inventor
潘恬
黄韬
杨冉
张娇
刘江
谢人超
杨帆
刘韵洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fenomen array (Beijing) Technology Co.,Ltd.
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201810945527.XA priority Critical patent/CN108900419B/en
Publication of CN108900419A publication Critical patent/CN108900419A/en
Application granted granted Critical
Publication of CN108900419B publication Critical patent/CN108900419B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/302Route determination based on requested QoS
    • H04L45/306Route determination based on the nature of the carried application
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/02Topology update or discovery
    • H04L45/08Learning-based routing, e.g. using neural networks or artificial intelligence

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiment of the invention provides a routing decision method and a device based on deep reinforcement learning under an SDN framework, wherein the method is applied to an SDN controller and comprises the following steps: acquiring real-time flow information in a network; determining a priority for each of the flows; and inputting the real-time traffic information into a pre-trained deep Q network DQN, and sequentially determining the route of each flow according to the priority order of each flow. The embodiment of the invention can realize the load balance of the network in the networks with various topological structures, reduce the occurrence of network congestion and realize the optimization of the routing strategy in the network environment with highly dynamic change of network flow.

Description

Routing decision method and device based on deep reinforcement learning under SDN framework
Technical Field
The invention relates to the technical field of communication, in particular to a routing decision method and a device based on deep reinforcement learning under an SDN framework.
Background
Congestion avoidance and route optimization have long been important research topics for traffic engineering in modern communication networks. With the rapid increase of the number of users and the scale of networks, the network structure becomes more complex, and the network congestion and route optimization face more and more challenges.
Highly dynamic traffic and unevenly distributed traffic density in the network are the main causes of network congestion. To relieve network congestion, a common solution is to split traffic that may cause congestion across multiple paths, preventing the load of that traffic from becoming excessively concentrated. Among such solutions, Equal-Cost Multi-Path (ECMP) routing is a commonly used network load-balancing technique. The basic principle of ECMP is as follows: when there are multiple different links between a source address and a destination address in the network, a network protocol supporting ECMP can use these multiple equivalent links simultaneously for data transmission between the source and destination addresses.
However, ECMP techniques simply distribute traffic evenly to equivalent links without regard to the distribution of traffic in the network, which results in poor performance in networks with asymmetric topologies and traffic. In a network with an asymmetric topology, the traffic distribution is asymmetric, and the more unbalanced the traffic distribution, the more difficult it is to reduce or avoid the occurrence of network congestion by ECMP techniques. And because it is difficult to reduce or avoid the occurrence of network congestion, routing strategies based on the ECMP technique cannot be optimized in a network environment where network traffic is highly dynamically changing.
Disclosure of Invention
Embodiments of the present invention provide a routing decision method and apparatus based on deep reinforcement learning in an SDN architecture, so as to reduce occurrence of network congestion in networks with various topology structures and implement optimization of a routing policy in a network environment with highly dynamic changes in network traffic. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a routing decision method based on deep reinforcement learning in an SDN architecture, which is applied to an SDN controller, and the method includes:
acquiring real-time flow information in a network; wherein the real-time traffic information includes: a link bandwidth occupied by each flow in the network;
determining a priority for each of the flows;
inputting the real-time traffic information into a pre-trained deep Q network DQN, and sequentially determining the route of each flow according to the priority order of each flow;
the DQN is obtained by training according to sample traffic information and a sample routing strategy corresponding to the sample traffic information; the sample traffic information includes: the link bandwidth occupied by each sample flow, the sample routing policy includes: the route of each sample flow corresponding to the sample traffic information.
In a second aspect, an embodiment of the present invention provides a deep reinforcement learning-based routing decision apparatus under an SDN architecture, which is applied to an SDN controller, and the apparatus includes:
the first acquisition module is used for acquiring real-time flow information in a network; wherein the real-time traffic information includes: a link bandwidth occupied by each flow in the network;
a first determining module for determining a priority of each of the streams;
a second determining module, configured to input the real-time traffic information into a pre-trained deep Q network DQN, and sequentially determine a route of each flow according to a priority order of each flow;
the DQN is obtained by training according to sample traffic information and a sample routing strategy corresponding to the sample traffic information; the sample traffic information includes: the link bandwidth occupied by each sample flow, the sample routing policy includes: the route of each sample flow corresponding to the sample traffic information.
In a third aspect, an embodiment of the present invention provides an SDN controller, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
the memory is used for storing a computer program;
the processor is configured to, when executing the program stored in the memory, implement the method steps of the deep reinforcement learning-based routing decision under the SDN architecture according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, having stored therein instructions, which, when executed on a computer, cause the computer to perform the method steps of deep reinforcement learning based routing decision under the SDN architecture as described in the first aspect.
In a fifth aspect, an embodiment of the present invention provides a computer program product containing instructions, which when run on a computer, causes the computer to perform the method steps of deep reinforcement learning based routing decision under the SDN architecture as described in the first aspect above.
In the embodiment of the invention, the DQN is trained in advance according to sample traffic information and a sample routing strategy corresponding to the sample traffic information. Then, when the route of each flow in the network is to be determined, the real-time traffic information in the network is obtained and input into the trained DQN, so that the DQN determines the route of each flow in sequence according to the priority of each flow in the network. Because the embodiment of the invention determines routes based on the pre-trained DQN, and the DQN can be trained on sample data of a network with the topological structure to be analyzed, the embodiment of the invention can reduce the occurrence of network congestion in networks with various topological structures and realize the optimization of the routing strategy in a network environment where network traffic changes highly dynamically.
Of course, it is not necessary for any product or method of practicing the invention to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a flowchart of a routing decision method based on deep reinforcement learning in an SDN architecture according to an embodiment of the present invention;
fig. 2 is another flowchart of a deep reinforcement learning-based route decision method under an SDN architecture according to an embodiment of the present invention;
fig. 3 is another flowchart of a deep reinforcement learning-based route decision method under an SDN architecture according to an embodiment of the present invention;
fig. 4 is a structural diagram of a deep reinforcement learning-based route decision device in an SDN architecture according to an embodiment of the present invention;
fig. 5 is another structural diagram of a deep reinforcement learning-based route decision apparatus under an SDN architecture according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an SDN controller according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For ease of understanding, SDN (Software Defined Network), DRL (Deep Reinforcement Learning), and DQN (Deep Q Network) will be briefly described below.
SDN is a new network architecture. Unlike the traditional network architecture, SDN proposes the idea of separating the data plane and the control plane of the network. The communication between the data plane and the control plane of the network can be realized by an open protocol, namely the Openflow protocol. Based on the Openflow protocol, an Openflow switch in the data plane can forward ordinary data traffic and can upload the acquired real-time traffic information of the network to an SDN controller in the control plane of the network. The SDN controller may collect and summarize traffic information uploaded by Openflow switches in the network area managed by the SDN controller, and formulate a corresponding routing policy and forwarding mechanism according to the collected network traffic information. Compared with the traditional network architecture, the SDN architecture has many advantages: Network Function Virtualization (NFV) can be realized based on the SDN architecture, decoupling software from hardware and abstracting network functions, so that network equipment functions no longer depend on special hardware and resources are shared flexibly. With an SDN controller, global routing control of the network can be fully achieved. This means that the routing strategy and traffic distribution of flows in the network can be controlled from an overall point of view to solve congestion problems in the network due to uneven traffic density distribution.
DRL is a novel machine learning method, a Reinforcement Learning (RL) method combined with Deep Neural Networks (DNNs), proposed by DeepMind. If DRL is to be applied to a control problem in a given scenario, the control problem must satisfy the following conditions: (1) an environment with explicit rules and definitions; (2) a system capable of providing accurate and timely feedback; (3) a reward function for defining the task objective. The flow control and routing decision problem of a network meets the above conditions, i.e. it is feasible to use DRL to implement flow control and routing decisions for a network. Specifically, a task processed by RL can generally be described using a Markov Decision Process (MDP): in a certain environment E there is a state space S, in which any state represents the current environment as perceived by the agent, and an action space A, in which any action is selectable in each state. A state transition occurs when the agent performs an action in a state using a policy π(s). After the state transition, environment E gives the agent a reward based on the state transition. When the agent starts from an initial state and performs a series of actions using policy π(s), i.e. a series of state transitions, the agent receives a cumulative reward Q^π(s, a). The goal of RL is to find an optimal policy π*(s) that maximizes the cumulative reward earned by the agent.
The optimal policy can be found by Q-learning when RL processes a task. However, when the state space is too large, solving the cumulative reward Q^π(s, a) in each state through Q-learning becomes very difficult. To address this problem, the cumulative reward Q^π(s, a) can be approximated using a DNN. This method of combining a DNN with Q-learning is called DQN.
The main structure of DQN is a neural network called the Q network. The Q network takes the state s as input and outputs the Q^π(s, a) value of each selectable action a in state s. Because the output Q^π(s, a) is an approximation produced by the Q network, the parameter θ of the Q network needs to be trained so that the approximated Q^π(s, a) becomes more precise. Specifically, during training the value of a loss function is calculated, and when this value does not meet the set condition, the parameter θ of the Q network is updated by the common back-propagation and gradient-descent methods. After the Q network has been sufficiently trained and updated, the Q^π(s, a) output by the Q network approaches the optimal cumulative reward Q*(s, a), and the current policy π(s) approaches the optimal policy π*(s).
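The relationship between the Q network, its output Q^π(s, a) values and the parameter θ can be illustrated with a short sketch. The patent text does not name a framework; the use of PyTorch below, together with the layer sizes and variable names, is an assumption made only for illustration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q network sketch: takes a state vector s and outputs one Q^pi(s, a) value per selectable action.

    The description only requires a state input layer, at least one hidden layer and an
    action output layer; the two hidden layers of 64 units here are assumptions.
    """
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one Q value per candidate action in state s
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```

Training then consists of repeatedly evaluating a loss on these Q values and updating θ by back-propagation and gradient descent, as detailed in the training procedure later in this description.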
In a network, when the load of transmitted data traffic on a link or a device exceeds the link bandwidth or the maximum processing capacity of the device, the transmission delay on the link or the device increases, the throughput decreases, and packet loss occurs in the transmitted data, which degrades the network transmission performance; this is called network congestion. Generally, congestion in a network is caused by overload of links or devices. When the total data transmission load in the network exceeds the upper limit of the load that the network can accommodate, network congestion is difficult to avoid, and in such a case it can only be avoided by upgrading network hardware or adding additional devices. In many cases, however, network congestion can occur even if the total data transmission load in the network is far from the upper load limit that the network can accommodate. Such network congestion is mostly caused by the maldistribution of data traffic in the network: because a basic shortest-path algorithm is often used in conventional routing protocols, a large amount of traffic load tends to concentrate on links or devices at critical positions in the network, while the load on other devices or links at the edge of the network is very small and the utilization rate of network resources is very low. Network congestion caused by such situations can be reduced or avoided by optimizing the routing policy of the network.
It is apparent that the optimal routing strategy in a network is determined based on the network topology and real-time network traffic information. When the data traffic in the network changes, the optimal routing policy of the network also needs to change. This requires that real-time traffic information of the network must be mastered in order to optimize the routing policy of the network based on the real-time traffic information.
In order to reduce the occurrence of network congestion in networks with various topological structures and optimize a routing strategy in a network environment with highly dynamic network traffic change, embodiments of the present invention provide a routing decision method and device based on deep reinforcement learning in an SDN architecture.
In the scheme of the invention, under the SDN network architecture, the collection and aggregation of real-time traffic information of the network can be realized through an SDN controller located on the control plane of the SDN network architecture. After the SDN controller collects the real-time traffic information of the whole network, the current optimal routing strategy of the network can be determined by a DRL method according to the real-time traffic information of the network. Further, the current optimal routing policy of the network may be determined based on the DQN in DRL.
First, a routing decision method based on deep reinforcement learning in an SDN architecture provided by an embodiment of the present invention is described below.
As shown in fig. 1, a routing decision method based on deep reinforcement learning in an SDN architecture according to an embodiment of the present invention is applied to an SDN controller, and the method may include the following steps:
s101, acquiring real-time flow information in a network; wherein, the real-time traffic information includes: link bandwidth occupied by each flow in the network.
The method provided by the embodiment of the invention can be applied to an SDN controller. The SDN controller is a controller of a control plane in an SDN network architecture, and may collect real-time traffic information of the network, which is sent by an Openflow switch of a data plane in the SDN network architecture, and formulate a corresponding routing policy and forwarding mechanism based on the real-time traffic information. Thus, the network may be a communication network having an SDN network architecture.
To facilitate understanding of the network, flow and congestion problems in this embodiment, a model of the network, a definition of flows in the network, a routing manner of the flows, and a congestion problem of the network in this embodiment will be described below.
First, the model of the network is: a communication network having a plurality of communication nodes and m physical links. Each communication node corresponds to an Openflow switch of the data plane in the SDN network architecture. All communication nodes can be divided into two categories: source nodes and forwarding nodes. A source node is a node in the network that generates and eventually receives data packets; all data packets are generated by source nodes and eventually arrive at source nodes. In this embodiment, the number of source nodes in the network is set to n, and the source nodes are denoted s_1, s_2, s_3, ..., s_n. The forwarding nodes are nodes in the network that are responsible for forwarding data packets; they do not generate data packets and only forward, based on the flow table, the data packets transmitted by other nodes.
The flows in the network refer to: all data packets which are sent from the same source node and finally reach the same destination node in the network are classified into one class, and these data packets together form a flow. The source node and the destination node of any flow cannot be the same node, and the source node of any flow here refers to the start node of this flow. Based on this, it can be concluded that a network with n source nodes contains at most N = n^2 - n flows. To quantitatively describe the traffic demand of each flow in the network, define: for a flow with node s_i as its source node and node s_j as its destination node, the link bandwidth occupied by the normal transmission of the flow in the links is f_{i,j}. For each flow f_{i,j}, x alternative routes can be determined between its source node and destination node. Each alternative route specifies all the links traversed by flow f_{i,j} from source node s_i to destination node s_j. In this embodiment, the way of making a routing decision on the network is as follows: for each flow in the network, one of its alternative routes is selected as the actual route of the flow (referred to as the route of the flow). Accordingly, the routing of a flow means that the flow transmits its data packets through all the links specified by its actual route. In this embodiment, a flow is the minimum unit controlled in a routing decision; making routing decisions at the granularity of flows is easy to implement for an SDN controller that controls packet forwarding using flow tables.
Finally, the congestion problem of the network refers to the following. For the m physical links in the network, two parameters are specified for each physical link: the maximum available bandwidth threshold of the link, t_1, t_2, t_3, ..., t_m, and the real-time link load value of the link, l_1, l_2, l_3, ..., l_m. The bandwidth threshold of a link and its real-time link load value use the same measurement unit. In the definition of a flow above, the link bandwidth occupied by the normal transmission of the flow in a link is denoted f_{i,j}. Then, for a link k, if several flows are routed through link k in the current state, for example three flows f_{1,2}, f_{1,3}, f_{1,4} are routed through link k, the real-time link load value l_k on link k is defined to be equal to the sum of the link bandwidths occupied by the normal transmission of these flows in the link, i.e. the real-time traffic load of link k is l_k = f_{1,2} + f_{1,3} + f_{1,4}. If l_k exceeds the maximum available bandwidth threshold t_k of link k, congestion is considered to occur on link k at this time; the amount by which l_k exceeds the bandwidth threshold t_k corresponds to the severity of the congestion occurring on link k: the higher the severity of the congestion, the lower the throughput of link k and the higher the delay of the flows passing through link k, i.e. the higher the transmission delay of the three flows f_{1,2}, f_{1,3}, f_{1,4}. If l_k does not exceed the maximum available bandwidth threshold t_k of link k, it is considered that no congestion occurs on link k at this time, the throughput of link k increases linearly with l_k, and the flows passing through link k can be transmitted within an acceptable delay.
Based on the above descriptions of the model of the network, the definition of the flows in the network, the routing manner of the flows, and the congestion problem of the network, the routing decision problem of the network aims to: in a network with n source nodes and m physical links, under the condition of obtaining the link bandwidth occupied by each flow in the network at a certain moment, the most appropriate route is selected for each flow, so that the load balance state of the network is optimal, and the probability of congestion in the network is minimum. It can be understood that if a certain flow does not exist at a certain time, the occupied link bandwidth is 0.
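The link-load and congestion definitions above translate directly into code. The sketch below is a minimal illustration under the stated model; the function and variable names are hypothetical.

```python
from collections import defaultdict

def link_loads(routes, flow_bandwidth):
    """Real-time link load l_k: sum of the bandwidths f_{i,j} of all flows routed over link k.

    routes:         {(i, j): [link, link, ...]}  chosen route of each flow, as a list of links
    flow_bandwidth: {(i, j): f_ij}               link bandwidth occupied by each flow
    """
    loads = defaultdict(float)
    for flow, links in routes.items():
        for link in links:
            loads[link] += flow_bandwidth.get(flow, 0.0)  # a flow that does not exist occupies 0
    return loads

def congested_links(loads, thresholds):
    """A link k is considered congested when its load l_k exceeds its maximum available bandwidth t_k."""
    return [link for link, load in loads.items() if load > thresholds[link]]
```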
In this embodiment, the SDN controller may acquire real-time traffic information in a network by collecting and summarizing the real-time traffic information of the network sent by all Openflow switches located on a data plane in the SDN network architecture. This process can be implemented by the prior art, and the present invention will not be described in detail.
In actual use, the SDN controller may periodically acquire real-time traffic information in the network at certain time intervals. Once the SDN controller acquires the real-time traffic information in the network, it may perform a routing decision for the acquired real-time traffic information. This shows that when the traffic information in the network changes in this embodiment, the routing policy of the network is adjusted accordingly. Therefore, when the traffic in the network is highly dynamically changed, the SDN controller acquires the real-time traffic information in the network in real time and adjusts the routing policy, so that the routing policy of the network can be always kept optimal.
The time interval may be determined according to the specific situation of the network. Specifically, the determination may be made according to the degree of traffic change of the network. If the traffic of the network changes faster, the time interval may be set to a smaller value; if the traffic of the network does not change fast, the time interval may be set to a larger value.
S102, determining the priority of each flow.
Through a large number of simulation experiments, the inventors observed that, when making routing decisions for a network with an asymmetric topology, the order in which all flows in the network are routed has a significant effect on the processing speed and effect of the DQN (the DQN mentioned in step S102 means: the trained DQN); in some cases the processing speed of the DQN is significantly increased, while in other cases the processing speed of the DQN is very slow or the DQN is even difficult to converge. Comparison of multiple sets of simulation experiments shows that the processing speed and effect of the DQN are related to the alternative routes of the different flows. An "ideal" route among the alternative routes of a flow is one whose path carries little load from other flows, i.e. the path traversed by the route is not easily congested. When such an "ideal" route exists among the alternative routes of a flow, routing that flow first enables the DQN to optimize the routing decision at a faster processing speed; if, instead, flows whose alternative routes include no "ideal" route are routed first, the processing time of the DQN is longer and the processing effect is not ideal. This is because, when one of the alternative routes of a flow is obviously superior to the others, the DQN can easily output the optimal routing strategy of that flow; when routes are selected for all flows in the network in a certain order, the earlier such a flow appears in the order, the faster the DQN can output its optimal routing strategy, and the smaller the solution space that needs to be explored to optimize the routing strategy of the whole network, which makes the processing easier. DQN processing here refers to: determining the best route for each flow in the network based on the trained DQN.
For the above reasons, this embodiment proposes a method for determining the priority of flows, i.e. the order in which the routing policies of the flows are determined, before DQN processing. In one implementation, determining the priority of each flow in step S102 may include the following steps:
s11, for each flow fi,jDetermining x alternative routes for the flow
Figure BDA0001768819780000091
Wherein i represents the flow fi,jJ denotes the flow fi,jThe target node of (1).
The source node and the forwarding nodes in the network may form multiple routes. For each flow f_{i,j}, some alternative routes may first be selected from these routes, so that one of the alternative routes can subsequently be selected as the actual route of flow f_{i,j}.
When the alternative routes of each flow are selected, each alternative route can satisfy the following conditions (a short code sketch after this list illustrates the selection):
condition 1: any alternative route for each flow is loop-free.
It can be understood that when a loop exists in an alternative route, it means that the data packet transmitted on the alternative route will not reach the destination node.
Condition 2: the path traversed by any alternative route for each flow is not exactly the same as the path traversed by the other alternative routes.
That is, for any flow, each alternative route of the flow is different from the other alternative routes of the flow. Since the purpose of this embodiment is to select, for any flow, an optimal actual route from the multiple alternative routes of the flow, the alternative routes are required to differ from one another so that they can be conveniently compared.
Condition 3: the distance of any alternative path of each flow meets a preset value.
For any flow, when selecting the best route for that flow, it is generally desirable that the distance of this best route be short. A preset value can therefore be set, and routes whose distance is smaller than the preset value can be used as alternative routes of the flow. The distance of a route means the distance from the source node (i.e., the start node) of the route to its destination node, which can be measured by the number of hops or by other common means. The preset value can be set according to actual needs. For different flows, the same preset value may be set, or different preset values may be set, which is not limited in the present invention.
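Under the three conditions above, the selection of alternative routes can be sketched as follows. The use of the networkx library and of hop count as the distance measure are assumptions; the patent allows any common distance measure, and all names here are hypothetical.

```python
import networkx as nx

def alternative_routes(graph, src, dst, x, max_hops):
    """Select up to x alternative routes for flow f_{src,dst}.

    Condition 1 (loop-free) and condition 2 (mutually distinct) are guaranteed because
    nx.shortest_simple_paths yields distinct simple paths in order of increasing length;
    condition 3 is the max_hops cut-off playing the role of the preset distance value.
    """
    routes = []
    for path in nx.shortest_simple_paths(graph, src, dst):
        if len(path) - 1 > max_hops:                  # distance measured here as hop count
            break                                     # later paths are only longer, so stop
        routes.append(list(zip(path, path[1:])))      # represent a route by the links it traverses
        if len(routes) == x:
            break
    return routes
```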
S12, calculate the evaluation value EV_r of the r-th alternative route of flow f_{i,j}. (The expression for EV_r is given in the original as an image and is not reproduced here.) The quantities entering this expression are: l_1, l_2, l_3, ..., l_L, the links traversed by the r-th alternative route; for each of these links, the total number of alternative routes of flows other than f_{i,j} that pass through links l_1, l_2, l_3, ..., l_L; and the maximum value of these totals.
After the alternative routes for each flow are determined, the alternative routes of the flow can be evaluated. Specifically, the degree to which each link traversed by each alternative route is occupied by other flows can be evaluated. In this embodiment, the evaluation value EV_r described above is used to evaluate how much the links traversed by each alternative route are occupied by other flows. From the expression for the evaluation value EV_r it can be concluded that the larger the EV_r value of an alternative route, the lower the utilization of the links traversed by this route, and the less likely congestion is to occur after flow f_{i,j} selects this route.
S13, calculate the priority reference value P_{i,j} of flow f_{i,j} by the following formula:

P_{i,j} = max(E) - max(E\{max(E)})

where E denotes the set of evaluation values of all alternative routes of flow f_{i,j}, E = {EV_1, EV_2, EV_3, ..., EV_x}; max(E) denotes the maximum value in set E; E\{max(E)} denotes the new set formed by removing the maximum value max(E) from set E; and max(E\{max(E)}) denotes the maximum value in the new set E\{max(E)}.
After the evaluation values of all alternative routes of flow f_{i,j} are calculated, the priority reference value P_{i,j} of flow f_{i,j} can be further calculated from these evaluation values. The priority reference value P_{i,j} represents: among all the alternative routes of flow f_{i,j}, the difference between the evaluation value of the alternative route with the largest evaluation value and the evaluation value of the alternative route with the second-largest evaluation value. That is, the priority reference value P_{i,j} represents the degree to which the most "ideal" of all the alternative routes of flow f_{i,j} is superior to the other routes. The larger the priority reference value P_{i,j}, the higher the priority of flow f_{i,j}.
S14, determine the priority of each flow in descending order of the priority reference values of the flows; the priority of the flow with the highest priority reference value is 0, and the priority of the flow with the lowest priority reference value is N-1.
After the priority reference values of all flows are calculated, they may be sorted from high to low. The sorted order is the order in which all flows are routed: the higher the priority reference value of a flow, the higher the priority of the flow, and the earlier this flow is handled when routing decisions are made.
In this embodiment, the priority of the flow with the highest priority reference value is represented as 0, and the priority of the flow with the lowest priority reference value is represented as N-1. After the priority of each flow is determined, the route of each flow can be sequentially determined according to the sequence from 0 to N-1.
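Given the evaluation values of each flow's alternative routes, the priority reference value and the resulting priority order follow directly from the formula in step S13. The sketch below assumes the EV values are already available; all names are hypothetical.

```python
def priority_reference(evaluations):
    """P_{i,j}: gap between the largest and the second-largest evaluation value of a flow's routes."""
    best, second = sorted(evaluations, reverse=True)[:2]
    return best - second

def assign_priorities(flow_evaluations):
    """flow_evaluations: {(i, j): [EV_1, ..., EV_x]} with x >= 2 for every flow.

    Returns {(i, j): priority}, where priority 0 is the flow with the largest P_{i,j}
    and priority N-1 the flow with the smallest.
    """
    ranked = sorted(flow_evaluations,
                    key=lambda flow: priority_reference(flow_evaluations[flow]),
                    reverse=True)
    return {flow: rank for rank, flow in enumerate(ranked)}
```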
S103, inputting real-time traffic information into a pre-trained deep Q network DQN, and sequentially determining the route of each flow according to the priority order of each flow; the DQN is obtained by training according to the sample traffic information and a sample routing strategy corresponding to the sample traffic information; the sample traffic information includes: the link bandwidth occupied by each sample flow, and the sample routing strategy comprises the following steps: and the sample route of each sample flow corresponding to the sample flow information.
In order to determine the route of each flow, the DQN may be trained according to the pre-obtained sample traffic information and a sample routing policy corresponding to the sample traffic information, so as to obtain a trained DQN. Furthermore, after the DQN is trained, real-time traffic information of the network can be input into the trained DQN, so that the trained DQN sequentially determines the route of each flow according to the priority order of each flow. The sample route of each sample flow corresponding to the sample traffic information may be considered as the optimal route of each sample flow, and therefore, the route of each flow determined by the DQN may be considered as the optimal route of each flow. Therefore, the training process is as follows: the best route to each sample stream is learned. Based on this, after training is finished, the real-time traffic information of the network is input into the trained DQN, and then the optimal route of each flow in the network can be output.
Before training, an environment for training may be set in advance. The training environment comprises a plurality of sample flows, a plurality of communication nodes (including source nodes and forwarding nodes), a plurality of links and a flow-level network flow-load model, and the corresponding relation between the network flow and the link load in the training environment can be obtained through the model. Because the routing decision of each sample flow in the training environment changes along with the change of the sample flow information, the real-time load of each link in the training environment can be determined in the process of determining the routing of each sample flow according to a group of sample flow information and based on the corresponding relation between the network flow and the link load in the training environment. It can be understood that, in the present solution, since the routes of the sample flows are determined according to the order of priority, after the route of one sample flow is determined, the load of each link changes, and the change affects the route determination process of the next sample flow. In this scheme, the link load can be viewed as a linear accumulation of all traffic on the link. Specifically, the load of a link is the sum of the link bandwidths occupied by the sample flows flowing through the link.
Based on the preset training environment, the DQN can be trained. In order to make the trained DQN suitable for routing each stream in the network, a training environment may be set to have a training network with a structure completely the same as that of the network, where the number of source nodes, the number of forwarding nodes, and the number of links in the training network are respectively the same as those in the network, and the bandwidth of each link in the training network is also respectively the same as that of each link in the network. The process of training DQN will be described in detail below.
The process of inputting the real-time traffic information into the pre-trained deep Q network DQN and determining the route of each flow in turn according to the priority order of the flows can refer to the learning process of one round for a group of sample traffic information described below.
According to the scheme provided by the embodiment of the invention, DQN obtained by training is obtained in advance according to the sample traffic information and the sample routing strategy corresponding to the sample traffic information, and then when the routing of each flow in the network is determined, after the real-time traffic information in the network is obtained, the real-time traffic information is input into the trained DQN, so that the DQN determines the routing of each flow in sequence according to the priority of each flow in the network. The embodiment of the invention can realize the load balance of the network in the networks with various topological structures, reduce the occurrence of network congestion and realize the optimization of the routing strategy in the network environment with highly dynamic change of network flow.
As shown in fig. 2, the process of training DQN in the embodiment of the present invention is described as follows, where the process of training DQN may include the following steps:
s201, constructing initial DQN.
In this embodiment, in order to train DQN, an initial DQN may be constructed. The initial DQN structure may include: a state input layer, at least one hidden layer, and an action output layer. The method includes that a group of sample traffic information can be input into the DQN at the state input layer, and after being processed by at least one hidden layer, a current route of each sample flow corresponding to the group of sample traffic information can be output at the action output layer. The current route is: and outputting a result after one-time learning of the DQN under the current parameters. In the initial DQN, the values of each parameter are initial values. The training process is to continuously optimize parameters in the DQN, so that the current route of each sample flow of DQN output after parameter optimization is continuously close to the sample route of each sample flow.
S202, obtaining the sample traffic information and a sample routing strategy corresponding to the sample traffic information.
After the initial DQN is constructed, the sample traffic information and the sample routing policy corresponding to the sample traffic information can be obtained. And training the DQN further according to the sample traffic information and a sample routing strategy corresponding to the sample traffic information.
When the network traffic information changes, the routing policy of the network needs to be adjusted accordingly, that is, the routing of each flow in the network needs to be adjusted. Therefore, in this embodiment, specifically for training DQN with respect to a set of sample traffic information, the training result is: so that the DQN can output sample routes for the respective sample streams corresponding to the set of sample traffic information. In practical application, the network traffic information may be a value of a link bandwidth occupied by each flow in any one group of networks, so that when training the DQN, a plurality of groups of different sample traffic information may be obtained, and training is performed on the plurality of groups of different sample traffic information respectively. In this way, when routing each flow in the network for a certain set of real-time traffic information in the network, the sample traffic information closest to the set of real-time traffic information may be determined first, and the DQN trained for the sample traffic information is directly used to determine the route of each flow in the network corresponding to the set of real-time traffic information.
And S203, inputting the sample flow information into the DQN, and obtaining the current route of each sample flow according to the preset priority order of each sample flow.
The process of presetting the priority order of the sample flows may refer to the aforementioned process of determining the priority of each flow in the network. After the priority order of the sample flows is determined, the current route of each sample flow can be determined in this order during each training pass, so that the training speed can be increased.
In one implementation, the inputting of the sample traffic information into the DQN in step S203, and obtaining the current route of each sample flow according to the preset priority order of each sample flow, may include the following steps:
s21, forming initial state information by the sample flow information, the initial value of the link load vector and the priority 0; the link load vector is a vector formed by link load values of each link in a preset training environment, and the link load value of any link is as follows: the sum of the link bandwidths occupied by the various sample flows through the link.
In this embodiment, one learning process for one set of sample traffic information is referred to as one round. In each round, DQN performs actions starting from an initial state, and then proceeds through a series of state transitions to an end state as shown in fig. 3. In each round, each time a set of state information is input, an action is output through the DQN, and the output action represents: a current route is determined for a sample stream. After the last action is output in a round and finished, all actions output in the round are represented as follows: the current routes of all sample flows are determined.
In each round, the state information input each time may consist of three parts: 1. the sample traffic information, i.e. the link bandwidth occupied by each flow, expressed as f_{1,2}, f_{1,3}, f_{1,4}, ..., f_{n,n-1}; 2. a link load vector, denoted (l_1, l_2, l_3, ..., l_m); 3. a priority value used to determine the order of the sample flows. Specifically, the sample traffic information does not change during the series of state transitions within one round, because the routing decisions made during the learning in one round are all made on the same set of sample traffic information. The link load vector represents the load of each link in the current state and changes continuously with the state transitions; after each state transition, the change of the link load vector is determined by the link load vector of the previous state and the action output in the previous state. The priority value indicates the priority of the routing decision of each sample flow and is also used to determine the order of the states.
In this embodiment, the value of the priority in the initial state information is set to 0, and then the value of the priority of the new state after the state transition is increased by 1 after each action is performed.
In the initial state information, the link load vector is a 0 vector since no route has been determined for any one sample flow.
S22, the initial state information is input to the DQN, and the current route of the sample flow with priority 0 is output.
After inputting the initial state information into the DQN, the DQN can output the current route of the sample flow with priority 0 (referred to as sample flow 0 for short) based on the current parameters. Specifically, DQN may select one of a plurality of candidate routes previously determined for sample flow 0 as the current route of sample flow 0. For determining the alternative route of the sample flow 0, the aforementioned manner for determining the alternative route of each flow in the network may be referred to.
And S23, updating the load vector of the current link according to the initial state information and the current route of the sample flow with the priority of 0, and increasing the priority by 1.
In the process shown in fig. 3, the action output after each input of state information affects the link load vector in the next state information. This is because, when the route r_{i,j} of a flow f_{i,j} is determined, the link load of the links through which this flow f_{i,j} passes changes. Therefore, the load added by sample flow 0 to each link through which it flows can be calculated according to the current route of sample flow 0 and the link bandwidth occupied by sample flow 0 in the initial state information, and this added load is then added, in the link load vector of the initial state information, to the link load value of each link through which sample flow 0 flows, to obtain an updated link load vector. The updated link load vector can be used as the link load vector in the next state information.
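One state-transition step (composing the state information and updating the link load vector after a sample flow has been routed) can be sketched as follows. The flat concatenated state encoding and all names are assumptions made for illustration.

```python
import numpy as np

def make_state(traffic, link_loads, priority):
    """State information = sample traffic information + link load vector + current priority value."""
    return np.concatenate([traffic, link_loads, [priority]]).astype(np.float32)

def apply_route(link_loads, route_links, bandwidth):
    """Add the bandwidth of the flow just routed to every link its chosen route traverses."""
    updated = link_loads.copy()
    for link in route_links:          # route_links: indices of the links on the chosen route
        updated[link] += bandwidth
    return updated
```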
S24, set s to 1, ..., N-1 and, looping over s from small to large, execute the following steps a1-a3, outputting the current routes of the sample flows with priorities 1 to N-1, where N represents the number of sample flows:
a 1: and forming the s-th state information by using the sample traffic information, the updated link load vector and the current priority.
This step may refer to step S21.
a 2: and inputting the s-th state information into the DQN, and outputting the current route of the sample flow with the priority of s.
This step may refer to step S22.
a 3: and updating the load vector of the current link according to the s-th state information and the current route of the sample flow with the priority of s, and increasing the priority by 1.
This step may refer to step S23.
By executing steps a1-a3 in a loop, the current routes of the sample flows with priorities 1 to N-1 can be output in sequence. When the current route of the last sample flow, i.e. sample flow N-1, has been output, the current routes of all sample flows are determined. The current routes of all sample flows may then be compared with the best routes of all sample flows to optimize the parameters of the DQN.
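Steps a1-a3, executed over the priorities in order, form one learning round. The sketch below strings together the make_state and apply_route helpers from the previous sketch with a greedy argmax action choice; exploration (e.g. epsilon-greedy) and batching are omitted, and all names remain hypothetical.

```python
import numpy as np
import torch

def run_round(q_net, traffic, flow_order, candidate_routes, flow_bandwidth, num_links):
    """One round: route every sample flow in priority order 0 .. N-1 using the current Q network."""
    link_loads = np.zeros(num_links, dtype=np.float32)   # initial link load vector is all zeros
    chosen_routes = {}
    for priority, flow in enumerate(flow_order):         # flow_order is sorted by priority
        state = make_state(traffic, link_loads, priority)
        with torch.no_grad():
            q_values = q_net(torch.from_numpy(state))
        # only the first len(candidate_routes[flow]) outputs correspond to this flow's routes
        action = int(q_values[:len(candidate_routes[flow])].argmax())
        route = candidate_routes[flow][action]           # route given as a list of link indices
        chosen_routes[flow] = route
        link_loads = apply_route(link_loads, route, flow_bandwidth[flow])
    return chosen_routes, link_loads
```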
And S204, calculating the value of the preset loss function according to the current route of each sample flow and the sample routing strategy.
During the training of DQN, a loss function can be preset. The gap between the current route of each sample flow and the sample route of each sample flow can be measured by the loss function.
In one implementation, on the basis of the implementation in step S203 (i.e., steps S21-S24), calculating the value of the preset loss function according to the current route of each sample flow and the sample routing policy in step S204 may include the following steps:
s31, calculating a target link load vector according to the N-1 th state information and the current route of the sample flow with the priority of N-1; wherein the target link load vector comprises: and the real-time link load value of each link in the preset training environment corresponding to the sample flow information.
After the current routes of the sample flows with the priority of N-1 are determined, the current routes of all the sample flows are determined. Therefore, the real-time link load value of each link in the training environment can be calculated to form a target link load vector, and the load balance state of the training environment is further evaluated according to the target link load vector. The manner of calculating the target link load vector may refer to step S23.
And S32, calculating an incentive function value MLV corresponding to the sample flow information according to the target link load vector.
And evaluating the load balance state of the training environment by using the calculated target link load vector so as to further optimize the routing strategy of the training environment. Evaluating the load balancing state of the training environment, namely: the learning result of one round is evaluated. Wherein, the load balancing state of the training environment refers to: the load conditions on the various links in the training environment.
The purpose of the invention is to reduce the occurrence probability and the severity of network congestion as much as possible. Specifically, two requirements need to be specified: 1. when the link load value l_k on any link in the training environment is below the maximum available bandwidth threshold t_k of that link, the link load value l_k should be kept as far below the bandwidth threshold t_k as possible; 2. when the link load value l_k exceeds the bandwidth threshold t_k, the link load value l_k should be kept as close to the bandwidth threshold t_k as possible. To meet these two requirements, the relationship between the link load value and the bandwidth threshold of each link in the training environment first needs to be described quantitatively. For this purpose, a Maximum Load Value (MLV) of the training environment is defined, whose expression is:
MLV = min((t_1 - l_1), (t_2 - l_2), (t_3 - l_3), ..., (t_m - l_m))
where l_1, l_2, l_3, ..., l_m represent the real-time link load values of links 1, 2, 3, ..., m, and t_1, t_2, t_3, ..., t_m represent the bandwidth thresholds of links 1, 2, 3, ..., m.
MLV represents: the difference between the bandwidth threshold of the most heavily loaded link in the training environment and its real-time link load value. When the value of the MLV is positive, the real-time link load values of all links in the training environment are less than their bandwidth thresholds and no congestion occurs in the training environment; in this case, the larger the value of the MLV, the more balanced the load in the training environment is considered to be. When the value of the MLV is negative, the real-time link load value of at least one link in the training environment exceeds its bandwidth threshold, and congestion occurs in the training environment.
Based on the above situation that MLV can represent, in the training of DQN, MLV can be used as a reward function, and the load balance state of the training environment can be evaluated by the reward function. That is, the rating may be expressed in terms of a value of a reward function. Each positive reward function value represents: rewarding the action output by the DQN in a round; each negative reward function value represents: penalizing the action output by DQN in one round. Training is complete after the DQN has gradually learned how to output actions to get larger values of the reward function through multiple rounds of learning. Based on the DQN after training, an optimal routing strategy can be provided aiming at the real-time traffic information of the network.
It will be appreciated that, in the expression of the MLV, the extreme value rather than the mean value is chosen to describe the congestion situation of the training environment because network congestion is often caused by uneven network load, and therefore any routing policy that may cause excessive concentration of network load needs to be penalized. The quality of a routing policy can easily be judged by measuring the most heavily loaded link in the network, whereas using the average value makes it difficult to distinguish good routing policies from bad ones.
In this embodiment, the reward function value MLV corresponding to the sample flow information may be calculated by the calculation formula of the MLV according to the target link load vector, so as to further evaluate the learning result of one round according to the reward function value MLV.
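The reward computation follows the MLV formula directly. A minimal sketch, with hypothetical names:

```python
def mlv_reward(link_loads, thresholds):
    """MLV = min over all links k of (t_k - l_k).

    Positive and large when the load is well balanced; negative when at least
    one link in the training environment is congested.
    """
    return min(t - l for t, l in zip(thresholds, link_loads))
```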
And S33, calculating the value of the preset loss function according to the reward function value and the sample routing strategy.
In this embodiment, the preset value of the loss function may be calculated according to the reward function value and the sample routing policy by the following formula:
L(θ) = E[(MLV + γ·max_a' Q(s', a'|θ) - Q(s, a|θ))²]
where L(θ) denotes the loss function; MLV denotes the reward function value; γ denotes the discount factor, with 0 ≤ γ ≤ 1; θ denotes the current network parameters of the DQN; Q(s, a|θ) denotes the cumulative reward obtained after inputting the initial state information s into the DQN and outputting the current route of each sample flow; a denotes the current route of each sample flow; and max_a' Q(s', a'|θ) denotes the optimal cumulative reward determined from the sample routing policy.
Specifically, s' denotes the next state reached after action a is performed, max_a' Q(s', a'|θ) denotes the maximum cumulative reward obtainable over the actions a' in state s', and a' ranges over all alternative routes of the sample flow corresponding to state s'.
The above cumulative reward Q(s, a|θ) can be calculated from the reward function value MLV. Specifically, the method of calculating the cumulative reward Q(s, a|θ) from the reward function value MLV, and the method of determining the optimal cumulative reward according to the sample routing policy, are known in the art and are not described here.
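A single parameter update under a loss of this form can be sketched as follows. The use of PyTorch, a single-transition update and the network's own estimate for the max_a' Q(s', a'|θ) term are assumptions made for illustration; in the patent this last term is determined from the sample routing policy, which is not reproduced here.

```python
import torch

def dqn_update(q_net, optimizer, state, action, reward, next_state, next_action_count, gamma=0.9):
    """One gradient step on (MLV + gamma * max_a' Q(s', a'|theta) - Q(s, a|theta))^2."""
    q_sa = q_net(state)[action]                       # Q(s, a|theta) for the route actually chosen
    with torch.no_grad():
        target = reward + gamma * q_net(next_state)[:next_action_count].max()
    loss = (target - q_sa) ** 2
    optimizer.zero_grad()
    loss.backward()                                   # back-propagation
    optimizer.step()                                  # gradient descent on the parameters theta
    return float(loss)

# Usage sketch (assumed hyperparameters): optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
```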
S205, when the calculated loss function value is not lower than the first preset value, adjusting the network parameters of the DQN, returning to input the sample flow information into the DQN, and obtaining the current route of each sample flow according to the preset priority order of each sample flow.
When the calculated loss function value is not lower than the first preset value, it indicates that the training effect of the DQN has not reached the expected effect, so the network parameters of the DQN can be adjusted, and the step S203 is executed. Specifically, the network parameters of the DQN can be adjusted by using backward propagation and gradient descent methods.
Of course, the first preset value can be set according to actual needs.
And S206, when the calculated loss function value is lower than the first preset value, finishing the training to obtain the trained DQN.
When the calculated loss function value is lower than the first preset value, the expected DQN training effect is achieved, and the training can be finished, so that the trained DQN is obtained. That is, based on the network parameters of the trained DQN, the optimal route of each flow in the network can be output.
In addition, in this embodiment, on the basis of the embodiment shown in fig. 1, the following steps may be further included:
S104 (not shown in the figure), updating the local flow table according to the route of each flow; and sending the updated flow table to each openflow switch, so that each openflow switch performs a corresponding operation on the data in the network according to the updated flow table.
The SDN controller may formulate a corresponding routing policy and forwarding mechanism according to the collected network traffic information. Thus, after determining the route of each flow, the SDN controller may update the local flow table directly according to the route of each flow. And then, the updated flow table may be sent to each openflow switch of a data plane in the SDN network architecture, so that each openflow switch performs a corresponding operation on the data in the network according to the updated flow table. For example, forwarding operations are performed on packets in the network. Therefore, each openflow switch can realize data transmission according to the optimal routing strategy of the network.
The step of updating the local flow table according to the route of each flow can be implemented by using the prior art, and is not described in detail herein.
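As one possible illustration of pushing the updated flow table down to the data plane, the sketch below uses the Ryu OpenFlow controller framework; Ryu is not mentioned in the patent, and the match fields, priority, and output port are placeholder values chosen only to show the shape of such a flow-table update.

from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3

class RouteInstaller(app_manager.RyuApp):
    """Sketch: install one hop of a chosen route as an OpenFlow flow entry."""
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPPacketIn, MAIN_DISPATCHER)
    def packet_in_handler(self, ev):
        datapath = ev.msg.datapath
        ofproto = datapath.ofproto
        parser = datapath.ofproto_parser

        # Placeholder match and output port; in practice they would be derived
        # from the route chosen by the trained DQN for this flow.
        match = parser.OFPMatch(eth_type=0x0800,
                                ipv4_src='10.0.0.1', ipv4_dst='10.0.0.2')
        actions = [parser.OFPActionOutput(2)]
        inst = [parser.OFPInstructionActions(ofproto.OFPIT_APPLY_ACTIONS, actions)]

        mod = parser.OFPFlowMod(datapath=datapath, priority=10,
                                match=match, instructions=inst)
        datapath.send_msg(mod)   # push the updated entry to the openflow switch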
Corresponding to the foregoing method embodiment, an embodiment of the present invention provides a deep reinforcement learning-based route decision apparatus in an SDN architecture, which is applied to an SDN controller, and as shown in fig. 4, the apparatus may include:
a first obtaining module 401, configured to obtain real-time traffic information in a network; wherein the real-time traffic information includes: a link bandwidth occupied by each flow in the network;
a first determining module 402, configured to determine a priority of each of the flows;
a second determining module 403, configured to input the real-time traffic information into a pre-trained deep Q network DQN, and sequentially determine a route of each flow according to a priority order of each flow;
the DQN is obtained by training according to sample traffic information and a sample routing strategy corresponding to the sample traffic information; the sample traffic information includes: the link bandwidth occupied by each sample flow, the sample routing policy includes: and the sample route of each sample flow corresponding to the sample traffic information.
According to the scheme provided by the embodiment of the invention, the DQN is trained in advance according to the sample traffic information and the sample routing policy corresponding to the sample traffic information. When the route of each flow in the network is to be determined, the real-time traffic information in the network is obtained and input into the trained DQN, so that the DQN determines the route of each flow in turn according to the priority of each flow in the network. Because the embodiment of the invention determines routes based on the pre-trained DQN, and the DQN can be trained on sample data of a network with the topological structure to be analyzed, the embodiment of the invention can reduce the occurrence of network congestion in networks with various topological structures and optimize the routing policy in a network environment whose traffic changes highly dynamically.
Further, on the basis of the embodiment shown in fig. 4, as shown in fig. 5, the routing decision device based on deep reinforcement learning in the SDN architecture provided by the embodiment of the present invention may further include:
a constructing module 501, configured to construct an initial DQN;
a second obtaining module 502, configured to obtain sample traffic information and a sample routing policy corresponding to the sample traffic information;
a third determining module 503, configured to input the sample traffic information into the DQN, and obtain a current route of each sample flow according to a preset priority order of each sample flow;
a calculating module 504, configured to calculate a preset value of a loss function according to the current route of each sample flow and the sample routing policy;
a first processing module 505, configured to adjust the network parameter of the DQN and trigger the third determining module 503 when the calculated value of the loss function is not lower than a first preset value;
a second processing module 506, configured to end the training when the calculated value of the loss function is lower than the first preset value, to obtain a trained DQN.
Optionally, the third determining module 503 may include:
the construction unit is used for constructing initial state information by using the sample flow information, the initial value of the link load vector and the priority 0; wherein the link load vector is a vector formed by link load values of each link in a preset training environment, and the link load value of any link is as follows: the sum of the link bandwidths occupied by each sample flow passing through the link;
a first output unit, configured to input the initial state information into the DQN, and output a current route of a sample stream with a priority of 0;
an updating unit, configured to update a current link load vector according to the initial state information and the current route of the sample flow with the priority of 0, and increase the priority by 1;
a second output unit, configured to set s = 1, …, N−1, cyclically execute the following steps a1–a3 in increasing order of s, and output the current route of the sample flows with priorities 1 to N−1, where N denotes the number of sample flows (this per-priority rollout is sketched after step a3 below):
a 1: forming the s-th state information by using the sample flow information, the updated link load vector and the current priority;
a 2: inputting the s-th state information into the DQN, and outputting the current route of the sample flow with the priority of s;
a 3: and updating the load vector of the current link according to the s-th state information and the current route of the sample flow with the priority of s, and increasing the priority by 1.
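For illustration, a minimal Python sketch of the per-priority rollout performed by the construction, output, and updating units above is given below; the state encoding, the link and route data, and the stand-in for the DQN forward pass are assumptions made only for this sketch.

import numpy as np

# Illustrative sizes: N sample flows, m links, x alternative routes per flow.
N, m, x = 5, 8, 3
rng = np.random.default_rng(0)

flow_bandwidth = rng.uniform(1.0, 5.0, size=N)   # bandwidth occupied by each flow
# route_links[f][r]: set of link indices used by the r-th alternative route of flow f
route_links = [[set(rng.choice(m, size=3, replace=False).tolist()) for _ in range(x)]
               for _ in range(N)]

def dqn_choose_route(state):
    """Stand-in for the DQN forward pass: returns an alternative-route index."""
    return int(np.argmax(state[:x]))             # placeholder decision rule

link_load = np.zeros(m)                          # initial link load vector
routes = {}
for s in range(N):                               # priorities 0, 1, ..., N-1 in order
    # s-th state: sample traffic info + current link load vector + current priority
    state = np.concatenate([flow_bandwidth, link_load, [float(s)]])
    r = dqn_choose_route(state)                  # current route of the flow with priority s
    routes[s] = r
    for link in route_links[s][r]:               # update the link load vector, then
        link_load[link] += flow_bandwidth[s]     # the priority is increased by 1

print(routes)
print(link_load)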
Optionally, the calculating module 504 may include:
a first calculation unit, configured to calculate a target link load vector according to the (N−1)-th state information and the current route of the sample flow with the priority of N−1; wherein the target link load vector comprises: the actual link load value of each link in the preset training environment corresponding to the sample traffic information;
a second calculation unit, configured to calculate the reward function value MLV corresponding to the sample traffic information according to the target link load vector;
and a third calculation unit, configured to calculate the value of the preset loss function according to the reward function value and the sample routing policy.
Optionally, the second calculating unit is specifically configured to calculate, according to the target link load vector, the reward function value MLV corresponding to the sample traffic information by using the following formula:
MLV = min((t1 − l1), (t2 − l2), (t3 − l3), …, (tm − lm))
wherein l1, l2, l3, …, lm denote the actual link load values of links 1, 2, 3, …, m, and t1, t2, t3, …, tm denote the bandwidth thresholds of links 1, 2, 3, …, m, respectively;
a third calculating unit, configured to calculate the value of the preset loss function according to the reward function value and the sample routing policy by using the following formula:
L(θ) = E[(MLV + γ·max_{a′} Q(s′, a′|θ) − Q(s, a|θ))²]
wherein L(θ) represents the loss function, MLV represents the reward function value, γ represents the discount factor with 0 ≤ γ ≤ 1, θ represents the current network parameters of the DQN, Q(s, a|θ) represents the cumulative reward obtained after the initial state information s is input into the DQN and the current route of each sample flow is output, a represents the current route of each sample flow, and max_{a′} Q(s′, a′|θ) represents the optimal cumulative reward determined according to the sample routing policy.
Optionally, the first determining module 402 may include:
a first determination unit, for determining, for each of the flows f_{i,j}, x alternative routes R_{i,j}^1, R_{i,j}^2, …, R_{i,j}^x of the flow; wherein i represents the source node of the flow f_{i,j}, and j represents the target node of the flow f_{i,j};
a fourth calculation unit, for calculating the evaluation value EV_r of the r-th alternative route R_{i,j}^r of the flow f_{i,j} by the following formula:
EV_r = max(n_{l1}, n_{l2}, n_{l3}, …, n_{lL})
wherein l1, l2, l3, …, lL represent the links traversed by the r-th alternative route R_{i,j}^r; n_{l1}, n_{l2}, n_{l3}, …, n_{lL} represent, for the links l1, l2, l3, …, lL respectively, the total number of times the alternative routes of the flows in the network other than the flow f_{i,j} pass through that link; and max(n_{l1}, n_{l2}, n_{l3}, …, n_{lL}) represents the maximum value of these totals;
a fifth calculation unit, for calculating the priority reference value P_{i,j} of the flow f_{i,j} by the following formula:
P_{i,j} = max(E) − max(E\{max(E)})
wherein E represents the set of the evaluation values of all the alternative routes of the flow f_{i,j}, E = {EV_1, EV_2, EV_3, …, EV_x}; max(E) represents the maximum value in the set E; E\{max(E)} represents the new set formed by removing the maximum value max(E) from the set E; and max(E\{max(E)}) represents the maximum value in the new set E\{max(E)};
a second determining unit, configured to determine the priority of each stream according to the order of the priority reference value of each stream; the priority of the stream with the highest priority reference value is 0, and the priority of the stream with the lowest priority reference value is N-1.
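For illustration, the following sketch computes the evaluation values and priority reference values described above and converts them into a priority order; the example routes are invented for this sketch, and the reading of EV_r as the maximum per-link count follows the definitions given above.

from collections import Counter

# Illustrative alternative routes: flow name -> list of routes, each a set of link ids.
alt_routes = {
    'f_1_4': [{0, 1}, {2, 3}, {0, 3}],
    'f_2_5': [{1, 4}, {2, 4}],
    'f_3_6': [{0, 2}, {3, 4}],
}

def other_flow_link_counts(flows, exclude):
    """n_l: how many alternative routes of the other flows pass through link l."""
    counts = Counter()
    for name, routes in flows.items():
        if name != exclude:
            for route in routes:
                counts.update(route)
    return counts

priority_ref = {}
for name, routes in alt_routes.items():
    n = other_flow_link_counts(alt_routes, exclude=name)
    evs = sorted((max(n[l] for l in route) for route in routes), reverse=True)
    # P = max(E) - max(E \ {max(E)}): gap between the two largest evaluation values
    priority_ref[name] = evs[0] - evs[1]

# Highest priority reference value gets priority 0, lowest gets priority N-1.
order = sorted(priority_ref, key=priority_ref.get, reverse=True)
priorities = {name: p for p, name in enumerate(order)}
print(priority_ref)
print(priorities)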
Optionally, the alternative route of each flow satisfies the following condition:
any alternative route of each flow is loop-free;
the path passed by any alternative route of each flow is not identical to the path passed by other alternative routes;
the distance of any alternative path of each flow meets a preset value.
In addition, an SDN controller is provided in an embodiment of the present invention, as shown in fig. 6, including a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 complete mutual communication through the communication bus 604,
a memory 603 for storing a computer program;
the processor 601 is configured to implement the routing decision method based on deep reinforcement learning in the SDN architecture according to any one of the foregoing embodiments when executing the program stored in the memory 603.
The communication bus mentioned in the SDN controller may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
In another embodiment of the present invention, a computer-readable storage medium is further provided, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computer, the computer is caused to execute a routing decision method based on deep reinforcement learning under the SDN architecture in any one of the foregoing embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the invention are brought about in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The term "comprising" is used to specify the presence of stated features, integers, steps, operations, elements, components, and/or groups thereof, but does not exclude the presence of other similar features, integers, steps, operations, components, or groups thereof.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the device/SDN controller/storage medium embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for relevant points, refer to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A routing decision method based on deep reinforcement learning under an SDN architecture is applied to a Software Defined Network (SDN) controller, and the method comprises the following steps:
acquiring real-time flow information in a network; wherein the real-time traffic information includes: a link bandwidth occupied by each flow in the network;
determining a priority for each of the flows;
inputting the real-time traffic information into a pre-trained deep Q network DQN, and sequentially determining the route of each flow according to the priority order of each flow;
the DQN is obtained by training according to sample traffic information and a sample routing strategy corresponding to the sample traffic information; the sample traffic information includes: the link bandwidth occupied by each sample flow, the sample routing policy includes: a sample route of each sample flow corresponding to the sample traffic information;
wherein the training process of the DQN comprises:
constructing initial DQN;
obtaining sample traffic information and a sample routing strategy corresponding to the sample traffic information;
inputting the sample traffic information into the DQN, and obtaining a current route of each sample flow according to a preset priority order of each sample flow;
calculating a preset loss function value according to the current route of each sample flow and the sample routing strategy;
when the calculated loss function value is not lower than a first preset value, adjusting the network parameters of the DQN, and returning to the step of inputting the sample traffic information into the DQN and obtaining the current route of each sample flow according to the preset priority order of each sample flow;
and when the calculated loss function value is lower than the first preset value, finishing the training to obtain the DQN after the training is finished.
2. The method of claim 1, wherein the inputting the sample traffic information into the DQN to obtain the current route of each sample flow according to a preset order of priority of each sample flow comprises:
forming initial state information by using the sample flow information, the initial value of the link load vector and the priority 0; wherein the link load vector is a vector formed by link load values of each link in a preset training environment, and the link load value of any link is as follows: the sum of the link bandwidths occupied by each sample flow passing through the link;
inputting the initial state information into the DQN, and outputting a current route of a sample flow with the priority of 0;
updating the load vector of the current link according to the initial state information and the current route of the sample flow with the priority of 0, and increasing the priority by 1;
setting s = 1, …, N−1, cyclically executing the following steps a1–a3 in increasing order of s, and outputting the current routes of the sample flows with priorities 1 to N−1, wherein N represents the number of the sample flows:
a 1: forming the s-th state information by using the sample flow information, the updated link load vector and the current priority;
a 2: inputting the s-th state information into the DQN, and outputting the current route of the sample flow with the priority of s;
a 3: and updating the load vector of the current link according to the s-th state information and the current route of the sample flow with the priority of s, and increasing the priority by 1.
3. The method of claim 2, wherein calculating the value of the pre-set penalty function based on the current route of each sample flow and the sample routing policy comprises:
calculating a target link load vector according to the (N−1)-th state information and the current route of the sample flow with the priority of N−1; wherein the target link load vector comprises: the real-time link load value of each link in the preset training environment corresponding to the sample traffic information;
calculating a reward function value MLV corresponding to the sample traffic information according to the target link load vector;
and calculating a preset loss function value according to the reward function value and the sample routing strategy.
4. The method of claim 3,
calculating a reward function value MLV corresponding to the sample flow information according to the target link load vector by using the following formula:
MLV = min((t1 − l1), (t2 − l2), (t3 − l3), …, (tm − lm))
wherein l1, l2, l3, …, lm represent the real-time link load values of links 1, 2, 3, …, m, and t1, t2, t3, …, tm represent the bandwidth thresholds of links 1, 2, 3, …, m, respectively;
calculating the value of a preset loss function according to the reward function value and the sample routing strategy by the following formula:
L(θ) = E[(MLV + γ·max_{a′} Q(s′, a′|θ) − Q(s, a|θ))²]
wherein L(θ) represents the loss function, MLV represents the reward function value, γ represents the discount factor with 0 ≤ γ ≤ 1, θ represents the current network parameters of the DQN, Q(s, a|θ) represents the cumulative reward obtained after the initial state information s is input into the DQN and the current route of each sample flow is output, a represents the current route of each sample flow, and max_{a′} Q(s′, a′|θ) represents the optimal cumulative reward determined according to the sample routing policy.
5. The method of claim 1, wherein the determining the priority of each flow comprises:
for each of the flows f_{i,j}, determining x alternative routes R_{i,j}^1, R_{i,j}^2, …, R_{i,j}^x of the flow; wherein i represents the source node of the flow f_{i,j}, and j represents the target node of the flow f_{i,j};
calculating the evaluation value EV_r of the r-th alternative route R_{i,j}^r of the flow f_{i,j} by the following formula:
EV_r = max(n_{l1}, n_{l2}, n_{l3}, …, n_{lL})
wherein l1, l2, l3, …, lL represent the links traversed by the r-th alternative route R_{i,j}^r; n_{l1}, n_{l2}, n_{l3}, …, n_{lL} represent, for the links l1, l2, l3, …, lL respectively, the total number of times the alternative routes of the flows in the network other than the flow f_{i,j} pass through that link; and max(n_{l1}, n_{l2}, n_{l3}, …, n_{lL}) represents the maximum value of these totals;
calculating the priority reference value P_{i,j} of the flow f_{i,j} by the following formula:
P_{i,j} = max(E) − max(E\{max(E)})
wherein E represents the set of the evaluation values of all the alternative routes of the flow f_{i,j}, E = {EV_1, EV_2, EV_3, …, EV_x}; max(E) represents the maximum value in the set E; E\{max(E)} represents the new set formed by removing the maximum value max(E) from the set E; and max(E\{max(E)}) represents the maximum value in the new set E\{max(E)};
determining the priority of each flow according to the high-low order of the priority reference value of each flow; the priority of the stream with the highest priority reference value is 0, and the priority of the stream with the lowest priority reference value is N-1.
6. The method of claim 5, wherein the alternative route for each flow satisfies the following condition:
any alternative route of each flow is loop-free;
the path passed by any alternative route of each flow is not identical to the path passed by other alternative routes;
the distance of any alternative path of each flow meets a preset value.
7. A routing decision device based on deep reinforcement learning under an SDN architecture is applied to an SDN controller, and the device comprises:
the first acquisition module is used for acquiring real-time flow information in a network; wherein the real-time traffic information includes: a link bandwidth occupied by each flow in the network;
a first determining module for determining a priority of each of the streams;
a second determining module, configured to input the real-time traffic information into a pre-trained deep Q network DQN, and sequentially determine a route of each flow according to a priority order of each flow;
the DQN is obtained by training according to sample traffic information and a sample routing strategy corresponding to the sample traffic information; the sample traffic information includes: the link bandwidth occupied by each sample flow, the sample routing policy includes: a sample route of each sample flow corresponding to the sample traffic information;
wherein the apparatus further comprises:
the construction module is used for constructing initial DQN;
the second acquisition module is used for acquiring the sample traffic information and a sample routing strategy corresponding to the sample traffic information;
a third determining module, configured to input the sample traffic information into the DQN, and obtain a current route of each sample flow according to a preset priority order of each sample flow;
a calculation module, configured to calculate a value of a preset loss function according to the current route of each sample flow and the sample routing policy;
the first processing module is used for adjusting the network parameters of the DQN and triggering the third determining module when the calculated value of the loss function is not lower than a first preset value;
and the second processing module is used for finishing the training when the calculated loss function value is lower than the first preset value to obtain the DQN after the training is finished.
8. An SDN controller, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete communication with each other through the communication bus;
the memory is used for storing a computer program;
the processor, when executing the program stored in the memory, implementing the method steps of any of claims 1-6.
CN201810945527.XA 2018-08-17 2018-08-17 Routing decision method and device based on deep reinforcement learning under SDN framework Active CN108900419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810945527.XA CN108900419B (en) 2018-08-17 2018-08-17 Routing decision method and device based on deep reinforcement learning under SDN framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810945527.XA CN108900419B (en) 2018-08-17 2018-08-17 Routing decision method and device based on deep reinforcement learning under SDN framework

Publications (2)

Publication Number Publication Date
CN108900419A CN108900419A (en) 2018-11-27
CN108900419B true CN108900419B (en) 2020-04-17

Family

ID=64354702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810945527.XA Active CN108900419B (en) 2018-08-17 2018-08-17 Routing decision method and device based on deep reinforcement learning under SDN framework

Country Status (1)

Country Link
CN (1) CN108900419B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022014916A1 (en) * 2020-07-15 2022-01-20 한양대학교 에리카산학협력단 Apparatus for determining packet transmission, and method for determining packet transmission schedule

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109379747B (en) * 2018-12-04 2022-04-12 北京邮电大学 Wireless network multi-controller deployment and resource allocation method and device
CN109768940B (en) * 2018-12-12 2020-12-29 北京邮电大学 Flow distribution method and device for multi-service SDN
CN109547340B (en) * 2018-12-28 2020-05-19 西安电子科技大学 SDN data center network congestion control method based on rerouting
CN109803344B (en) 2018-12-28 2019-10-11 北京邮电大学 A kind of unmanned plane network topology and routing joint mapping method
CN109614215B (en) * 2019-01-25 2020-10-02 广州大学 Deep reinforcement learning-based stream scheduling method, device, equipment and medium
CN110247795B (en) * 2019-05-30 2020-09-25 北京邮电大学 Intent-based cloud network resource service chain arranging method and system
CN110324260B (en) * 2019-06-21 2020-10-09 北京邮电大学 Network function virtualization intelligent scheduling method based on flow identification
CN110535770B (en) * 2019-08-30 2021-10-22 西安邮电大学 QoS-aware-based intelligent routing method for video stream in SDN environment
CN111010294B (en) * 2019-11-28 2022-07-12 国网甘肃省电力公司电力科学研究院 Electric power communication network routing method based on deep reinforcement learning
CN110995858B (en) * 2019-12-17 2022-02-25 大连理工大学 Edge network request scheduling decision method based on deep Q network
CN111200566B (en) * 2019-12-17 2022-09-30 北京邮电大学 Network service flow information grooming method and electronic equipment
CN111314171B (en) * 2020-01-17 2023-06-30 深圳供电局有限公司 SDN routing performance prediction and optimization method, equipment and medium
CN111585915B (en) * 2020-03-30 2023-04-07 西安电子科技大学 Long and short flow balanced transmission method and system, storage medium and cloud server
CN111526055A (en) * 2020-04-23 2020-08-11 北京邮电大学 Route planning method and device and electronic equipment
CN111917657B (en) * 2020-07-02 2022-05-27 北京邮电大学 Method and device for determining flow transmission strategy
CN112039767B (en) * 2020-08-11 2021-08-31 山东大学 Multi-data center energy-saving routing method and system based on reinforcement learning
CN111988220B (en) * 2020-08-14 2021-05-28 山东大学 Multi-target disaster backup method and system among data centers based on reinforcement learning
US11606265B2 (en) 2021-01-29 2023-03-14 World Wide Technology Holding Co., LLC Network control in artificial intelligence-defined networking
CN113347108B (en) * 2021-05-20 2022-08-02 中国电子科技集团公司第七研究所 SDN load balancing method and system based on Q-learning
CN113489654B (en) * 2021-07-06 2024-01-05 国网信息通信产业集团有限公司 Routing method, device, electronic equipment and storage medium
CN113923758B (en) * 2021-10-15 2022-06-21 广州电力通信网络有限公司 POP point selection access method in SD-WAN network
CN114039927B (en) * 2021-11-04 2023-09-12 国网江苏省电力有限公司苏州供电分公司 Control method for routing flow of power information network
CN113992595B (en) * 2021-11-15 2023-06-09 浙江工商大学 SDN data center congestion control method based on priority experience playback DQN

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1518291A (en) * 2003-01-13 2004-08-04 曹伟龙 Local network capable grading control electric device in communication mode
US9225635B2 (en) * 2012-04-10 2015-12-29 International Business Machines Corporation Switch routing table utilizing software defined network (SDN) controller programmed route segregation and prioritization
CN106779072A (en) * 2016-12-23 2017-05-31 深圳市唯特视科技有限公司 A kind of enhancing based on bootstrapping DQN learns deep search method
CN107911299A (en) * 2017-10-24 2018-04-13 浙江工商大学 A kind of route planning method based on depth Q study
CN108075974A (en) * 2016-11-14 2018-05-25 中国移动通信有限公司研究院 A kind of flow transmission control method, device and SDN architecture systems
CN108307435A (en) * 2018-01-29 2018-07-20 大连大学 A kind of multitask route selection method based on SDSIN
CN108390833A (en) * 2018-02-11 2018-08-10 北京邮电大学 A kind of software defined network transmission control method based on virtual Domain
CN108401015A (en) * 2018-02-02 2018-08-14 广州大学 A kind of data center network method for routing based on deeply study

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10340828A1 (en) * 2003-09-04 2005-04-28 Infineon Technologies Ag Test arrangement and method for selecting a test mode output channel
CN106559407A (en) * 2015-11-19 2017-04-05 国网智能电网研究院 A kind of Network traffic anomaly monitor system based on SDN
CN108011827A (en) * 2016-10-28 2018-05-08 中国电信股份有限公司 A kind of data forwarding method based on SDN, system and controller

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1518291A (en) * 2003-01-13 2004-08-04 曹伟龙 Local network capable grading control electric device in communication mode
US9225635B2 (en) * 2012-04-10 2015-12-29 International Business Machines Corporation Switch routing table utilizing software defined network (SDN) controller programmed route segregation and prioritization
CN108075974A (en) * 2016-11-14 2018-05-25 中国移动通信有限公司研究院 A kind of flow transmission control method, device and SDN architecture systems
CN106779072A (en) * 2016-12-23 2017-05-31 深圳市唯特视科技有限公司 A kind of enhancing based on bootstrapping DQN learns deep search method
CN107911299A (en) * 2017-10-24 2018-04-13 浙江工商大学 A kind of route planning method based on depth Q study
CN108307435A (en) * 2018-01-29 2018-07-20 大连大学 A kind of multitask route selection method based on SDSIN
CN108401015A (en) * 2018-02-02 2018-08-14 广州大学 A kind of data center network method for routing based on deeply study
CN108390833A (en) * 2018-02-11 2018-08-10 北京邮电大学 A kind of software defined network transmission control method based on virtual Domain

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NDN routing strategy based on centralized SDN topology updating; 尹弼柏 et al.; Journal of Beijing University of Posts and Telecommunications; 2018-08-15; full text *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022014916A1 (en) * 2020-07-15 2022-01-20 한양대학교 에리카산학협력단 Apparatus for determining packet transmission, and method for determining packet transmission schedule

Also Published As

Publication number Publication date
CN108900419A (en) 2018-11-27

Similar Documents

Publication Publication Date Title
CN108900419B (en) Routing decision method and device based on deep reinforcement learning under SDN framework
US11082290B2 (en) Method and apparatus for optimizing a software defined network configuration
EP3047609B1 (en) Systems and method for reconfiguration of routes
US8773992B2 (en) Methods and apparatus for hierarchical routing in communication networks
CN112486690B (en) Edge computing resource allocation method suitable for industrial Internet of things
US10833934B2 (en) Energy management in a network
US9680665B2 (en) Apparatus and method for dynamic hybrid routing in SDN networks to avoid congestion and balance loads under changing traffic load
CN110601973A (en) Route planning method, system, server and storage medium
WO2017181804A1 (en) System and method for communication network service connectivity
CN110225418B (en) HTTP video stream QoE route optimization method based on SDN
US10476780B2 (en) Routing packets based on congestion of minimal and non-minimal routes
CN113612692B (en) Centralized optical on-chip network self-adaptive route planning method based on DQN algorithm
US20200084142A1 (en) Predictive routing in multi-network scenarios
CN111512600A (en) Method, apparatus and computer program for distributing traffic in a telecommunications network
CN111800352B (en) Service function chain deployment method and storage medium based on load balancing
CN116390164A (en) Low orbit satellite network trusted load balancing routing method, system, equipment and medium
Houidi et al. Constrained deep reinforcement learning for smart load balancing
CN114423020B (en) LoRaWAN network downlink route control method and system
JP4548792B2 (en) Communication route control method, communication route control system and program for overlay network
CN111901237B (en) Source routing method and system, related device and computer readable storage medium
CN116094983A (en) Intelligent routing decision method, system and storage medium based on deep reinforcement learning
CN116170370A (en) SDN multipath routing method based on attention mechanism and deep reinforcement learning
Koryachko et al. Multipath adaptive routing in computer networks with load balancing
JP4852568B2 (en) Overlay network communication route determination method, system and program
CN111917657B (en) Method and device for determining flow transmission strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220119

Address after: 609-1, floor 6, building 1, No. 10, caihefang Road, Haidian District, Beijing 100080

Patentee after: Fenomen array (Beijing) Technology Co.,Ltd.

Address before: 100876 Beijing city Haidian District Xitucheng Road No. 10

Patentee before: Beijing University of Posts and Telecommunications

TR01 Transfer of patent right