CN114221691A - Software-defined air-space-ground integrated network route optimization method based on deep reinforcement learning - Google Patents

Info

Publication number: CN114221691A
Application number: CN202111558363.3A, filed 2021-12-17 by Nanjing Tech University
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 孙永亮, 廖森山, 陈沁柔
Assignee (current and original): Nanjing Tech University
Legal status: Pending
Classifications

    • H04B7/18513 Transmission in a satellite or space-based system
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • H04L45/12 Shortest path evaluation
    • H04L45/70 Routing based on monitoring results
    • H04L47/125 Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering


Abstract

The invention discloses a software-defined air-space-ground integrated network routing optimization method based on deep reinforcement learning. A deep reinforcement learning model is established according to the network's characteristics; the collected state data serve as the model's input, and a link weight matrix of the network is output through training. During data forwarding, K paths are calculated with the K Shortest Paths (KSP) algorithm to form a candidate path set, a suitable path is selected for forwarding according to the controller's real-time monitoring of link states, and the convergence speed of the deep reinforcement learning model is improved by calculating the reward value of the current state, thereby optimizing software-defined air-space-ground integrated network routing. The invention not only adapts effectively to dynamically changing network topology, but also significantly improves average end-to-end delay and throughput compared with existing methods, raising the data transmission efficiency of the air-space-ground integrated network.

Description

Software-defined air-space-ground integrated network route optimization method based on deep reinforcement learning
Technical Field
The invention relates to the field of wireless communication, and in particular to a software-defined air-space-ground integrated network route optimization method based on deep reinforcement learning.
Background
With the development of communication technology and the growth of Internet service demand, users' requirements for both the range and the quality of network communication keep increasing. Traditional ground-based networks offer good communication quality but cannot cover areas with harsh environments, such as forests, mountains and oceans. Space-based networks use satellites as relay nodes to guarantee global signal coverage, but owing to the space environment they suffer from problems such as long delay and high error rates. As user demand continues to grow, the air-space-ground integrated network, which combines ground-based and space-based networks, has become one of the effective solutions. It features wide coverage, high communication speed and high reliability, and can satisfy the network communication needs of different fields. However, because of problems such as dynamically changing network topology and poor link quality, the air-space-ground integrated network needs an effective route optimization strategy to improve network performance.
Owing to the complex characteristics of the air-space-ground integrated network, such as dynamic topology changes, high error rates and large transmission delays, it is difficult for the network to build a stable end-to-end transmission path while guaranteeing quality of service. Because they cannot track topology changes in real time, traditional static-topology routing algorithms cannot adjust their routing strategies according to real-time changes in node and link states. Dynamic-topology routing algorithms, in turn, place high demands on network hardware and occupy a large amount of node resources, so they cannot fully adapt to the limited node resources of the air-space-ground integrated network. Realizing data forwarding while adapting to the dynamically changing air-space-ground integrated network topology has therefore become an urgent problem.
In recent years, deep reinforcement learning algorithms have been widely applied in various scenarios. Deep reinforcement learning combines deep learning with reinforcement learning, improving the perception of the environment while preserving decision-making capability, and can directly control the whole process from raw input to output. According to how actions are selected during optimization, deep reinforcement learning can be divided into value-function-based and policy-gradient-based methods. Thanks to the development of network communication technology and the emergence of new network architectures, deep reinforcement learning has become feasible, in both software and hardware, for realizing dynamic routing under the air-space-ground integrated network architecture. Applying a deep reinforcement learning algorithm in the network routing module therefore offers a new idea for optimizing air-space-ground integrated network routing.
Disclosure of Invention
Aiming at the above problems, the invention provides a software-defined air-space-ground integrated network route optimization method based on deep reinforcement learning, which satisfies dynamic network quality-of-service requirements, improves data transmission efficiency, and finally realizes the optimization of network routing.
Different from existing approaches, the invention improves on them as follows: (1) a deep reinforcement learning model is established according to the characteristics of the air-space-ground integrated network, so that the network environment is better perceived and the stability and reliability of network routing are improved; (2) the long short-term memory (LSTM) network's ability to capture the internal relations between states is used to extract temporal features between adjacent states, improving the time sensitivity of the deep reinforcement learning model; (3) a candidate path set is calculated with the KSP algorithm and a suitable path is selected according to the network state detected by the controller, avoiding the local congestion caused by frequently using a single path. Compared with the prior art, the invention realizes an adaptive routing strategy according to the network link state, and achieves global network load balancing while obtaining the shortest path.
Beneficial effects of the method: (1) a deep reinforcement learning model is established according to the characteristics of the air-space-ground integrated network, improving the routing algorithm's adaptability to dynamic topology; (2) average end-to-end delay and throughput are significantly improved, raising the data transmission efficiency of the air-space-ground integrated network, with high theoretical value and practical significance.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a diagram showing a neural network structure in the present invention.
Fig. 3 is a diagram of the software-defined air-space-ground integrated network structure in the embodiment.
Fig. 4 is a simulation comparison graph of the normalized throughput of the route optimization algorithm of the present invention and other route optimization algorithms in the embodiment.
Fig. 5 is a simulation comparison graph of the average end-to-end delay of the route optimization algorithm of the present invention and other route optimization algorithms in the embodiment.
Detailed Description
The following describes a specific embodiment of the software-defined air-space-ground integrated network route optimization method based on deep reinforcement learning in detail, with reference to the flow of fig. 1.
As shown in fig. 1, the software-defined air-space-ground integrated network route optimization method based on deep reinforcement learning provided by the present invention includes:
Step 1: establish a software-defined air-space-ground integrated network topology according to the software-defined networking concept and the air-space-ground integrated network node parameters, and initialize a topology discovery module, a network perception module and a routing decision module.
Step 2: through the topology discovery module and the network perception module initialized in step 1, the controller monitors the network topology in the current state and the state data of each link in the current network, and stores the collected link state data as a state matrix L.
Step 3: model the network transmission process as a Markov decision process, establish a deep reinforcement learning model based on the twin delayed deep deterministic policy gradient (TD3) algorithm, input the state matrix L from step 2 into the deep reinforcement learning model, and output the link weight matrix W of the network topology through training.
The deep reinforcement learning model is built according to the characteristics of the air-space-ground integrated network; the specific building process is as follows:
First, the network transmission process is modeled as a Markov decision process, whose parameters to be designed include the state S, the action A and the reward value r.
a. The state matrix L collected by the controller serves as the state S of the Markov decision process, expressed as:

S = L = [(b_ij, d_ij)]_{N×N}    (1)

where b_ij and d_ij are the bandwidth and delay of link l_ij, respectively, and N is the total number of nodes.
b. The link weight matrix W output by the deep reinforcement learning model serves as the action A of the Markov decision process, expressed as:

A = W = [w_ij]_{N×N}    (2)

where w_ij is the weight of link l_ij output by the Actor network.
c. The bandwidth and delay of each link in the network are introduced to calculate the reward value r of the Markov decision process. The reward formula (equation (3), rendered as an image in the original) rewards high link bandwidth and low link delay, where α and β are adjustment factors determined according to the routing policy; a sketch of these MDP elements follows below.
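As an illustration, below is a minimal Python sketch of assembling the state matrix L and a bandwidth-versus-delay reward. The mean-based reward combination, and names such as build_state and reward, are assumptions made here for illustration; the patent's exact equation (3) is not reproduced.

```python
import numpy as np

def build_state(bandwidth: np.ndarray, delay: np.ndarray) -> np.ndarray:
    """Stack the per-link bandwidth b_ij and delay d_ij matrices into the
    state matrix L; bandwidth and delay are (N, N) arrays indexed by the
    node pair (i, j), with zeros where no link exists."""
    return np.stack([bandwidth, delay])          # shape (2, N, N)

def reward(bandwidth: np.ndarray, delay: np.ndarray,
           alpha: float = 1.0, beta: float = 1.0) -> float:
    """Reward value r: rewards high link bandwidth and penalizes high link
    delay, weighted by the adjustment factors alpha and beta.  This
    mean-based form is an assumed stand-in for the patent's equation (3)."""
    links = bandwidth > 0                        # consider existing links only
    return float(alpha * bandwidth[links].mean() - beta * delay[links].mean())
```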
Then, because the deep reinforcement learning model is based on the twin delayed deep deterministic policy gradient (TD3) algorithm, a suitably designed neural network structure is needed to update the value function. In the present embodiment, the neural network structure is shown in fig. 2.
The neural network module of the TD3 algorithm comprises an Actor module, a Critic_1 module and a Critic_2 module. Each module consists of an online network and a target network with identical neural network structures. The network structures of the Actor, Critic_1 and Critic_2 modules are as follows:
a. The Actor module comprises a Long Short-Term Memory (LSTM) layer and a Fully Connected (FC) layer; its input is the state matrix L and its output is the weight matrix W formed by the weights of the links in the network.
b. The Critic_1 and Critic_2 modules each comprise one LSTM layer followed by three fully connected layers. Their input has two parts, the state matrix L (the same input as the Actor network) and the weight matrix W output by the Actor network, and their output is the Q value of the corresponding action in the current state, as sketched below.
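To make this structure concrete, here is a minimal PyTorch sketch of the Actor and Critic modules (one LSTM layer feeding fully connected layers). The layer widths, the flattening of L and W into vectors per time step, and all names are illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """One LSTM layer + one FC layer: maps a sequence of state matrices L
    to the link weight matrix W, as in the Actor module described above."""
    def __init__(self, n_nodes: int, hidden: int = 128):
        super().__init__()
        self.n = n_nodes
        in_dim = 2 * n_nodes * n_nodes            # flattened (bandwidth, delay)
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_nodes * n_nodes)

    def forward(self, state_seq):                 # (batch, time, 2*N*N)
        out, _ = self.lstm(state_seq)
        w = torch.sigmoid(self.fc(out[:, -1]))    # last time step -> weights
        return w.view(-1, self.n, self.n)         # link weight matrix W

class Critic(nn.Module):
    """One LSTM layer + three FC layers: maps (L, W) to the Q value of the
    action in the current state; instantiated twice for Critic_1/Critic_2."""
    def __init__(self, n_nodes: int, hidden: int = 128):
        super().__init__()
        in_dim = 3 * n_nodes * n_nodes            # state (2 channels) + action
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state_seq, action_seq):     # both flattened per step
        x = torch.cat([state_seq, action_seq], dim=-1)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])                # Q value
```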
Step 4: input the link weight matrix W output by the deep reinforcement learning model in step 3 into the routing decision module initialized in step 1, obtain the optimal path by executing the K Shortest Paths (KSP) algorithm, forward the data, and calculate the reward value r_t of the current state.
a. Calculate k available paths using the KSP algorithm to form the candidate path set P, expressed as:

P = {p_i | i = 1, 2, …, k}    (4)

where p_i is the i-th path in the candidate path set.
b. Sort the paths in the candidate path set by path weight from low to high, and select the path with the lowest weight as the optimal path for packet transmission.
c. During transmission, when the controller detects that the available bandwidth of some node on the path is smaller than the packet size, select a suboptimal path from the candidate path set for forwarding; a sketch of this selection logic follows the list.
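A minimal sketch of this selection logic, using networkx's shortest_simple_paths (Yen's algorithm) as the KSP step; the available_bw callback and the other names are illustrative assumptions, not the patent's reference implementation.

```python
import itertools
import networkx as nx

def select_path(g: nx.DiGraph, src, dst, k: int,
                packet_size: float, available_bw) -> list:
    """KSP-based path selection: take the k lowest-weight paths
    (shortest_simple_paths already yields them in increasing weight order)
    and return the first one whose nodes all have enough available bandwidth
    for the packet; available_bw is a callable node -> remaining bandwidth
    supplied by the controller's real-time monitoring."""
    candidates = list(itertools.islice(
        nx.shortest_simple_paths(g, src, dst, weight="weight"), k))
    for path in candidates:                       # optimal first, then suboptimal
        if all(available_bw(node) >= packet_size for node in path):
            return path
    return candidates[0]                          # fall back to the shortest path
```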
Step 5: take the state matrix L_t of the current time slot t obtained in step 2, the state matrix L_{t+1} of the next time slot, the link weight matrix W_t of the current time slot t output in step 3, and the reward value r_t of the current state calculated in step 4; input (L_t, W_t, r_t, L_{t+1}) into the experience replay pool of the deep reinforcement learning model, update the parameters with a random sampling strategy, and iterate the training with the updated parameters until the model converges.
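A compact sketch of the replay pool and one random-sampling critic update. The hyperparameters and the twin-minimum TD3 target are standard choices assumed for illustration; the delayed actor and target-network updates of full TD3, and the LSTM sequence handling from the earlier sketch, are elided, so actor and critics are treated here as black-box callables with matching shapes.

```python
import random
from collections import deque
import torch
import torch.nn.functional as F

replay_pool = deque(maxlen=100_000)               # experience replay pool

def store(L_t, W_t, r_t, L_next):
    """Deposit one transition (L_t, W_t, r_t, L_{t+1}) into the pool."""
    replay_pool.append((L_t, W_t,
                        torch.as_tensor(r_t, dtype=torch.float32), L_next))

def critic_update(actor_tgt, critic1, critic2, critic1_tgt, critic2_tgt,
                  optimizer, batch_size=64, gamma=0.99):
    """One random-sampling update of the twin critics: TD3 bootstraps on
    the minimum of the two target critics to curb Q overestimation."""
    batch = random.sample(replay_pool, batch_size)
    L, W, r, L2 = (torch.stack(x) for x in zip(*batch))
    with torch.no_grad():
        W2 = actor_tgt(L2)                        # target policy's next action
        q_next = torch.min(critic1_tgt(L2, W2), critic2_tgt(L2, W2))
        target = r.unsqueeze(-1) + gamma * q_next
    loss = F.mse_loss(critic1(L, W), target) + F.mse_loss(critic2(L, W), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```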
The invention is analyzed below by way of an example. The air-space-ground integrated network topology consists of 3 geosynchronous orbit (GEO) satellites, 70 low Earth orbit (LEO) satellites and 16 ground stations. The GEO satellites have an orbital altitude of 36000 km, 1 orbital plane and an inclination of 0 degrees. The LEO satellites have an orbital altitude of 550 km, 7 orbital planes and an inclination of 53 degrees. Virtual switches are deployed on the LEO satellites and controllers on the GEO satellites to realize the software-defined air-space-ground integrated network, as shown in fig. 3.
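For reference, a small sketch of standing this topology up as a graph for simulation. The constellation sizes follow the embodiment (3 GEO, 70 LEO in 7 orbits of 10, 16 ground stations), while the link wiring pattern and all names are illustrative assumptions.

```python
import networkx as nx

def build_topology() -> nx.Graph:
    """Graph model of the embodiment's topology: 3 GEO satellites (hosting
    the controllers), 70 LEO satellites in 7 orbits of 10 (hosting virtual
    switches) and 16 ground stations.  The wiring below is an illustrative
    ring/grid pattern, not the patent's exact connectivity."""
    g = nx.Graph()
    geo = [f"geo{i}" for i in range(3)]
    leo = [[f"leo{o}_{s}" for s in range(10)] for o in range(7)]
    gs = [f"gs{i}" for i in range(16)]
    g.add_nodes_from(geo + [sat for orbit in leo for sat in orbit] + gs)
    for o in range(7):
        for s in range(10):
            g.add_edge(leo[o][s], leo[o][(s + 1) % 10])   # intra-orbit ISL
            g.add_edge(leo[o][s], leo[(o + 1) % 7][s])    # inter-orbit ISL
    for i, station in enumerate(gs):
        g.add_edge(station, leo[i % 7][i % 10])           # ground-satellite link
    return g
```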
A total of 150 data transmission simulation experiments were carried out; during the experiments, the bwm-ng and ping tools were used to monitor the normalized throughput and average end-to-end delay of the proposed method and the comparison methods, with the results shown in fig. 4 and fig. 5. Compared with a routing algorithm based on the deep deterministic policy gradient (DDPG) algorithm, the proposed algorithm achieves lower average end-to-end delay and higher normalized throughput. The proposed algorithm adaptively adjusts the routing strategy according to the network state, adapts well to dynamically changing air-space-ground integrated network scenarios, and has high theoretical value and practical significance.
The above description is only an embodiment of the present invention and is not intended to limit its scope; all equivalent structural or process transformations made using the contents of the specification and drawings, or applied directly or indirectly in related technical fields, are likewise included within the scope of protection of the present invention.

Claims (4)

1. A software-defined air-space-ground integrated network routing optimization method based on deep reinforcement learning, characterized by comprising the following steps:
Step one: according to the software-defined networking concept and the air-space-ground integrated network node parameters, build a software-defined air-space-ground integrated network topology, and initialize a topology discovery module, a network perception module and a routing decision module;
Step two: through the topology discovery module and the network perception module initialized in step one, the controller monitors the network topology in the current state and the state data of each link in the current network, and stores the collected link state data as a state matrix L;
Step three: model the network transmission process as a Markov decision process, establish a deep reinforcement learning model based on the twin delayed deep deterministic policy gradient (TD3) algorithm, input the state matrix L from step two into the deep reinforcement learning model, and output the link weight matrix W of the network topology through training;
Step four: input the link weight matrix W output by the deep reinforcement learning model established in step three into the routing decision module initialized in step one, obtain the optimal path by executing the K Shortest Paths (KSP) algorithm, forward the data, and calculate the reward value r_t of the current state;
Step five: take the state matrix L_t of the current time slot t obtained in step two, the state matrix L_{t+1} of the next time slot, the link weight matrix W_t of the current time slot t output in step three, and the reward value r_t of the current state calculated in step four; input (L_t, W_t, r_t, L_{t+1}) into the experience replay pool of the deep reinforcement learning model, update the parameters with a random sampling strategy, and iterate the training with the updated parameters until the model converges.
2. The software-defined air-space-ground integrated network routing optimization method based on deep reinforcement learning according to claim 1, wherein the process of modeling the network transmission process as a Markov decision process in step three is as follows:
a. the state matrix L collected by the controller serves as the state S of the Markov decision process, expressed as:

S = L = [(b_ij, d_ij)]_{N×N}    (1)

where b_ij and d_ij are the bandwidth and delay of link l_ij, respectively, and N is the total number of nodes;
b. the link weight matrix W output by the deep reinforcement learning model serves as the action A of the Markov decision process, expressed as:

A = W = [w_ij]_{N×N}    (2)

where w_ij is the weight of link l_ij output by the Actor network;
c. the bandwidth and delay of each link in the network are introduced to calculate the reward value r of the Markov decision process; the reward formula (equation (3), rendered as an image in the original) rewards high link bandwidth and low link delay, with α and β as adjustment factors determined according to the routing policy.
3. The software-defined air-space-ground integrated network routing optimization method based on deep reinforcement learning according to claim 1, wherein the deep reinforcement learning model in step three is established based on the twin delayed deep deterministic policy gradient (TD3) algorithm; the neural network module of the TD3 algorithm comprises an Actor module, a Critic_1 module and a Critic_2 module, each composed of an online network and a target network with identical neural network structures, and the network structures of the Actor, Critic_1 and Critic_2 modules are as follows:
a. the Actor module comprises a Long Short-Term Memory (LSTM) layer and a Fully Connected (FC) layer; its input is the state matrix L and its output is the weight matrix W formed by the weights of all links in the network;
b. the Critic_1 and Critic_2 modules each comprise one LSTM layer followed by three fully connected layers; their input has two parts, the state matrix L (the same input as the Actor network) and the weight matrix W output by the Actor network, and their output is the Q value of the corresponding action in the current state.
4. The software-defined air-space-ground integrated network routing optimization method based on deep reinforcement learning according to claim 1, wherein the KSP algorithm in step four calculates the optimal path in the topology as follows:
Step one: calculate k available paths using the KSP algorithm to form the candidate path set P, expressed as:

P = {p_i | i = 1, 2, …, k}    (4)

where p_i is the i-th path in the candidate path set;
Step two: sort the paths in the candidate path set by path weight from low to high, and select the path with the lowest weight as the optimal path for packet transmission;
Step three: during transmission, when the controller detects that the available bandwidth of some node on the path is smaller than the packet size, select a suboptimal path from the candidate path set for forwarding.
CN202111558363.3A 2021-12-17 2021-12-17 Software-defined air-space-ground integrated network route optimization method based on deep reinforcement learning Pending CN114221691A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111558363.3A CN114221691A (en) 2021-12-17 2021-12-17 Software-defined air-space-ground integrated network route optimization method based on deep reinforcement learning


Publications (1)

Publication Number Publication Date
CN114221691A true CN114221691A (en) 2022-03-22

Family

ID=80704169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111558363.3A Pending CN114221691A (en) 2021-12-17 2021-12-17 Software-defined air-space-ground integrated network route optimization method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114221691A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111988225A (en) * 2020-08-19 2020-11-24 西安电子科技大学 Multi-path routing method based on reinforcement learning and transfer learning
CN113328938A (en) * 2021-05-25 2021-08-31 电子科技大学 Network autonomous intelligent management and control method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KANG Chaohai: "TD3 algorithm based on dynamic delayed policy update", Journal of Jilin University, vol. 38, no. 4, pages 1 *
XIAO Yang: "A dynamic routing algorithm based on deep reinforcement learning", Information and Communications Technology and Policy, pages 1 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115022231A (en) * 2022-06-30 2022-09-06 武汉烽火技术服务有限公司 Optimal path planning method and system based on deep reinforcement learning
CN115150335A (en) * 2022-06-30 2022-10-04 武汉烽火技术服务有限公司 Optimal flow segmentation method and system based on deep reinforcement learning
CN115150335B (en) * 2022-06-30 2023-10-31 武汉烽火技术服务有限公司 Optimal flow segmentation method and system based on deep reinforcement learning
CN115022231B (en) * 2022-06-30 2023-11-03 武汉烽火技术服务有限公司 Optimal path planning method and system based on deep reinforcement learning
CN116366529A (en) * 2023-04-20 2023-06-30 哈尔滨工业大学 Adaptive routing method based on deep reinforcement learning in SDN (software defined network) background
CN117395188A (en) * 2023-12-07 2024-01-12 南京信息工程大学 Deep reinforcement learning-based heaven-earth integrated load balancing routing method
CN117395188B (en) * 2023-12-07 2024-03-12 南京信息工程大学 Deep reinforcement learning-based heaven-earth integrated load balancing routing method

Similar Documents

Publication Publication Date Title
CN110012516B (en) Low-orbit satellite routing strategy method based on deep reinforcement learning architecture
CN109714219B (en) Virtual network function rapid mapping method based on satellite network
CN114221691A (en) Software-defined air-space-ground integrated network route optimization method based on deep reinforcement learning
CN107294592B (en) Satellite network based on distributed SDN and construction method thereof
Rischke et al. QR-SDN: Towards reinforcement learning states, actions, and rewards for direct flow routing in software-defined networks
CN112821940B (en) Satellite network dynamic routing method based on inter-satellite link attribute
CN113572686A (en) Heaven and earth integrated self-adaptive dynamic QoS routing method based on SDN
CN113315569B (en) Satellite reliability routing method and system with weighted link survival time
CN109586785B (en) Low-orbit satellite network routing strategy based on K shortest path algorithm
CN108307435A (en) A kind of multitask route selection method based on SDSIN
CN112600609B (en) Network capacity estimation method of satellite network system
Wang et al. Fuzzy-CNN based multi-task routing for integrated satellite-terrestrial networks
Han et al. Time-varying topology model for dynamic routing in LEO satellite constellation networks
CN113258982A (en) Satellite information transmission method, device, equipment, medium and product
Ebrahim et al. A deep learning approach for task offloading in multi-UAV aided mobile edge computing
CN117579126A (en) Satellite mobile edge calculation unloading decision method based on deep reinforcement learning
Qiao et al. A service function chain deployment scheme of the software defined satellite network
Mao et al. Digital Twin Satellite Networks Towards 6G: Motivations, Challenges, and Future Perspectives
CN114513241B (en) SDN-based high-performance QoS guaranteed low-orbit satellite inter-satellite routing method
CN112020085B (en) Node failure sweep effect analysis method for aviation ad hoc network
Liu et al. A routing model based on multiple-user requirements and the optimal solution
Wu et al. QoS provisioning in space information networks: Applications, challenges, architectures, and solutions
Wei et al. Dynamic controller placement for software-defined LEO network using deep reinforcement learning
CN116938322B (en) Networking communication method, system and storage medium of space-based time-varying topology
Shi et al. Heterogeneous satellite network routing algorithm based on reinforcement learning and mobile agent

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination