CN117041129A - Low-orbit satellite network flow routing method based on multi-agent reinforcement learning - Google Patents

Low-orbit satellite network flow routing method based on multi-agent reinforcement learning

Info

Publication number
CN117041129A
CN117041129A
Authority
CN
China
Prior art keywords
satellite
delay
routing
network
data packet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311071886.4A
Other languages
Chinese (zh)
Inventor
赖俊宇
刘华烁
徐国尧
朱俊宏
甘炼强
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202311071886.4A
Publication of CN117041129A
Legal status: Pending


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/02Topology update or discovery
    • H04L45/08Learning-based routing, e.g. using neural networks or artificial intelligence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/14Relay systems
    • H04B7/15Active relay systems
    • H04B7/185Space-based or airborne stations; Stations for satellite systems
    • H04B7/18521Systems of inter linked satellites, i.e. inter satellite service
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/38Flow based routing


Abstract

The invention discloses a low-orbit satellite network flow routing method based on multi-agent reinforcement learning, belonging to the technical field of computer networks and communication. The method combines reinforcement learning with data-flow-based routing. It effectively reduces the number of deep neural network model inferences performed on data packets in a low-orbit satellite broadband network, significantly reducing the cumulative time spent on model inference. It also effectively improves the routing performance of large-scale low-orbit satellite broadband networks, better meeting their network performance requirements.

Description

Low-orbit satellite network flow routing method based on multi-agent reinforcement learning
Technical Field
The invention belongs to the technical field of computer networks and communication, and particularly relates to a flow-based routing method based on Multi-Agent Deep Reinforcement Learning (MADRL) in low-orbit satellite networks.
Background
In recent years, with the rapid growth of ubiquitous human communication demands and the continuous emergence of innovative applications, large-scale Low Earth Orbit (LEO) satellite networks, such as the Starlink constellation proposed by SpaceX, have become research hotspots in industry and academia. The Low-orbit Satellite Broadband Network (LEO Satellite Broadband Network, LSBN) is widely regarded as an important complement to future terrestrial networks and will play a key role in the upcoming sixth-generation (6G) mobile communication system. Compared with traditional high-orbit satellite networks, low-orbit satellite broadband networks offer seamless coverage of the earth's surface, low point-to-point communication delay, and low communication transmission power consumption. However, the high dynamics and mobility of low-orbit satellites result in intermittent link connections and a dynamic network topology, which makes conventional routing algorithms designed for terrestrial networks unsuitable for direct use in large-scale low-orbit satellite broadband networks.
On the other hand, artificial intelligence techniques based on Deep Reinforcement Learning (DRL) are finding increasing application in many scientific fields. Researchers have used deep reinforcement learning to implement packet routing and switch forwarding in conventional terrestrial networks, and academia has recently begun to study low-orbit satellite broadband network routing methods based on deep reinforcement learning. Preliminary experimental evaluations show that routing methods based on deep reinforcement learning can outperform traditional routing algorithms in low-orbit satellite broadband networks. However, most existing studies simply assume that the routing decision for a packet can be made and completed immediately after a router receives it. This assumption is over-idealized: it ignores the time required for Deep Neural Network (DNN) model inference when making a packet routing decision in an actual network environment. Given the limited computational resources on low-orbit satellites, the time required for deep neural network model inference cannot be neglected. It increases the transmission delay of data packets in the network, raises the packet loss rate, and ultimately limits the throughput of network traffic in the low-orbit satellite broadband network. Ignoring the deep neural network model inference time therefore threatens the correctness of the conclusions drawn by these prior research efforts.
Disclosure of Invention
In order to eliminate the negative effect of deep neural network model inference time on routing performance, the invention provides a flow-based routing method based on Multi-Agent Deep Reinforcement Learning (MADRL), which makes routing decisions for network data flows rather than for each individual data packet. Flow routing is formalized as a multi-agent decision problem based on a Partially Observable Markov Decision Process (POMDP). Each low-orbit satellite acts as an agent that can forward a network data flow to one of its neighboring satellites according to its own policy. It is emphasized that the deep neural network model on an agent performs inference only when it routes the first packet of a particular data flow; subsequent packets of that flow are forwarded according to the same routing decision as the first packet. Because the topology dynamics of a low-orbit satellite broadband network can cause routing failures and thus degrade flow routing performance, the invention further provides an adaptive data flow route updating method that automatically updates routing decisions and adapts to the dynamically changing network topology, enhancing the performance of the proposed flow routing method.
The technical scheme adopted by the invention is as follows: a low-orbit satellite network flow routing method based on multi-agent reinforcement learning, the method comprising:
a1, constructing a low-orbit constellation broadband network distributed inter-satellite routing model;
Firstly, a low-orbit constellation network routing model is constructed; the model covers key elements such as inter-satellite communication links, satellite motion trajectories, constellation network topology, and user distribution; through in-depth analysis of the architecture and characteristics of the target system, an accurate low-orbit constellation network routing model is constructed;
A low-orbit satellite is denoted $Sat_i$, $i \in \{1, 2, \ldots, Total\}$, where $Total$ represents the total number of low-orbit satellites. Each satellite is assumed to establish n inter-satellite links to communicate with its neighboring satellites: links to the satellites in front of and behind it in the same orbit, and to the satellites in the adjacent orbits on its left and right. $link_{i,j}$ represents the inter-satellite link from $Sat_i$ to $Sat_j$, where i denotes the number of the transmitting satellite and j the number of the receiving satellite;
When a low-orbit satellite receives a data packet, it selects a next-hop satellite according to the routing algorithm running on it, and forwards the data packet to the next-hop satellite through an inter-satellite link. This process introduces delays of two kinds: decision delay and forwarding delay. Decision delay refers to the time from receiving a data packet to making a routing decision; forwarding delay refers to the time from making a routing decision to the next-hop satellite receiving the data packet. Specifically, for data packet k, the decision delay $D^{dec}_{i,k}$ comprises two parts: the decision queuing delay $D^{dq}_{i,k}$ and the decision-making delay $D^{dm}_{i,k}$. Decision queuing delay is the time a packet queues on a low-orbit satellite waiting for a routing decision, while decision-making delay is the time the satellite needs to make the routing decision. In the packet forwarding process, the forwarding delay $D^{fwd}_{i,j,k}$ comprises several parts: forwarding queuing delay $D^{fq}_{i,j,k}$, transmission delay $D^{tr}_{i,j,k}$, and propagation delay $D^{pp}_{i,j}$. Forwarding queuing delay is the time a data packet waits in a low-orbit satellite's forwarding queue; transmission delay is the time needed to transmit the data packet over the inter-satellite link; propagation delay is the time the signal needs to travel along the inter-satellite link from one satellite to the other.

If bandwidth $B_{i,j,k}$ is allocated on $link_{i,j}$ for transmitting data packet k, the transmission delay on the link is

$$D^{tr}_{i,j,k} = \frac{S_k}{B_{i,j,k}}$$

where $S_k$ is the size of packet k. If $link_{i,j}$ temporarily has no free bandwidth, packet k is buffered in the forwarding queue cache of $link_{i,j}$, which introduces a forwarding queuing delay $D^{fq}_{i,j,k}$; when the cache reaches its maximum capacity, subsequently arriving data packets are discarded. On the other hand, assume that at time t the coordinates of $Sat_i$ and $Sat_j$ are $(x_{i,t}, y_{i,t}, z_{i,t})$ and $(x_{j,t}, y_{j,t}, z_{j,t})$. The spatial distance between the two satellites is

$$Dist_{i,j,t} = \sqrt{(x_{i,t}-x_{j,t})^2 + (y_{i,t}-y_{j,t})^2 + (z_{i,t}-z_{j,t})^2}$$

Taking $Dist_{i,j,t}$ as the propagation distance of $link_{i,j}$, the signal propagation delay is

$$D^{pp}_{i,j} = \frac{Dist_{i,j,t}}{c}$$

where c is the speed of light in vacuum. The total delay $D_{i,k}$ of routing packet k on low-orbit satellite $Sat_i$ is therefore

$$D_{i,k} = D^{dec}_{i,k} + D^{fwd}_{i,j,k} = D^{dq}_{i,k} + D^{dm}_{i,k} + D^{fq}_{i,j,k} + D^{tr}_{i,j,k} + D^{pp}_{i,j}$$
If the next-hop low-orbit satellite is not the target node, the above procedure will be performed again on the next-hop low-orbit satellite;
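The per-hop delay model above can be sketched in a few lines of Python (an illustrative sketch, not part of the claimed method; the function names and unit choices are assumptions):

```python
import math

C = 299_792_458.0  # speed of light in vacuum, m/s

def transmission_delay(packet_bits: float, bandwidth_bps: float) -> float:
    # D_tr = S_k / B_{i,j,k}: time to push the packet onto the inter-satellite link
    return packet_bits / bandwidth_bps

def propagation_delay(pos_i, pos_j) -> float:
    # D_pp = Dist_{i,j,t} / c, with Dist the Euclidean distance (in metres)
    # between the two satellites' coordinates at time t
    return math.dist(pos_i, pos_j) / C

def total_hop_delay(dec_queue, dec_make, fwd_queue, t_trans, t_prop) -> float:
    # D_{i,k} = decision delay (queuing + making)
    #         + forwarding delay (queuing + transmission + propagation)
    return dec_queue + dec_make + fwd_queue + t_trans + t_prop
```

For example, an 8 Mbit packet on a 1 Gbps link costs 8 ms of transmission delay, and a hop spanning one light-second of distance costs 1 s of propagation delay; the total per-hop delay is just the sum of the five components.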
A2: modeling the routing problem as a partially observable Markov decision process;
The routing performance optimization problem of the low-orbit constellation broadband network is converted into a partially observable Markov decision process, which better describes the uncertainty and randomness of the system and effectively handles the complex decision problem; the process P is described by the following 6-tuple:
P = (S, A, T, R, O, γ)
where S is the global state space of the environment, A is the action set shared by the agents, T is the state transition function of the environment, $R: S \times A \to \mathbb{R}$ is the global reward function shared by the agents, O is the local observation state space of the agents, and $\gamma \in [0,1]$ is the discount factor balancing long-term rewards; the local observation state, actions and reward function are defined more specifically as follows:
Actions: after an agent receives a data packet, it must make a routing decision for it; the agent selects an action from the action space $A_i = \{a^{up}_i, a^{down}_i, a^{left}_i, a^{right}_i\}$ to route the data packet, where $a^{up}_i$, $a^{down}_i$, $a^{left}_i$ and $a^{right}_i$ each represent delivering the data packet to one of its four adjacent satellites;
Reward function: the goal of each agent is to learn an optimal routing strategy to improve its routing performance; to ensure that each agent learns optimal routing decisions, the reward $r_i(t)$ for $Sat_i$ routing data packet k at time t is

$$r_i(t) = \begin{cases} -\psi, & \text{if packet } k \text{ is lost} \\ -\left(\kappa_1 \cdot dis_{j,k} + \kappa_2 \cdot \hat{D}^{fwd}_k + \kappa_3 \cdot \hat{D}^{dec}_{j,k}\right), & \text{otherwise} \end{cases}$$

where ψ is the penalty imposed on the agent when the data packet is lost, $dis_{j,k}$ is the normalized spatial distance between the next-hop satellite $Sat_j$ and the target satellite, $\hat{D}^{fwd}_k$ is the normalized forwarding delay of packet k, and $\hat{D}^{dec}_{j,k}$ is the normalized decision delay of routing packet k on $Sat_j$; $\kappa_1$, $\kappa_2$ and $\kappa_3$ are weights balancing these factors, and the cumulative discounted reward is computed as $\sum_t \gamma^t r_i(t)$, with $\gamma \in [0,1]$ the discount factor;
Local observation state: in a low-orbit satellite broadband network, each low-orbit satellite acts as an agent; each satellite can communicate with the four adjacent satellites above, below, to its left and to its right, and its local observation state space is defined as $o_i = \{Dis_i, B_i, L^{fwd}_i, L^{dec}_i\}$, where $Dis_i$ contains the spatial distances from satellite $Sat_i$'s four adjacent satellites to the target satellite of the current data packet k; the invention uses the Simplified General Perturbations (SGP4) model to estimate the spatial positions of the adjacent satellites and the target satellite; $B_i$ is the available network bandwidth of the four inter-satellite links connecting $Sat_i$; $L^{fwd}_i$ is the current traffic load of $Sat_i$'s four forwarding queues, and $L^{dec}_i$ is the load of the decision queues on the four adjacent satellites; because these elements have different value ranges, they are normalized before use;
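The reward structure above can be sketched as follows (an illustrative sketch; the exact weighting form is reconstructed from the surrounding definitions, and the default weight values are assumptions, not taken from the patent):

```python
def reward(lost: bool, dis_next: float, fwd_delay: float, dec_delay: float,
           k1: float = 1.0, k2: float = 1.0, k3: float = 1.0,
           psi: float = 10.0) -> float:
    # Packet loss incurs the fixed penalty psi; otherwise the agent pays a
    # weighted cost over the normalized next-hop distance-to-target,
    # forwarding delay and decision delay (inputs assumed already in [0, 1]).
    if lost:
        return -psi
    return -(k1 * dis_next + k2 * fwd_delay + k3 * dec_delay)
```

A next hop that is closer to the target and less congested yields a reward closer to zero, so maximizing cumulative discounted reward pushes each agent toward short, low-delay paths.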
a3: designing a routing method based on multi-agent deep reinforcement learning;
By utilizing deep reinforcement learning, the agents cooperate and learn, continuously obtaining feedback and rewards from the low-orbit constellation network environment, and the inter-satellite routing strategy is optimized to improve the routing performance and throughput of the whole network. Each satellite contains two deep neural networks: an estimated Q network $Q_i(o_i, a_i; \mu_i)$ and a target Q network $Q'_i(o_i, a_i; \mu'_i)$, parameterized by $\mu_i$ and $\mu'_i$ respectively. At each decision time t, satellite $Sat_i$ considers its local observation $o_i(t)$ and selects an action $a_i(t)$ from the action space $A_i$ based on an ε-greedy policy:

$$a_i(t) = \begin{cases} \text{a random action from } A_i, & \text{with probability } \varepsilon \\ \arg\max_{a} Q_i(o_i(t), a; \mu_i), & \text{with probability } 1-\varepsilon \end{cases}$$

When the agent selects an action based on its current observation and interacts with the environment, the current state transitions to the next state $o_i(t+1)$, and $Sat_i$ receives a reward $r_i(t)$. The experience tuple $\{o_i(t), o_i(t+1), a_i(t), r_i(t)\}$ is recorded in an experience replay pool RB, a mechanism that breaks the correlation of training data and thereby improves the reinforcement learning training process. The agent randomly samples a batch of experience tuples from RB and uses them to train and update the estimated Q network. In each iteration, the target Q network is used to compute, for each state-action pair $(o_i(t), a_i(t))$, a fixed target Q value $y_i(t)$, obtained from the maximum Q value over all actions in the next state $o_i(t+1)$ using the target network parameters $\mu'_i$:

$$y_i(t) = r_i(t) + \gamma \max_{a'} Q'_i(o_i(t+1), a'; \mu'_i)$$

where γ is the discount factor determining the importance of future rewards and $r_i(t)$ is the immediate reward for the state-action pair $(o_i(t), a_i(t))$. The loss function is:

$$Loss_i(t) = \left(y_i(t) - Q_i(o_i(t), a_i(t); \mu_i)\right)^2$$
The parameters $\mu_i$ of the estimated Q network are updated by minimizing, via stochastic gradient descent, the mean squared error between the estimated Q value and the target Q value; at the end of each training iteration, the parameters $\mu'_i$ of the target Q network are soft-updated from the estimated Q network:

$$\mu'_i \leftarrow \alpha \mu_i + (1-\alpha)\mu'_i$$

where α is the update rate. After each iteration the target Q network parameters thus slowly track the estimated Q network, and gradually the estimated Q network evaluates the agent's data packet routing decisions more accurately;
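The core update rules of this training loop (TD target, squared loss, soft target update) can be sketched with scalar parameters (an illustrative sketch with hypothetical function names; real Q networks operate on full parameter tensors):

```python
def td_target(r: float, next_q_values, gamma: float = 0.9,
              terminal: bool = False) -> float:
    # y_i(t) = r_i(t) + gamma * max_a' Q'_i(o_i(t+1), a'; mu'_i)
    return r if terminal else r + gamma * max(next_q_values)

def td_loss(y: float, q: float) -> float:
    # Loss_i(t) = (y_i(t) - Q_i(o_i(t), a_i(t); mu_i))^2
    return (y - q) ** 2

def soft_update(target_params, online_params, alpha: float = 0.01):
    # mu' <- alpha * mu + (1 - alpha) * mu', applied element-wise
    return [alpha * mu + (1.0 - alpha) * mu_t
            for mu, mu_t in zip(online_params, target_params)]
```

With a small α the target network moves only a fraction of the way toward the online network each iteration, which keeps the fixed target $y_i(t)$ stable between updates.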
a4: defining a data stream in a low orbit satellite constellation network distributed routing scene;
With the accelerating construction of new-generation low-orbit satellite broadband networks and the rapid growth in the number of users, satellite network communication traffic keeps increasing, and networks continue to develop toward high throughput and broadband; the throughput requirement of a single satellite exceeds hundreds of Gbps. When a MADRL algorithm is deployed on low-orbit satellites for distributed routing and forwarding, the inherent inference delay of the neural network model severely limits single-satellite throughput and greatly increases the packet loss rate, so the high-bandwidth, low-delay transmission requirements of new-generation low-orbit constellation broadband networks cannot be met; in order to fully optimize the MADRL-based distributed routing scheme, the invention proposes a data-flow-based routing scheme;
A data flow (Flow) refers to an ordered sequence of data packets with the same source node and destination node. In a low-orbit satellite broadband network scenario, the present invention focuses on inter-satellite routing of data packets. Thus, low-orbit satellite nodes are taken as the start and end points of a data flow, regardless of which ground user nodes transmit and receive the packets in the sequence. Since MADRL is a distributed algorithm, each low-orbit satellite node is an independent agent; after receiving a data packet, the satellite node must determine the data flow to which the packet belongs from the network port on which the packet was received and the destination address to which the packet is being sent. The definition of a data flow in the low-orbit satellite network scenario is therefore: an ordered set of packets received from the same ingress port on a low-orbit satellite and destined for the same satellite is defined as the same data flow.
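The flow definition above (same ingress port, same destination satellite) can be sketched as a classification key (an illustrative sketch; the packet field names are assumptions, not from the patent):

```python
from typing import NamedTuple

class FlowKey(NamedTuple):
    # Per the definition: packets arriving on the same ingress port and
    # destined for the same satellite belong to the same data flow.
    ingress_port: int
    dst_satellite: int

def classify(packet: dict) -> FlowKey:
    # Map a received packet onto its flow; all packets sharing this key
    # reuse a single routing decision.
    return FlowKey(packet["ingress_port"], packet["dst_satellite"])
```

Packets that differ only in payload map to the same key, while a packet arriving on a different port starts a different flow.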
A5: providing a data-flow-based routing policy sharing mechanism
The invention provides a flow routing method that effectively reduces the negative influence of deep neural network inference time on the routing performance of a low-orbit satellite network. The method organizes data packets into flows, considers factors such as the characteristics, delay requirements and bandwidth requirements of the data flows, and selects optimal routing paths and resource allocation strategies between low-orbit satellites through learning and decision-making on the agents, so as to optimize network routing performance to the greatest extent.
Considering that all data packets in the same data flow have the same destination address, each low-orbit satellite independently maintains a "flow routing table" for all data flows passing through it. As in the traditional MADRL packet routing method, the low-orbit satellite uses a deep neural network model to make the routing decision for the 1st data packet of a data flow. The resulting routing information is stored as a corresponding entry in the flow routing table and reused for subsequent data packets of the same data flow, eliminating the need to perform deep neural network model inference for those packets and significantly reducing the cumulative time spent on deep neural network model inference in the low-orbit satellite broadband network. The routing performance (including end-to-end transmission delay, packet loss rate, and network throughput) is thereby significantly optimized to meet the performance requirements of large-scale low-orbit satellite broadband networks.
A6, designing an adaptive flow route updating method based on delay jitter
The invention introduces an adaptive data flow route updating method, which automatically updates data flow routing decisions by monitoring the variation of inter-satellite packet transmission delay in real time. Specifically, the delay difference between two consecutive successfully transmitted data packets is computed and compared with a preset threshold. If the delay difference (delay jitter) exceeds the threshold, the multi-agent reinforcement learning algorithm is triggered to re-route the flow and update the routing decision. The implementation of this mechanism is completely model-free, requiring no complex network model. When a satellite finishes forwarding a data packet, it perceives and records the delay of that packet over this hop, takes the difference between the delays $Delay_{i+1}$ and $Delay_i$ of two consecutive data packets belonging to the same data flow, and computes the delay variation $\Delta Delay_{i+1}$:

$$\Delta Delay_{i+1} = |Delay_{i+1} - Delay_i|$$

Based on the delay jitter $\Delta Delay_{i+1}$, the applicability of the routing strategy stored in the routing table for this data flow under the current network state is judged. If $\Delta Delay_{i+1}$ exceeds a set threshold $\theta_{thr}$, the current routing path may have a performance problem or abnormal condition; the deep neural network model is then invoked again when forwarding the next data packet of this data flow, the routing strategy output by inference is executed to forward the packet, and the old strategy in the routing table is replaced, thereby completing both the forwarding of the data packet and the update of the data flow's routing strategy.
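The jitter-triggered update can be sketched as a per-flow monitor (an illustrative sketch; class and method names are assumptions):

```python
class FlowDelayMonitor:
    """Tracks per-hop delays of consecutive packets of one data flow and
    signals when delay jitter exceeds the threshold theta_thr."""

    def __init__(self, theta_thr: float):
        self.theta_thr = theta_thr
        self.last_delay = None

    def observe(self, delay: float) -> bool:
        # Return True when |Delay_{i+1} - Delay_i| > theta_thr, i.e. the
        # flow's cached route should be re-computed by the DNN model.
        trigger = (self.last_delay is not None
                   and abs(delay - self.last_delay) > self.theta_thr)
        self.last_delay = delay
        return trigger
```

Small fluctuations below $\theta_{thr}$ leave the cached route untouched; a jump in per-hop delay flags the flow for re-inference on its next packet.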
Compared with prior network routing technologies, the invention combines reinforcement learning with data-flow-based routing. It effectively reduces the number of deep neural network model inferences performed on data packets in a low-orbit satellite broadband network and significantly reduces the cumulative time spent on model inference, thereby effectively improving the routing performance of large-scale low-orbit satellite broadband networks and better meeting their network performance requirements.
Drawings
Fig. 1 is a schematic diagram of a fully distributed routing framework in a low-orbit constellation network in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a flow routing method based on multi-agent reinforcement learning in an example of the present invention;
FIG. 3 is a graph of end-to-end delay test results for different numbers of ground users in an example of the present invention;
fig. 4 is a graph of packet loss rate test results for different numbers of ground users in an example of the present invention;
fig. 5 is a graph of network throughput test results for different numbers of ground users in an example of the invention.
Description of the embodiments
The following is a detailed description of specific embodiments of the invention in connection with the accompanying drawings and specific examples. The following specific examples are given for the purpose of illustration only and are not intended to limit the scope of the invention. The specific implementation of the invention is divided into two stages: 1) stage one: generating data on a simulation platform to train the deep reinforcement learning model; 2) stage two: deploying the trained deep reinforcement learning model in a real system to execute routing decisions.
Stage one: training phase
Step 1: constructing low-orbit constellation broadband network distributed inter-satellite routing model
The invention adopts the classical Iridium constellation configuration as the target network topology: the low-orbit satellite network comprises $N_{orbit} = 6$ orbits, each containing $N_{Sat\_orbit} = 11$ evenly distributed low-orbit satellites. Specific parameter values for the Iridium network topology are shown in Table 1.
Table 1: Iridium network topology parameter values

| Parameter name | Symbol | Value |
| --- | --- | --- |
| Number of orbits | N_orbit | 6 |
| Number of satellites per orbit | N_Sat_orbit | 11 |
| Orbit height | h_orbit | 780 km |
| Satellite velocity | v_sat | 7.46 km/s |
| Longitude difference between co-rotating orbits | β | 31.6° |
| Longitude difference between counter-rotating orbits | α | 22° |
| Orbit semi-major axis | r_a | 7185 km |
| Orbit eccentricity | e | 0 |
| Argument of perigee | ω | — |
| Orbit inclination | i | 86.4° |
In this network, a low-orbit satellite is denoted $Sat_i$, $i \in \{1, 2, \ldots, Total\}$ ($Total$ represents the total number of LEO satellites). Each satellite may establish four Inter-Satellite Links (ISLs) for communication with the four low-orbit satellites adjacent to it. These links connect, respectively, the satellites on the upper and lower sides in the same orbit, and the satellites on the left and right sides in the two adjacent orbits. $link_{i,j}$ represents the inter-satellite link from $Sat_i$ to $Sat_j$, where i denotes the number of the transmitting satellite and j the number of the receiving satellite.
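The four-neighbour grid topology described above can be sketched as follows (an illustrative sketch using the Table 1 dimensions; the treatment of the outermost orbits as having no cross-seam link is a simplifying assumption, not a claim of the patent):

```python
N_ORBIT, N_SAT_ORBIT = 6, 11  # Iridium-like dimensions from Table 1

def neighbors(orbit: int, slot: int) -> dict:
    # Front/back neighbours wrap around the same orbital ring; left/right
    # are the same slot in the adjacent orbits. Edge orbits get no
    # cross-seam link in this simplified sketch.
    nbrs = {
        "front": (orbit, (slot + 1) % N_SAT_ORBIT),
        "back": (orbit, (slot - 1) % N_SAT_ORBIT),
    }
    if orbit > 0:
        nbrs["left"] = (orbit - 1, slot)
    if orbit < N_ORBIT - 1:
        nbrs["right"] = (orbit + 1, slot)
    return nbrs
```

Interior satellites thus have exactly four ISL neighbours, matching the action space $A_i$ of the MADRL agents.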
Step 2: constructing a ground user distribution model and generating a communication request according to a user behavior model
The present invention uses GPW4 (Gridded Population of the World, version 4) as the data source for building the ground user distribution model. GPW4 is a global gridded population dataset developed by the Center for International Earth Science Information Network (CIESIN) at Columbia University. The dataset provides gridded data on global population count, density and distribution, based on a variety of sources including national censuses, remote sensing data, and land-use data. The invention selects an appropriate resolution and data format to process the GPW4 dataset, then divides the ground into M contiguous regions with non-uniform user distribution according to the global population count and distribution information provided by GPW4; the user positions within each region are uniformly distributed, with probability density function

$$f(x) = \frac{1}{b-a}, \quad a \le x \le b$$

where a and b are the boundaries of the region.
The invention adopts a probabilistic statistical model to represent user communication requests over a period of time: all users are assumed to behave independently and periodically send data packets to their access satellite, and the interval between two adjacent tasks of a single user follows a negative exponential distribution with probability density function

$$f(t) = \lambda e^{-\lambda t}, \quad t \ge 0$$

where $1/\lambda$ is the expected value of a single user's request interval.
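Exponentially distributed inter-arrival times of this kind can be generated directly with the standard library (an illustrative sketch; the function name and seed are assumptions):

```python
import random

def request_times(n: int, mean_interval: float, seed: int = 0) -> list:
    # Inter-arrival times drawn from the negative exponential distribution
    # f(t) = lambda * exp(-lambda * t) with lambda = 1 / mean_interval,
    # yielding a Poisson stream of communication requests for one user.
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(1.0 / mean_interval)
        times.append(t)
    return times
```

Over many requests the empirical mean interval converges to the configured expectation, as the law of large numbers predicts.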
Step 3: the data packet is sent to a satellite, and the satellite acquires the local state information observed quantity
In a low-orbit satellite constellation network with large spatial scale, a centralized control node can hardly obtain the global network state in time to make real-time routing decisions; the satellites are therefore defined as mutually independent agents that make data packet routing decisions based only on local observation information. For each satellite, after receiving a data packet, its local observation state space is defined as $o_i = \{Dis_i, B_i, L^{fwd}_i, L^{dec}_i\}$, where $Dis_i$ contains the spatial distances from satellite $Sat_i$'s four adjacent satellites to the target satellite of the current data packet k. The present invention uses the existing SGP4 model to estimate the spatial positions of the adjacent satellites and the target satellite. $B_i$ is the available network bandwidth of the four inter-satellite links connecting $Sat_i$. $L^{fwd}_i$ is the current traffic load of $Sat_i$'s four forwarding queues, and $L^{dec}_i$ is the load of the decision queues on the four adjacent satellites. Since these elements have different value ranges, they are normalized using min-max normalization:

$$\hat{x} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
step 4: satellite relies on the local state observed quantity obtained in the step 3, utilizes a strategy network in a deep reinforcement learning model to select the optimal action, and executes the routing decision of the data packet
After each satellite agent receives a data packet and obtains its local observation, it must make a routing decision for the packet. The agent selects one action from the action space $A_i = \{a^{up}_i, a^{down}_i, a^{left}_i, a^{right}_i\}$ to route the data packet, where $a^{up}_i$, $a^{down}_i$, $a^{left}_i$ and $a^{right}_i$ each represent transferring the data packet to one of the four adjacent satellites as the next hop.
In the training stage, each routing-strategy selection by the agent falls into one of two cases, exploration or exploitation, traded off probabilistically with an ε-greedy algorithm: the agent explores randomly with probability ε and uses the current optimal strategy with probability 1−ε, which to some extent allows training samples to be collected more broadly.
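The ε-greedy trade-off can be sketched in a few lines (an illustrative sketch; the function name is an assumption):

```python
import random

def epsilon_greedy(q_values, epsilon: float, rng=None) -> int:
    # With probability epsilon explore (uniform random action index),
    # otherwise exploit the action with the highest estimated Q value.
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With ε = 0 the choice is purely greedy; with ε = 1 it is purely random, and training typically anneals ε between those extremes.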
Step 5: calculating low-orbit constellation broadband network node data packet routing delay
In this step, when a low-orbit satellite receives a data packet, it selects the next-hop satellite according to the routing decision obtained in step 4 and forwards the data packet to that satellite through an inter-satellite link. This routing process incurs a certain delay, comprising a decision delay and a forwarding delay. The decision delay refers to the time from receiving a data packet to making a routing decision, and the forwarding delay refers to the time from making the routing decision to the next-hop satellite receiving the data packet.
Specifically, for a data packet k routed over the low-orbit satellite broadband network, the decision delay d_dec^k comprises two parts: the decision queuing delay d_dq^k and the decision-making delay d_dm^k. The decision queuing delay refers to the time a packet waits for a routing decision in the satellite, while the decision-making delay refers to the time the satellite needs to compute the routing decision. In the packet forwarding process, the forwarding delay d_fwd^k likewise comprises several parts: the forwarding queuing delay d_fq^k, the transmission delay d_trans^k, and the propagation delay d_prop^k. The forwarding queuing delay refers to the time the packet waits to be forwarded in the satellite, the transmission delay refers to the time required for the packet to be transmitted over the inter-satellite link, and the propagation delay refers to the time the packet takes to travel along the inter-satellite link from one satellite to the other.
If bandwidth B_{i,j}^k is allocated on link_{i,j} for transmitting data packet k, the transmission delay on the link is calculated by the formula

d_trans^k = S_k / B_{i,j}^k

where S_k is the size of packet k. If link_{i,j} temporarily has no free bandwidth, packet k is buffered in link_{i,j}'s forwarding queue buffer, which introduces a forwarding queuing delay d_fq^k; when the buffer reaches maximum capacity, subsequent packets are discarded. On the other hand, assume that at time t the positions of Sat_i and Sat_j are (x_{i,t}, y_{i,t}, z_{i,t}) and (x_{j,t}, y_{j,t}, z_{j,t}). The spatial distance between these two satellites is calculated by the formula

D_{i,j} = sqrt((x_{i,t} − x_{j,t})² + (y_{i,t} − y_{j,t})² + (z_{i,t} − z_{j,t})²)
Given the propagation distance D_{i,j} of link_{i,j}, the signal propagation delay is calculated by the formula

d_prop = D_{i,j} / c

where c is the speed of light in vacuum.
In summary, the total delay of packet k at Sat_i is

d_total^k = d_dq^k + d_dm^k + d_fq^k + d_trans^k + d_prop^k

If the next-hop satellite is not the target node, the above procedure is performed again on the next-hop satellite.
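The per-hop delay components of this step can be computed as below; the function names are illustrative, and the formulas follow the definitions above (d_trans = S_k / B, d_prop = D / c, total delay as the sum of the five parts):

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in vacuum, m/s

def transmission_delay(packet_bits, bandwidth_bps):
    """d_trans = S_k / B_{i,j}: time to push the packet onto the link."""
    return packet_bits / bandwidth_bps

def propagation_delay(pos_i, pos_j):
    """Euclidean distance D_{i,j} between the two satellites, then D / c."""
    return math.dist(pos_i, pos_j) / C_LIGHT

def total_delay(d_dec_queue, d_dec_make, d_fwd_queue, d_trans, d_prop):
    """Per-hop total: decision delay (queuing + making) plus forwarding delay
    (queuing + transmission + propagation)."""
    return d_dec_queue + d_dec_make + d_fwd_queue + d_trans + d_prop
```

For example, an 8000-bit packet on a 1 Mbit/s link takes 8 ms of transmission delay, and one light-second of inter-satellite distance adds exactly 1 s of propagation delay.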
Step 6: calculating rewards value of intelligent agent for route decision
In this step, if the current data packet is forwarded to a neighboring satellite, a corresponding reward is given to the agent according to the delay calculated in step 5. The goal of each agent is to learn an optimal routing strategy that improves routing performance. To ensure that each agent (i.e., satellite) learns an optimal routing decision, the reward function of Sat_i for routing data packet k at time t is defined as

r_{i,k}(t) = −ψ if packet k is lost, and r_{i,k}(t) = −(κ_1·dis_{j,k} + κ_2·d_fwd^k + κ_3·d_dec^k) otherwise

where ψ is the penalty applied to the agent when the data packet is lost, dis_{j,k} represents the normalized spatial distance between the next-hop satellite Sat_j and the target satellite, d_fwd^k is the normalized forwarding delay of packet k, and d_dec^k is the normalized decision delay of routing packet k at Sat_j. κ_1, κ_2 and κ_3 are weights used to balance the above factors. The cumulative discounted reward is calculated as R_i = Σ_t γ^t·r_i(t), where γ ∈ [0,1] is the discount factor.
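A sketch of the reward computation, assuming the reward is the negative weighted sum of the listed factors with a fixed loss penalty ψ (the exact combination in the patent's figure is not reproduced, so the weights and structure here are assumptions); the cumulative discounted reward follows the stated Σ γ^t·r(t):

```python
def reward(dropped, dis_next_to_target, fwd_delay_norm, dec_delay_norm,
           psi=10.0, k1=1.0, k2=1.0, k3=1.0):
    """Per-hop reward: penalty -psi on packet loss, otherwise a negative
    weighted sum of distance-to-target and normalised delays."""
    if dropped:
        return -psi
    return -(k1 * dis_next_to_target + k2 * fwd_delay_norm + k3 * dec_delay_norm)

def discounted_return(rewards, gamma):
    """Cumulative discounted reward: sum over t of gamma**t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

Lower distance and delay yield a reward closer to zero, so maximizing return pushes the agent toward short, fast paths.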
Step 7: training a strategic network of reinforcement learning models for each agent
The multi-agent deep reinforcement learning algorithm is a method for optimizing each low-orbit satellite's packet routing strategy so as to maximize the overall cumulative discounted return. Each satellite has two deep neural networks: an estimated Q network Q_i(o_i, a_i; μ_i) and a target Q network Q_i′(o_i, a_i; μ_i′), parameterized by μ_i and μ_i′ respectively. At each decision time t, satellite Sat_i considers its local observation o_i(t) and selects action a_i(t) from action space A_i based on the ε-greedy policy: a_i(t) = argmax_a Q_i(o_i(t), a; μ_i) with probability 1−ε, and a random action from A_i with probability ε.
When the agent selects an action according to its current observation, it interacts with the environment: the current state transitions to the next state o_i(t+1), and Sat_i receives reward r_i(t). In this algorithm, the experience tuple {o_i(t), o_i(t+1), a_i(t), r_i(t)} is recorded into an experience replay pool RB. The agent then trains by randomly sampling a batch of experience tuples from RB and using them to update the parameters of the estimated Q network. In each iteration, the target Q network is used to calculate a fixed target Q value y_i(t) for each state-action pair (o_i(t), a_i(t)). The target Q value is calculated with the Bellman equation, taking the maximum Q value over all actions in the next state o_i(t+1) under the target network parameters μ_i′:

y_i(t) = r_i(t) + γ·max_{a′} Q_i′(o_i(t+1), a′; μ_i′)
Wherein γ is a discount factor that determines the importance of future rewards, and r_i(t) is the immediate reward for the state-action pair (o_i(t), a_i(t)). The loss function is defined as:
Loss i (t)=(y i (t)-Q i (o i (t),a i (t);μ i )) 2
The parameters of the estimated Q network are updated by stochastic gradient descent to minimize the mean square error between the estimated Q value and the target Q value. At the end of each training iteration, the parameters μ_i′ of the target Q network are soft-updated from the estimated Q network parameters μ_i:

μ_i′ ← α·μ_i + (1−α)·μ_i′

where α is the learning rate used for the soft update. As training proceeds, the Q network estimates the agents' packet routing decisions with increasing accuracy.
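The training machinery of this step (replay pool, Bellman target, soft target update) can be sketched as follows; this is a minimal illustration under the definitions above, not the patent's implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay pool RB storing {o, o', a, r} tuples."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)

    def push(self, obs, next_obs, action, reward):
        self.buf.append((obs, next_obs, action, reward))

    def sample(self, batch_size):
        return random.sample(self.buf, min(batch_size, len(self.buf)))

def td_target(r, next_q_values, gamma):
    """Bellman target: y = r + gamma * max_a' Q'(o', a'; mu')."""
    return r + gamma * max(next_q_values)

def soft_update(target_params, online_params, alpha):
    """mu' <- alpha * mu + (1 - alpha) * mu', applied element-wise."""
    return [alpha * p + (1 - alpha) * tp
            for p, tp in zip(online_params, target_params)]
```

The squared difference between `td_target(...)` and the estimated Q value gives the loss Loss_i(t) minimized by gradient descent.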
Stage two: execution phase
Step 1: deploying the deep reinforcement learning model with stage one training completion into a real low orbit satellite network
In this embodiment, the real Iridium constellation network is used as the application scenario for the execution stage, and the DRL model trained in stage one is deployed on each satellite of the Iridium constellation, forming a fully distributed low-orbit satellite network routing architecture for executing routing decisions. Specific parameter values for the Iridium-based network topology are shown in Table 2.
Table 2 Iridium network topology parameter values
In the Iridium-based network, the low-orbit satellites are denoted Sat_i, i ∈ {1, 2, …, total} (total represents the total number of LEO satellites). The present invention assumes that each satellite in the Iridium-based network can establish four inter-satellite links (ISLs) for communication with its neighboring satellites. These links connect to the satellites above and below in the same orbit, and to the satellites to the left and right in the two adjacent orbits, respectively. Each of the four links is denoted link_{i,j}, representing the link from Sat_i to Sat_j, where i denotes the number of the transmitting satellite and j the number of the receiving satellite. Meanwhile, the invention uses GPWv4 (Gridded Population of the World, version 4) to simulate the global distribution of real low-orbit satellite network users. GPWv4 is a global gridded population dataset developed by the Center for International Earth Science Information Network (CIESIN) at Columbia University. The dataset provides gridded data on global population count, density and distribution based on a variety of sources, including national censuses, remote sensing data and land use data. The invention selects a suitable resolution and data format to process the GPWv4 dataset, then divides the ground into M contiguous regions with uneven user distribution according to the global population count and distribution information provided by GPWv4; the user positions within each region are uniformly distributed.
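A minimal sketch of population-weighted user placement in the spirit of the GPW-based model: a region is chosen with probability proportional to its population count, then the user position is drawn uniformly inside that region (the region representation and function names are assumptions for illustration):

```python
import random

def sample_users(regions, num_users, rng=None):
    """Place ground users. Each region is (population, (lat0, lat1, lon0, lon1));
    regions are picked proportionally to population (e.g., GPW grid counts),
    and positions are uniform within the chosen region's bounding box."""
    rng = rng or random.Random()
    total_pop = float(sum(pop for pop, _ in regions))
    users = []
    for _ in range(num_users):
        r = rng.uniform(0.0, total_pop)
        acc = 0.0
        for pop, (lat0, lat1, lon0, lon1) in regions:
            acc += pop
            if r <= acc:  # cumulative-weight region selection
                users.append((rng.uniform(lat0, lat1), rng.uniform(lon0, lon1)))
                break
    return users
```

Regions with larger populations therefore receive proportionally more users, reproducing an uneven global distribution.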
When a low-orbit satellite (e.g., LEO #23 in fig. 3) receives a data packet, it will make a data stream routing decision using the deep reinforcement learning model trained in stage one and forward the data packet to the next hop satellite over the inter-satellite link. If the next-hop satellite is not the target node, the above procedure will be performed again on the next-hop satellite.
Step 2: stream level routing decision making based on data stream and trained MADRL model
In this step, to mitigate the negative effect of deep neural network inference time on network routing performance, the present invention proposes a flow routing method targeting flow-level optimization requirements. The method organizes data packets into flows, considers flow characteristics, delay requirements and bandwidth requirements, and selects the optimal routing path and inter-satellite resource allocation through agent learning and decision making, so as to satisfy user traffic demands to the greatest extent and optimize overall network performance.
Considering that all packets in a flow have the same destination address, each satellite is specified to independently maintain a flow routing table for all traffic flows passing through it. Similar to the traditional MADRL packet routing method, the MADRL flow routing method requires the low-orbit satellite to use the DNN model to make a routing decision only for the first packet in a traffic flow. The routing information is then stored as a corresponding entry in the satellite's flow routing table and is directly available to subsequent packets of the same flow. This eliminates the need for subsequent packets to run DNN model inference, significantly reducing the cumulative time spent on inference in low-orbit satellite broadband networks. As a result, routing performance metrics such as end-to-end transmission delay, packet loss and network throughput are expected to be significantly optimized, meeting the needs of large-scale low-orbit satellite broadband networks. The present invention illustrates the data-flow routing table by example: assuming N satellite nodes in a low-orbit constellation network, each satellite can only adopt four forwarding strategies (up, down, left, right), and the routing table maintained by the kth (k ∈ {1, 2, …, N}) satellite at a certain moment is shown in Table 3. For a data flow that has not yet been received, or whose target node is this satellite, the forwarding decision stored in the routing table is set to None; in all other cases, subsequently received packets are forwarded according to the existing strategy in the routing table.
Table 3 Example of a data-flow routing table
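The flow-table mechanism described above (DNN inference only for the first packet of a flow, None meaning no cached decision) can be sketched as:

```python
class FlowRoutingTable:
    """Per-satellite flow routing table: the first packet of a flow triggers
    one policy-network inference; later packets of the same flow reuse the
    cached next-hop action."""
    def __init__(self, infer_next_hop):
        self.table = {}              # flow_id -> action in {'up','down','left','right'} or None
        self.infer = infer_next_hop  # stands for the trained DNN policy (assumption)

    def route(self, flow_id, observation):
        if self.table.get(flow_id) is None:   # None = no decision cached yet
            self.table[flow_id] = self.infer(observation)
        return self.table[flow_id]
```

Only the first call per flow pays the inference cost; every later packet is a dictionary lookup.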
Step 3: adaptive update of flow routes based on delay jitter
In this step, an adaptive flow routing update method is introduced to monitor and adjust inter-satellite packet transmission delay in real time, thereby ensuring the routing performance of the network. The method computes the delay difference between two successive successful transmissions of packets of the same flow and compares it with a preset threshold. If the delay difference exceeds the threshold, the multi-agent reinforcement learning algorithm is triggered to recompute the route.
The present invention therefore proposes a delay-jitter-based adaptive policy update mechanism that is entirely model-free, i.e., implemented independently of any complex network model. When a satellite finishes forwarding a data packet, it senses and records the per-hop transmission delay of that packet, and computes the delay jitter ΔDelay_{i+1} as the difference between the delays Delay_{i+1} and Delay_i of two consecutive packets belonging to the same data flow:

ΔDelay_{i+1} = |Delay_{i+1} − Delay_i|

The delay jitter ΔDelay_{i+1} is then used to judge whether the routing strategy stored for this data flow in the routing table is still applicable under the current network state. If ΔDelay_{i+1} exceeds a set threshold θ_thr, the current routing path may have a performance problem or an abnormal condition; when the next packet of this flow is forwarded, the deep neural network model is used to infer a routing strategy, the packet is forwarded accordingly, and the old strategy in the routing table is replaced, completing both the packet forwarding and the flow's policy update.
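The jitter-triggered update check can be sketched as follows; θ_thr and the per-flow delay bookkeeping follow the description above (class and attribute names are illustrative):

```python
class JitterMonitor:
    """Trigger a fresh DNN routing inference for a flow when the per-hop
    delay jitter |Delay_{i+1} - Delay_i| exceeds the threshold theta_thr."""
    def __init__(self, theta_thr):
        self.theta = theta_thr
        self.last_delay = {}   # flow_id -> previously observed per-hop delay

    def needs_update(self, flow_id, delay):
        prev = self.last_delay.get(flow_id)
        self.last_delay[flow_id] = delay
        if prev is None:       # first packet of the flow: no jitter yet
            return False
        return abs(delay - prev) > self.theta
```

A satellite would call `needs_update` after each forwarded packet and, on True, replace the flow's cached routing entry with a new model inference.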
Step 4: developing low orbit satellite stream level routing strategy performance assessment
Performance evaluation is carried out on the proposed low-orbit satellite network flow routing strategy; the evaluation indicators of interest are end-to-end delay, packet loss rate and throughput. The baseline algorithms against which the proposed strategy is compared include:
1) OSPF (Open Shortest Path First): periodically computes the shortest paths from the current router to the different target nodes and stores them in the routing table. When a packet arrives at the router, the routing decision is made by a routing table lookup, so the routing decision delay is negligible.
2) ELB (Offloading to Access Satellite): a load-balancing algorithm that effectively avoids unbalanced load in the satellite network. Since its routing table is calculated in a similar manner to the OSPF algorithm, its routing decision delay is likewise negligible.
3) MADRL-packet: one satellite would need to make DNN model inferences for each data packet, which would introduce a non-negligible cumulative decision delay.
The invention tests algorithm performance by varying the number of users of the low-orbit satellite constellation network; the experimental results for the three performance indicators of end-to-end delay, packet loss rate and throughput are shown in Fig. 3, Fig. 4 and Fig. 5. Notably, the proposed MADRL-flow routing method consistently outperforms all baseline algorithms on all three metrics. Furthermore, the invention investigates the performance of the proposed MADRL-flow routing method under three different θ_thr settings. As shown in Fig. 5, among the three threshold configurations θ_thr = 0.001, 0.005 and 0.008, MADRL-flow (0.005) is superior to MADRL-flow (0.001) and MADRL-flow (0.008). The intuitive reason is that when θ_thr is too large, updates to the flow routing table may lag behind the real-time network state; conversely, too small a θ_thr value increases the demand for DNN model inference, degrading the performance of the flow routing method.
The invention further studies how much the proposed MADRL-flow routing method can reduce the average decision delay per data packet. The evaluation results are shown in Table 4. Since routing table lookup is typically near-instantaneous, the average decision delays of OSPF and ELB can be ignored. When the number of ground users is 27000, the average decision delay of the MADRL-packet method reaches 42.9 milliseconds, accounting for more than half of the total end-to-end delay, whereas with θ_thr set to 0.008 the MADRL-flow method reduces the average decision delay to around 1 millisecond. In addition, the proportion of decision delay under the MADRL-flow routing method is also greatly reduced: when θ_thr is 0.008, it is only about 2% of the total end-to-end delay.
Table 4 Decision delay test results for different numbers of ground users in the embodiment of the present invention

Claims (2)

1. A low orbit satellite network flow routing method based on multi-agent reinforcement learning comprises a training stage and an execution stage;
the training phase comprises:
step A1: constructing a low-orbit constellation broadband network distributed inter-satellite routing model;
the low orbit satellite network comprises N orbit Tracks, each track having N Sat_orbit Uniformly distributed low-orbit satellites, denoted as Sat i I epsilon {1,2, …, total }, total representing the total number of LEO satellites, each satellite may establish four inter-satellite links for communication with its neighboring four low-orbit satellites. These links are respectively connected to the satellites on the upper and lower sides of the same orbit, and to the satellites on the left and right sides of the adjacent two orbits. By link i,j To represent Sat i To Sat j Wherein i represents the number of the satellite at the transmitting end and j represents the number of the satellite at the receiving end;
step A2: constructing a ground user distribution model, and generating a communication request according to a user behavior model;
dividing the ground into M continuous areas with uneven user distribution, wherein the user positions in each area are uniformly distributed; setting all user behaviors to independently and periodically send data packets to an access satellite;
step A3: the data packet is sent to a satellite, and the satellite acquires the local state information observed quantity;
defining the satellites as mutually independent agents that determine packet routing decisions based on local observations; for each satellite Sat_i, after receiving data packet k, its local observation state space is defined as o_i^k = {dis_i^k, B_i, load_i^fwd, load_i^dec}, wherein dis_i^k is the spatial distances from Sat_i's four adjacent satellites to the target satellite of the current data packet k, B_i represents the available bandwidth of the four inter-satellite links connected to Sat_i, load_i^fwd is the current traffic load of Sat_i's four forwarding queues, and load_i^dec is the load of the decision queues on the four adjacent satellites; normalizing the element values in step A3;
step A4: c, the satellite relies on the local state observed quantity obtained in the step A3, a strategy network in the deep reinforcement learning model is utilized to select the optimal action, and the routing decision of the data packet is executed;
after each satellite agent receives a data packet and obtains its local observations, it makes a routing decision for the packet; the agent selects one action from the action space A_i = {a_up, a_down, a_left, a_right} to route the packet, wherein a_up, a_down, a_left and a_right each represent transferring the data packet to one of the four adjacent satellites as the next hop;
step A5: calculating the routing delay of the low-orbit constellation broadband network node data packet;
when a low orbit satellite receives a data packet, it will select the next hop satellite to process the data packet through the routing decision obtained in step A4, and forward the data packet to the next hop satellite through the inter-satellite link; this routing process requires a certain time delay, including decision delays and forwarding delays; decision delay refers to the time delay from receiving a data packet to making a routing decision, while forwarding delay refers to the time delay from making a routing decision to receiving a data packet by the next hop satellite;
the decision delay d_dec^k of a data packet k routed over the low-orbit satellite broadband network comprises two parts: the decision queuing delay d_dq^k and the decision-making delay d_dm^k; the decision queuing delay refers to the time the packet waits for a routing decision in the satellite, while the decision-making delay refers to the time the satellite needs to make the routing decision; in the packet forwarding process, the forwarding delay d_fwd^k likewise comprises several parts: the forwarding queuing delay d_fq^k, the transmission delay d_trans^k and the propagation delay d_prop^k; the forwarding queuing delay refers to the time the packet waits to be forwarded in the satellite, the transmission delay refers to the time required for the packet to be transmitted over the inter-satellite link, and the propagation delay refers to the time required for the packet to travel from one satellite to another along the inter-satellite link;
if bandwidth B_{i,j}^k is allocated on link_{i,j} for transmitting data packet k, the transmission delay on the link is calculated by the formula d_trans^k = S_k / B_{i,j}^k, wherein S_k is the size of data packet k; if link_{i,j} temporarily has no free bandwidth, packet k is buffered in link_{i,j}'s forwarding queue buffer, which introduces a forwarding queuing delay d_fq^k; when the buffer reaches maximum capacity, subsequent data packets are discarded; on the other hand, assuming that at time t the positions of Sat_i and Sat_j are (x_{i,t}, y_{i,t}, z_{i,t}) and (x_{j,t}, y_{j,t}, z_{j,t}), the spatial distance between these two satellites is calculated by the formula D_{i,j} = sqrt((x_{i,t} − x_{j,t})² + (y_{i,t} − y_{j,t})² + (z_{i,t} − z_{j,t})²);
given the propagation distance D_{i,j} of link_{i,j}, the signal propagation delay is calculated by the formula d_prop = D_{i,j} / c, wherein c is the speed of light in vacuum;
in summary, the total delay of packet k at Sat_i is d_total^k = d_dq^k + d_dm^k + d_fq^k + d_trans^k + d_prop^k; if the next-hop satellite is not the target node, the above procedure is performed again on the next-hop satellite;
step A6: calculating a reward value of the intelligent agent for routing decision;
if the current data packet is forwarded to a neighboring satellite, a corresponding reward is given to the agent according to the delay calculated in step A5; the goal of each agent is to learn an optimal routing strategy that improves routing performance; to ensure that each agent learns an optimal routing decision, the reward function of Sat_i for routing data packet k at time t is defined as r_{i,k}(t) = −ψ if the packet is lost, and r_{i,k}(t) = −(κ_1·dis_{j,k} + κ_2·d_fwd^k + κ_3·d_dec^k) otherwise; wherein ψ is the penalty applied to the agent when the data packet is lost, dis_{j,k} represents the normalized spatial distance between the next-hop satellite Sat_j and the target satellite, d_fwd^k is the normalized forwarding delay of packet k, d_dec^k is the normalized decision delay of routing packet k at Sat_j, and κ_1, κ_2 and κ_3 are weights for balancing the above factors; the cumulative discounted reward is calculated as R_i = Σ_t γ^t·r_i(t), with γ ∈ [0,1] representing the discount factor;
step A7: training the policy network of each agent's reinforcement learning model;
each satellite contains two deep neural networks: an estimated Q network Q_i(o_i, a_i; μ_i) and a target Q network Q′_i(o_i, a_i; μ′_i), parameterized by μ_i and μ′_i respectively; at each decision time t, satellite Sat_i considers its local observation o_i(t) and selects action a_i(t) from action space A_i based on the ε-greedy policy; when the agent selects an action according to the current observation, it interacts with the environment, the current state transitions to the next state o_i(t+1), and Sat_i receives reward r_i(t); the experience tuple {o_i(t), o_i(t+1), a_i(t), r_i(t)} is recorded into an experience replay pool RB, from which the agent randomly extracts a batch of experience tuples and uses them to update the parameter values of the estimated Q network; in each iteration the target Q network is used to calculate a fixed target Q value y_i(t) for each state-action pair (o_i(t), a_i(t)), taking the maximum Q value over all actions in the next state o_i(t+1) under the target network parameters μ′_i, wherein y_i(t) is calculated as y_i(t) = r_i(t) + γ·max_{a′} Q′_i(o_i(t+1), a′; μ′_i);
wherein γ is a discount factor for determining the importance of future rewards, and r_i(t) is the immediate reward for the state-action pair (o_i(t), a_i(t)); the loss function is:
Loss i (t)=(y i (t)-Q i (o i (t),a i (t);μ i )) 2
the parameter values of the estimated Q network are updated by stochastic gradient descent to minimize the mean square error between the estimated Q value and the target Q value; at the end of each training iteration, the target Q network parameters μ′_i are soft-updated from the estimated Q network parameters μ_i as μ′_i ← α·μ_i + (1−α)·μ′_i, wherein α is the learning rate used for the soft update;
the execution phase comprises:
step B1: deploying the deep reinforcement learning model completed in the training stage into a real low-orbit satellite network;
when a low-orbit satellite receives a data packet, it makes a data-flow routing decision using the trained deep reinforcement learning model and forwards the packet to the next-hop satellite over the inter-satellite link; if the next-hop satellite is not the target node, the above procedure is performed again on the next-hop satellite;
step B2: making a flow-level routing decision based on the data flow and the trained MADRL model;
the data packets are organized into streams, the characteristics, the time delay requirement and the bandwidth requirement of the streams are considered, and the optimal routing path and resource allocation among satellites are selected through the learning and decision of an intelligent agent so as to meet the requirement of user traffic to the greatest extent and optimize the overall performance of the network;
providing that each satellite independently maintains a flow routing table for all traffic flows passing through it, considering that all packets in the flow have the same destination address; the low orbit satellite uses DNN model to make route decision for the first data packet in the traffic flow; the routing information is then stored as a corresponding entry in the satellite flow routing table and can be directly used for subsequent packets in the same data flow;
step B3: adaptively updating the flow route based on the delay jitter;
when the satellite finishes forwarding a data packet, the agent senses and records the per-hop transmission delay of the packet, and calculates the delay jitter ΔDelay_{i+1} as the difference between the delays Delay_{i+1} and Delay_i of two consecutive data packets belonging to the same data flow:
ΔDelay i+1 =|Delay i+1 -Delay i |
then, based on the delay jitter ΔDelay_{i+1}, judging the applicability, under the current network state, of the routing strategy stored for the data flow in the routing table; if ΔDelay_{i+1} is higher than the set threshold θ_thr, the current routing path may have a performance problem or an abnormal condition; when the next data packet of the data flow is forwarded, the deep neural network model is used to infer a routing strategy, the packet is forwarded accordingly, and the old strategy in the routing table is replaced, thereby completing the packet forwarding and the policy update for the data flow.
2. The low-orbit satellite network flow routing method based on multi-agent reinforcement learning according to claim 1, wherein in step A4 of the training stage, the agent's selection of a routing strategy is divided into two cases, exploration and exploitation, each time: the agent performs random exploration with probability ε and exploits the current optimal strategy with probability 1−ε.
CN202311071886.4A 2023-08-24 2023-08-24 Low-orbit satellite network flow routing method based on multi-agent reinforcement learning Pending CN117041129A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311071886.4A CN117041129A (en) 2023-08-24 2023-08-24 Low-orbit satellite network flow routing method based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311071886.4A CN117041129A (en) 2023-08-24 2023-08-24 Low-orbit satellite network flow routing method based on multi-agent reinforcement learning

Publications (1)

Publication Number Publication Date
CN117041129A true CN117041129A (en) 2023-11-10

Family

ID=88641005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311071886.4A Pending CN117041129A (en) 2023-08-24 2023-08-24 Low-orbit satellite network flow routing method based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN117041129A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117395188A (en) * 2023-12-07 2024-01-12 南京信息工程大学 Deep reinforcement learning-based heaven-earth integrated load balancing routing method
CN117395188B (en) * 2023-12-07 2024-03-12 南京信息工程大学 Deep reinforcement learning-based heaven-earth integrated load balancing routing method
CN117692052A (en) * 2024-02-04 2024-03-12 北京邮电大学 Access selection method and device for multiple ground users in low-orbit satellite network
CN117692052B (en) * 2024-02-04 2024-04-19 北京邮电大学 Access selection method and device for multiple ground users in low-orbit satellite network
CN117939520A (en) * 2024-03-22 2024-04-26 银河航天(西安)科技有限公司 Satellite link-based adaptation degree determining method, device and storage medium
CN117939520B (en) * 2024-03-22 2024-05-24 银河航天(西安)科技有限公司 Satellite link-based adaptation degree determining method, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination