CN116916409A - Decision generation method for DQN-assisted low-orbit satellite switching - Google Patents


Info

Publication number
CN116916409A
Authority
CN
China
Prior art keywords
switching
user
satellite
time
decision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311053153.8A
Other languages
Chinese (zh)
Inventor
赵耀忠
袁金祥
陈泓睿
张波
张集
郑安
李国鹏
张安萍
房圆武
陈豫蓉
王晓阳
刘强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaneng Yimin Coal and Electricity Co Ltd
Original Assignee
Huaneng Yimin Coal and Electricity Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaneng Yimin Coal and Electricity Co Ltd filed Critical Huaneng Yimin Coal and Electricity Co Ltd
Priority to CN202311053153.8A priority Critical patent/CN116916409A/en
Publication of CN116916409A publication Critical patent/CN116916409A/en
Pending legal-status Critical Current


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W40/00Communication routing or communication path finding
    • H04W40/02Communication route or path selection, e.g. power-based or shortest path routing
    • H04W40/18Communication route or path selection, e.g. power-based or shortest path routing based on predicted events
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B17/00Monitoring; Testing
    • H04B17/30Monitoring; Testing of propagation channels
    • H04B17/391Modelling the propagation channel
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/14Relay systems
    • H04B7/15Active relay systems
    • H04B7/185Space-based or airborne stations; Stations for satellite systems
    • H04B7/1851Systems using a satellite or space-based relay
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/14Relay systems
    • H04B7/15Active relay systems
    • H04B7/185Space-based or airborne stations; Stations for satellite systems
    • H04B7/1851Systems using a satellite or space-based relay
    • H04B7/18519Operations control, administration or maintenance
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/14Relay systems
    • H04B7/15Active relay systems
    • H04B7/185Space-based or airborne stations; Stations for satellite systems
    • H04B7/18521Systems of inter linked satellites, i.e. inter satellite service
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/142Network analysis or design using statistical or mathematical methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/12Shortest path evaluation
    • H04L45/124Shortest path evaluation using a combination of metrics
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Astronomy & Astrophysics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Mathematical Optimization (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Analysis (AREA)
  • Electromagnetism (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Algebra (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Radio Relay Systems (AREA)

Abstract

A decision generation method for DQN-assisted low-orbit satellite switching relates to the field of communications and comprises the following steps: modeling the communication system in which the communication link resides during low-orbit satellite switching; analyzing the feasibility of DQN-assisted low-orbit satellite switching; constructing the state space, action space, and return function of a deep reinforcement learning algorithm; and training the constructed deep reinforcement learning model on data, then using the trained model to execute the low-orbit satellite switching target decision task. The invention realizes the following function: during switching, if the user is covered by several low-orbit satellites simultaneously, the network improves the QoS of the post-switching communication link according to the user's communication environment while reducing the number of satellite switchings as much as possible.

Description

Decision generation method for DQN-assisted low-orbit satellite switching
Technical Field
The invention relates to the technical field of communications, and in particular to a decision generation method for DQN-assisted low-orbit satellite switching.
Background
3GPP (3rd Generation Partnership Project) initiated work in Rel-15 to provide network connectivity for user equipment via satellites and aerial platforms, and continued in Rel-16 with studies to adapt 5G NR to support Non-Terrestrial Networks (NTN). Based on the results of Rel-15 and Rel-16, 3GPP decided to start a work item on NTN in Rel-17, the first work item to develop space technology as part of the 5G NR ecosystem. With the specification discussions in Rel-18, NTN is regarded as a long-term research topic.
Because satellites can provide global coverage, satellite communication networks combined with terrestrial communication networks are regarded as an effective solution for achieving seamless global 6G network coverage and ultra-large-scale connectivity. In the NTN scenario, geosynchronous-orbit satellites have very strong coverage capability, and three of them can achieve global coverage; however, owing to their high orbital altitude, they suffer from large communication delay and severe link attenuation.
In recent years, low-orbit satellite communication has received extensive attention because of its low orbital altitude (500-1500 km), small communication delay, and small link attenuation, and many companies such as Starlink and OneWeb have begun deploying their own low-orbit satellite constellations. However, because of its high mobility, a low-orbit satellite covers a given area for only 5-10 minutes, which means that low-orbit satellite communication faces the problem of frequent switching. Mobility enhancements for NTN have been widely discussed in the 3GPP Rel-18 RAN2 working group, where the triggering of low-orbit satellite communication switching involves the satellite switching problem. The 3GPP working group found that, within a low-orbit satellite's coverage area, the signal strength of the communication link does not vary as markedly as in a terrestrial cellular network, so inter-satellite switching of the communication link may not be triggered in time and the link may drop. To solve this problem, the 3GPP working group proposed and adopted a switching decision based on satellite service time and a switching decision based on positioning. In the service-time-based switching decision, the low-orbit satellite periodically broadcasts to users in its coverage area the remaining time for which it can provide service, and the user judges from this remaining time whether switching is needed and selects a target satellite. In the positioning-based switching decision, the user obtains its own position through GPS positioning, obtains the current satellite position from the broadcast ephemeris, judges whether switching is needed by computing the relative position between user and satellite, and selects a target satellite.
The two schemes proposed by the 3GPP working group define only the minimum requirements that must be met when the communication link is switched during low-orbit satellite communication; that is, a switching decision must satisfy these requirements regardless of the user's own demands. In fact, a user who switches the communication link according to these two minimum requirements alone may face poor communication quality.
On the basis of the minimum switching requirements specified by 3GPP, related studies have proposed an optimization strategy for low-orbit satellite communication switching intended to improve the user's Quality of Service (QoS) during low-orbit satellite communication. The scheme jointly considers the bandwidth resources of the communication link that the target satellite can provide at switching and the satellite visible time, and the user selects the target satellite according to these two factors affecting link service quality, combined with its own communication environment. In this scheme, the switching selection problem is modeled as a path selection problem on a weighted directed graph (Wang F, Jiang D, Wang Z, et al. Seamless Handover in LEO Based Non-Terrestrial Networks: Service Continuity and Optimization [J]. IEEE Transactions on Communications, 2022, 71(2): 1008-1023.): the vertices of the graph are target satellites, a directed edge represents the switching process from a source satellite to a target satellite, and the edge weight is composed of the weights of two parameters, bandwidth resources and satellite visible time. The selection of the switching decision in satellite communication is thus converted into selecting the optimal path of the weighted directed graph, and the switching decision for the target satellite is finally obtained by solving for the shortest path. The scheme converts the target-satellite selection problem into a shortest-path problem in graph theory; however, it requires the user to first model the weighted directed graph during switching and then execute a shortest-path search strategy to obtain the target-satellite decision.
Because of the high mobility of low-orbit satellites, each satellite can serve a user for only about 5-10 minutes, which means that every 5-10 minutes the user must re-model the weighted directed graph according to the current communication environment and search for the shortest path again. This generates a large amount of signaling overhead for low-orbit satellite communication, and the signaling resource overhead keeps growing as the number of users in the network increases, placing a heavy burden on the satellite communication system. In addition, the above scheme does not consider that the communication link changes with time during satellite communication: after modeling the weighted directed graph and finding the shortest path, the user executes switching decisions in a time period T according to that path, but within T the bandwidth resources a satellite can provide change as users join the network, so the switching decisions corresponding to the found shortest path may fail to guarantee communication QoS in real time.
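For comparison, the graph-based scheme described above can be sketched as a shortest-path search over a small weighted directed graph. The node names, edge weights, and the use of Dijkstra's algorithm below are illustrative assumptions, not details taken from the cited work:

```python
# Hedged sketch of the graph-theory baseline: vertices are candidate target
# satellites per switching epoch, edge weights combine bandwidth and visible
# time into one cost (all numbers illustrative). Dijkstra's algorithm finds
# the minimum-cost switching path.
import heapq

def dijkstra(graph, src, dst):
    """Shortest path by cumulative edge weight; graph: node -> {neighbor: cost}."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[dst]

# epoch-expanded graph: src -> first-epoch targets -> second-epoch targets -> dst
graph = {
    "src":     {"LEO_1@1": 0.3, "LEO_3@1": 0.5},
    "LEO_1@1": {"LEO_2@2": 0.4, "LEO_3@2": 0.1},
    "LEO_3@1": {"LEO_2@2": 0.2, "LEO_3@2": 0.6},
    "LEO_2@2": {"dst": 0.0},
    "LEO_3@2": {"dst": 0.0},
}
path, cost = dijkstra(graph, "src", "dst")
```

The sketch also makes the drawback visible: the whole `graph` dictionary must be rebuilt and re-searched every time the environment changes.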
Disclosure of Invention
The invention aims to provide a decision generation method for DQN-assisted low-orbit satellite switching.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the invention discloses a decision generation method for DQN-assisted low-orbit satellite switching, which comprises the following steps:
step S1: modeling a communication system in which a communication link is positioned in a low-orbit satellite communication switching process;
step S2: analyzing the feasibility of the DQN assisted low orbit satellite switching;
step S3: constructing a state space, an action space and a return function in a deep reinforcement learning algorithm;
step S4: training the constructed deep reinforcement learning model on data, and using the trained model to execute the low-orbit satellite switching target decision task.
Further, the specific operation flow of step S1 is as follows:
step S1.1: determining an index affecting the QoS of the user communication;
the indices comprise: the remaining service time the satellite can provide the user at the switching trigger time, the channel capacity the satellite provides the user at the switching trigger time, and the satellite's load state at the switching trigger time;
step S1.2: determining a switching triggering condition;
The switching trigger conditions are as follows: the remaining service time the current serving satellite can provide is 0; or the user's data transmission rate requirement exceeds the channel capacity the current serving satellite can provide;
step S1.3: modeling the problem;
Under the condition that the switching trigger condition is not met, consider the optimal switching decision within a time period T during user communication. The quality of each switching decision is quantified by the following formula:

r_t^s = w_1 * (t_s^rem(t) / t^max) + w_2 * beta_s(t) + w_3 * eta_n(t)

where r_t^s denotes the return obtained when the user selects satellite LEO_s as the switching target at time t; w_1, w_2 and w_3 are all weight factors, with w_1 + w_2 + w_3 = 1 and 0 <= w_1, w_2, w_3 <= 1, adjusted according to user demands.

The indices affecting the user communication QoS are normalized: t_s^rem(t) / t^max denotes the ratio of the remaining service time satellite LEO_s can provide user UE_n to the maximum coverage time, and the larger the ratio, the longer LEO_s can provide communication service; beta_s(t) denotes the load margin of satellite LEO_s when switching is triggered; eta_n(t) denotes the degree to which the user's data transmission rate requirement is satisfied after switching completes. The binary variable x_t^s = 1 indicates that at time t the user selects satellite LEO_s as the switching target, and x_t^s = 0 that it does not; the constraint sum_s x_t^s = 1 indicates that the user can select only one satellite as the access target at time t.

Suppose the user needs to perform tau switchings within the period T, and use A = {a_1, a_2, ..., a_tau} to record the satellites selected at the tau switchings. The problem to be solved is the switching target decision that reduces the number of switchings as much as possible while meeting the user communication QoS indices; it is modeled in the following form:

max_A sum_{i=1}^{tau} r_{t_i}^{a_i} * x_{t_i}^{a_i}, subject to sum_s x_{t_i}^s = 1, i = 1, ..., tau

where {t_1, ..., t_tau} within T defines the user's switching decision evaluation times; r_{t_i}^{a_i} denotes the return obtained when the user selects satellite a_i at time t_i, and x_{t_i}^{a_i} = 1 indicates that at time t_i the user selects satellite a_i as the switching target.
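The weighted return above can be sketched in code. The function and variable names below (handover_reward, load_margin, rate_satisfaction) and all numbers are illustrative assumptions, not from the patent text:

```python
# Hypothetical sketch of the per-decision return of step S1.3: a weighted sum
# of normalized remaining service time, load margin, and rate satisfaction.

def handover_reward(remaining_time, max_cover_time, load_margin,
                    rate_satisfaction, w=(0.4, 0.3, 0.3)):
    """Weighted return for choosing one candidate satellite as switching target."""
    w1, w2, w3 = w
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9  # w_1 + w_2 + w_3 = 1
    return (w1 * remaining_time / max_cover_time   # normalized remaining time
            + w2 * load_margin                     # load margin in [0, 1]
            + w3 * rate_satisfaction)              # fraction of rate demand met

# the candidate with the largest return is preferred as the switching target
candidates = {
    "LEO_1": handover_reward(300, 600, 0.8, 1.0),
    "LEO_3": handover_reward(550, 600, 0.2, 0.9),
}
best = max(candidates, key=candidates.get)
```

Adjusting the weight tuple `w` corresponds to the patent's statement that the weight factors are tuned according to user demands.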
Further, the specific operation flow of step S2 is as follows:
step S2.1: reinforcement learning analysis;
step S2.2: q-charging algorithm analysis;
step S2.3: DQN analysis of deep reinforcement learning algorithms.
Further, the specific operation flow of step S2.1 is as follows:
The communication environment is mapped to the environment in reinforcement learning: the satellite remaining service time, satellite load state, and user transmission rate requirement are constructed into the state space; the selection of the target satellite is constructed into the action space; and the optimization objective is constructed into the return. The optimization problem is thus converted into solving for the optimal switching decision in a reinforcement learning algorithm. Meanwhile, the optimal switching decision solved by the reinforcement learning algorithm corresponds to an action sequence, which matches the switching target decision problem of a user within a period T considered in the invention. The optimal switching decision is obtained by maximizing the cumulative return, so when the user switches using the policy obtained by the reinforcement learning algorithm, the executed policy is the decision that maximizes the return obtained over the future period T, meeting the objective of the optimization problem. At the same time, function approximation is used in reinforcement learning to directly fit the state value function or the action value function.
Further, the specific operation flow of step S2.2 is as follows:
The Q-learning algorithm essentially solves a Bellman optimality equation:

Q(s, a) = E[ r_{t+1} + gamma * max_{a'} Q(S_{t+1}, a') | S_t = s, A_t = a ]

where S_t = s denotes that the state at time t is s, A_t = a denotes that action a is executed at time t, S_{t+1} denotes the state at the next time, r_{t+1} denotes the return at the next time, gamma is the discount factor, and Q(s, a) denotes the optimal action value obtained by executing action a in state s under the optimal switching decision. The Bellman optimality equation is solved with an iterative algorithm:

Q_{k+1}(s_t, a_t) = Q_k(s_t, a_t) + alpha_k(s_t, a_t) * [ r_{t+1} + gamma * max_a Q_k(s_{t+1}, a) - Q_k(s_t, a_t) ]

where alpha_k(s_t, a_t) denotes the step-size parameter of the k-th iteration, which satisfies the following conditions:

sum_k alpha_k(s_t, a_t) = infinity and sum_k alpha_k(s_t, a_t)^2 < infinity.

The iterative algorithm requires a set of data {(s_t, a_t, r_t, s_{t+1})}_t; running it over a large amount of data yields the optimal estimate of Q(s, a), i.e., the optimal switching decision to be solved by the invention.
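A minimal sketch of one such iteration, assuming a toy state/action set and an illustrative step size (all names and numbers below are assumptions, not from the patent):

```python
# Tabular Q-learning update toward the Bellman optimality equation:
# Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]
from collections import defaultdict

GAMMA = 0.9  # discount factor

def q_update(Q, s, a, r, s_next, actions, alpha):
    """One iteration of the update rule on a single transition (s, a, r, s')."""
    td_target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

Q = defaultdict(float)                 # Q-values default to 0
actions = ["LEO_1", "LEO_3"]
# replay one transition: in this toy state, choosing LEO_1 returned 0.74
q_update(Q, "covered_by_both", "LEO_1", 0.74, "covered_by_both", actions, alpha=0.5)
```

Running this update over a large replay set of transitions is what the paragraph above describes as obtaining the optimal estimate of Q(s, a).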
Further, in step S2.2, tabular Q-learning is limited by the size of the state space, so the problem is solved by combining a function approximator with the Q-learning algorithm. The idea of combining the approximating function with the Q-learning algorithm is expressed by the following formula:

Q(S, A) ~ q(S, A; w)

where Q(S, A) denotes the objective function, q(S, A; w) denotes the approximating function, S is the state variable, A is the action variable, and w is the parameter to be optimized.

The approximating function is constructed by solving the optimization equation

J(w) = E[ (Q(S, A) - q(S, A; w))^2 ]

with a stochastic gradient descent algorithm. Since the true Q(S, A) is unknown, it is replaced by the bootstrapped target r_{t+1} + gamma * max_a q(s_{t+1}, a; w), which gives the update

w_{k+1} = w_k + alpha_k * [ r_{t+1} + gamma * max_a q(s_{t+1}, a; w_k) - q(s_t, a_t; w_k) ] * grad_w q(s_t, a_t; w_k)

where w_k and w_{k+1} are the values of the optimization parameter w at step k and step k+1, alpha_k denotes the step size at step k, grad_w denotes the gradient of J(w) with respect to w, and q(s_t, a_t; w_k) is the value function, containing the optimization parameter w_k, of executing action a_t in state s_t.
Further, the specific operation flow of step S2.3 is as follows:
The DQN essentially solves the following optimization problem:

min_w J(w) = E[ ( R_t + gamma * max_a q(S_{t+1}, a; w_T) - q(S_t, A_t; w) )^2 ]

where (S_t, A_t, R_t, S_{t+1}) are random variables, and the target value is

y = R_t + gamma * max_a q(S_{t+1}, a; w_T).

Treating the parameter w inside y as a constant w_T generates two neural networks simultaneously: a main network q(s, a; w) and a target network q(s, a; w_T). The optimization parameter of the main network is updated continuously as data are input, while the parameter w_T of the target network is updated to the current value of w only after a certain number of updates. After many training iterations, the target network corresponding to the optimization parameter w is obtained. The target network approximately fits the optimal action value function Q(s, a) of the Q-learning algorithm, and given a state s it outputs the action a that maximizes Q(s, a), i.e., the optimal switching decision to be solved by the invention.
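The main/target-network mechanism can be sketched with a linear approximator standing in for the deep network, so that the frozen target and the periodic synchronization stay visible; every name and number below is an illustrative assumption:

```python
# Sketch of DQN's two-network training loop: the target y is computed with
# frozen parameters w_T, and w_T is synchronized to w every SYNC_EVERY steps.
GAMMA, ALPHA, SYNC_EVERY = 0.9, 0.1, 4

def q_hat(w, phi):
    # linear stand-in for the deep network: q(s, a; w) = w . phi(s, a)
    return sum(wi * fi for wi, fi in zip(w, phi))

def dqn_step(w_main, w_target, batch, features, actions):
    """One gradient step on J(w) = E[(y - q(s,a;w))^2], y from the frozen target net."""
    for (s, a, r, s_next) in batch:
        y = r + GAMMA * max(q_hat(w_target, features[(s_next, a2)]) for a2 in actions)
        err = y - q_hat(w_main, features[(s, a)])
        w_main = [wi + ALPHA * err * fi for wi, fi in zip(w_main, features[(s, a)])]
    return w_main

features = {("s0", "LEO_1"): [1.0, 0.0], ("s0", "LEO_3"): [0.0, 1.0],
            ("s1", "LEO_1"): [1.0, 0.0], ("s1", "LEO_3"): [0.0, 1.0]}
actions = ["LEO_1", "LEO_3"]
w_main, w_target = [0.0, 0.0], [0.0, 0.0]
for step in range(8):
    w_main = dqn_step(w_main, w_target, [("s0", "LEO_1", 0.74, "s1")], features, actions)
    if (step + 1) % SYNC_EVERY == 0:   # periodically copy main -> target
        w_target = list(w_main)
```

A real implementation would use a deep network and an experience replay buffer; the structure of the update, however, is the same.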
Further, the specific operation flow of step S3 is as follows:
The user may undergo multiple switchings within the period T, and the state space changes after each switching. The period T is divided into zeta equal-length time slots of 60 seconds each, expressed as T = [(t_1, t_2), (t_2, t_3), ..., (t_{zeta-1}, t_zeta)]. The state space is modeled as

s_{t_i} = { C_{t_i}, T_{t_i}^rem, R_{t_i}, L_{t_i}, B_{t_i} }

where C_{t_i} denotes the satellites covering the user at time t_i; T_{t_i}^rem denotes the respective remaining service times of the satellites that can cover the user at time t_i; R_{t_i} denotes the user's transmission rate requirement at time t_i; L_{t_i} denotes the load states of the satellites covering the user at time t_i; and B_{t_i} denotes the channel capacity each satellite can allocate to the user at time t_i. The action space is A = {LEO_1, ..., LEO_S}, and a_{t_i} = LEO_s indicates that at time t_i the user selects satellite LEO_s as the switching target. The return function is defined as

r_{t_i}^s = w_1 * (t_s^rem(t_i) / t^max) + w_2 * beta_s(t_i) + w_3 * eta_n(t_i)

where w_1, w_2 and w_3 are all weight factors, with w_1 + w_2 + w_3 = 1 and 0 <= w_1, w_2, w_3 <= 1, adjusted according to user demands.

Therefore, the optimal switching decision problem to be solved by the invention is converted into the following optimization problem:

max_{a_{t_1}, ..., a_{t_zeta}} sum_{i=1}^{zeta} r_{t_i}^{a_{t_i}}, subject to sum_s x_{t_i}^s = 1, i = 1, ..., zeta.
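A minimal sketch of how the per-slot state might be packaged, assuming illustrative field names and units for the five components listed above:

```python
# Hypothetical packaging of the state at slot t_i for step S3; the patent
# lists covering satellites, their remaining service times, the user's rate
# demand, satellite load states, and allocatable channel capacity.

def build_state(covering, remaining, rate_demand, loads, capacities):
    """Assemble one slot's state; all per-satellite dicts must share keys."""
    assert set(covering) == set(remaining) == set(loads) == set(capacities)
    return {
        "satellites": sorted(covering),        # C_{t_i}
        "remaining_service_time": remaining,   # T_{t_i}^rem, seconds per satellite
        "rate_demand": rate_demand,            # R_{t_i}, user's required rate (Mbps)
        "load_state": loads,                   # L_{t_i}, normalized load per satellite
        "capacity": capacities,                # B_{t_i}, allocatable capacity (Mbps)
    }

state = build_state({"LEO_1", "LEO_3"},
                    {"LEO_1": 300, "LEO_3": 550},
                    rate_demand=20,
                    loads={"LEO_1": 0.2, "LEO_3": 0.8},
                    capacities={"LEO_1": 50, "LEO_3": 12})
```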
further, the specific operation flow of step S4 is as follows:
Fitting the Q(s, a) function with a deep neural network, the user inputs the state space at time t into the deep neural network model; the model outputs the action value functions corresponding to the multiple actions, and the action sequence that maximizes the cumulative return is selected as the policy pi* to assist the user, starting from time t, in executing the optimal switching decisions during the period [t, t+T].
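This decision step can be sketched as follows, with `q_values` standing in for the trained network's outputs (all names and numbers are illustrative assumptions):

```python
# Greedy selection of the per-slot switching action from action values:
# the argmax-Q action in each slot forms the decision sequence for [t, t+T].

def plan_switch_sequence(states, q_values, actions):
    """Pick the argmax-Q action for each slot's state."""
    return [max(actions, key=lambda a: q_values[(s, a)]) for s in states]

# stand-in for the trained model's output Q(s, a) on two slots
q_values = {("slot1", "LEO_1"): 0.74, ("slot1", "LEO_3"): 0.70,
            ("slot2", "LEO_1"): 0.10, ("slot2", "LEO_3"): 0.65}
policy = plan_switch_sequence(["slot1", "slot2"], q_values, ["LEO_1", "LEO_3"])
# the policy keeps LEO_1 in slot1, then switches to LEO_3 in slot2
```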
The beneficial effects of the invention are as follows:
when using a large scale constellation of low orbit satellites to provide communication services to the ground, the high mobility of the low orbit satellites may result in the communication link being required to undergo frequent handovers between satellites. In the switching process, if the user is covered by a plurality of low-orbit satellites at the same time, the network needs to improve the QoS of the communication link after switching according to the communication environment where the user is located under the condition that the satellite switching times are reduced as much as possible. In this regard, the present invention proposes a decision generation method for DQN (deep Q-network) assisted low-orbit satellite handoff, which is used to solve the problem of handoff target selection faced by low-orbit satellite communications.
The decision generation method for DQN-assisted low-orbit satellite switching considers the switching decision problem over a period of time during low-orbit satellite communication and exploits the advantages of DQN to convert it into solving for the optimal DQN switching decision; solving for the optimal switching decision achieves the goal of reducing the number of switchings as much as possible while guaranteeing user communication quality. Meanwhile, a feasibility analysis is performed for applying DQN to low-orbit satellite switching decisions. The method can adjust the switching decision as the environment state changes, which the graph-theory-based low-orbit satellite switching decision cannot do.
Drawings
Fig. 1 is a flow chart of the decision generation method for DQN-assisted low-orbit satellite switching according to the invention.
Fig. 2 is a schematic diagram of a low-orbit satellite communication scenario.
FIG. 3 is a schematic diagram of reinforcement learning.
Fig. 4 is a schematic diagram of the decision generation method for DQN-assisted low-orbit satellite switching according to the invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, during normal communication in a low-orbit satellite communication scenario, the user's channel state changes as the low-orbit satellite moves. For example, when the user approaches the edge of a satellite's coverage, the channel state is worse than at the coverage center, and communication may even fail once the user leaves the coverage of the current serving satellite, so timely inter-satellite switching is necessary to guarantee the user's communication quality. The invention provides a decision generation method for DQN-assisted low-orbit satellite switching for this problem, and further proposes a switching trigger condition that considers communication quality on the basis of the 3GPP specifications: switching is triggered when the user's communication quality degrades enough to meet the set trigger condition, whereupon a switching target is selected. For the switching target selection problem, the invention uses a deep reinforcement learning algorithm to assist in selecting the target: after the user triggers switching, it sends the current channel state information to the ground station; the ground station inputs the received user channel state information and the low-orbit satellite constellation channel state information into a trained deep reinforcement learning model, which outputs a set of switching decisions for a period T; finally, the ground station sends the switching decisions to the user, and the user selects the switching target and performs the switching operation so that communication proceeds normally.
Specifically, consider a typical scenario in which a low-orbit satellite constellation comprising S low-orbit satellites serves terrestrial users, denoted S = {LEO_1, LEO_2, ..., LEO_S}; the ground area contains N users that need low-orbit satellites to provide communication services, denoted N = {UE_1, UE_2, ..., UE_N}. As shown in fig. 2, assume that at least two low-orbit satellites can provide communication services to a user simultaneously. As the low-orbit satellites move, the satellite serving user UE_2 in the current period is LEO_2; since LEO_2 is gradually leaving the ground area, a switching between low-orbit satellites is required. Which of LEO_1 and LEO_3 should serve user UE_2 in the next period, so that the user achieves better QoS over the future period T, is the problem to be solved by the invention.
Referring to fig. 1 and fig. 4, the decision generation method for DQN-assisted low-orbit satellite switching according to the invention mainly comprises the following steps:
After a user accesses the low-orbit satellite communication network, it receives the channel state information periodically broadcast by the current serving satellite, including the satellite position and satellite load state in the current period. The user computes the remaining service time of the current serving satellite from GPS positioning and the satellite's position information, and computes the channel capacity the current serving satellite can provide from the satellite load state and the user's environment information. When the switching trigger condition is met, that is, when the user's data transmission rate requirement exceeds the channel capacity the current serving satellite can provide, or the remaining service time of the current serving satellite equals zero, switching is triggered. After triggering switching, the user sends its data transmission rate requirement and position information to the ground station as a switching request. After receiving the switching request, the ground station determines, from the user's position information combined with the ephemeris, the set of satellites that can cover the user within the period T, together with each satellite's remaining service time and load state; it packages the covering satellite set, the remaining service times, the load states, and the user's data transmission rate requirement into a state space variable and sends it to the data processing center. After receiving the state space variable, the data processing center inputs it into the trained deep reinforcement learning model, which computes the switching decisions that maximize the cumulative return within the period T; the data processing center feeds the computed switching decisions back to the ground station, and the user selects the switching target according to the received switching decisions.
The specific operation flow of the decision generation method for DQN-assisted low-orbit satellite switching is as follows:
step S1: modeling a communication system in which a communication link is positioned in a low-orbit satellite communication switching process;
step S1.1: determining an index affecting the QoS of the user communication;
To decide the handover target in a low-orbit satellite handover scenario, the factors that influence the choice of handover target must first be determined. Because satellites broadcast the ephemeris of the low-orbit satellite constellation and the load information of the current serving satellite to ground users, the invention considers, from the standpoint of improving the user communication QoS indexes, the following influencing factors:
1. Remaining service time the satellite can provide to the user at the handover trigger time
The longer the remaining service time of the satellite selected as the handover target, the fewer inter-satellite handovers are experienced while the communication link is maintained. To determine whether the user is covered by a satellite at time t, let c_s^t ∈ {0, 1} represent the state of the user being covered by satellite LEO_s at time t, where c_s^t = 1 means the user lies within the coverage area of LEO_s.

Given the minimum elevation angle defined for low-orbit satellite communication, let λ represent the maximum time a single low-orbit satellite can serve a ground user, and let l_s^t ∈ [0, λ] indicate the remaining service time during which the user is covered by the low-orbit satellite at time t; in particular, l_s^t = 0 indicates that the user is not within the coverage area of satellite LEO_s at time t.
2. Channel capacity that satellite can provide to user at handoff trigger time
To avoid interference between signals within the same coverage area, the user communication link uses orthogonal frequency division multiplexing (Orthogonal Frequency Division Multiplexing, OFDM). From the Shannon formula R = B·log2(1 + CNR), where CNR is the carrier-to-noise ratio (Carrier-to-Noise Ratio, CNR), the channel capacity (unit: Mbps) a satellite can provide to a user is determined by the allocated bandwidth and the link carrier-to-noise ratio. Considering factors such as atmospheric attenuation, the Ku band (12-18 GHz) is used as the low-orbit satellite communication working band. To simplify the model, each satellite is assumed to be allocated the same band resource B (unit: Hz), and the multiple users served by the same low-orbit satellite share that satellite's band resource. At time t, the band resource of satellite LEO_s is divided among the n users within its coverage area:

B = Σ_{i=1}^{n} b_{s,i}^t

where b_{s,i}^t denotes the band resource allocated to user i by satellite s at time t.
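As a minimal sketch of the equal-sharing assumption combined with the Shannon formula R = B·log2(1 + CNR) (the function name and the numeric values are illustrative assumptions, not values from the invention):

```python
import math

def per_user_capacity(total_bandwidth_hz: float, num_users: int, cnr_db: float) -> float:
    """Channel capacity (bit/s) one user receives when a satellite's band is
    shared equally among its served users, per R = B * log2(1 + CNR)."""
    b_user = total_bandwidth_hz / num_users   # equal-share band allocation
    cnr_linear = 10 ** (cnr_db / 10)          # convert dB to a linear ratio
    return b_user * math.log2(1 + cnr_linear)

# Example: a 250 MHz carrier shared by 50 users at 10 dB carrier-to-noise
rate = per_user_capacity(250e6, 50, 10.0)
```

With these assumed numbers each user gets 5 MHz of band, so the per-user rate is on the order of 17 Mbps.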
To calculate the channel capacity allocated to the user, the carrier-to-noise ratio CNR must be computed. Under the condition that the user transmit power is uniform and the satellite receive antenna gain is identical, the main factor influencing the CNR is the set of attenuations experienced along the communication link. Here PL denotes the attenuation (unit: dB) experienced during transmission of the communication link, consisting essentially of the atmospheric absorption attenuation PL_α and the link loss PL_β caused by free-space propagation. PL can thus be expressed as:

PL = PL_α + PL_β
According to section 6.6.4 of the 3GPP document (Study on New Radio (NR) to Support Non-Terrestrial Networks, document TR 38.811, 3GPP, (Release 15), 2020.), the atmospheric attenuation PL_α can be expressed as:

PL_α = L_zenith(f_c) / sin(θ_{n,s}^t)

where f_c denotes the carrier frequency, θ_{n,s}^t denotes the elevation angle of user UE_n toward satellite LEO_s, and L_zenith(f_c) denotes the zenith attenuation, which varies with the carrier frequency and with altitude and environment on Earth; the main factor affecting this value is the absorption resonance lines of oxygen and water vapor. Its reference values under different weather conditions are described in (Attenuation by Atmospheric Gases, ITU-R, 2016) [Online]. Available: https://www.itu.int/dms_pubrec/itu-r/rec/p/R-REC-P.676-11-201609-S!!PDF-E.pdf, and are typically less than 10 dB.
The link loss PL_β caused by free-space propagation can be expressed as:

PL_β = 20·log10( 4π·f_c·d_{n,s}^t / c )

where c is the speed of light and d_{n,s}^t is the distance between user UE_n and satellite LEO_s at time t, which can be expressed as:

d_{n,s}^t = sqrt( R_e^2·sin^2(θ_{n,s}^t) + h_s^2 + 2·R_e·h_s ) − R_e·sin(θ_{n,s}^t)

where R_e denotes the Earth radius and h_s denotes the orbital altitude of satellite LEO_s.
Based on the channel link attenuation modeled above, the carrier-to-noise ratio CNR (unit: dB) can be calculated as:

CNR = P_1 + G_1 + G_2 − PL − N

where P_1 denotes the ground station transmit antenna power (unit: dBW), G_1 the gain of the ground station transmit antenna (unit: dB), G_2 the gain of the space station receive antenna (unit: dB), and N the equivalent noise power, specifically:

N = 10·log10( k·T_n·B )

where k is the Boltzmann constant, with a value of approximately 1.380649×10^-23 J/K, and T_n denotes the noise temperature (unit: K).
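The dB-domain link budget CNR = P_1 + G_1 + G_2 − PL − N, with N = 10·log10(k·T_n·B), can be sketched as follows (all numeric values in the example are illustrative assumptions):

```python
import math

BOLTZMANN = 1.380649e-23  # J/K

def equivalent_noise_dbw(t_noise_k: float, bandwidth_hz: float) -> float:
    """Equivalent noise power N = 10*log10(k * T_n * B), in dBW."""
    return 10 * math.log10(BOLTZMANN * t_noise_k * bandwidth_hz)

def carrier_to_noise_db(p_tx_dbw: float, g_tx_db: float, g_rx_db: float,
                        pl_db: float, t_noise_k: float, bandwidth_hz: float) -> float:
    """CNR (dB) = P1 + G1 + G2 - PL - N, every term in the dB domain."""
    return p_tx_dbw + g_tx_db + g_rx_db - pl_db - equivalent_noise_dbw(t_noise_k, bandwidth_hz)

# Illustrative link: 10 dBW transmit power, 40 dB / 30 dB antenna gains,
# 180 dB total path loss, 290 K noise temperature, 250 MHz bandwidth
cnr = carrier_to_noise_db(10.0, 40.0, 30.0, 180.0, 290.0, 250e6)
```

With these assumed figures the noise power is roughly −120 dBW, giving a CNR of about 20 dB.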
After the user obtains m_s, the number of users served by satellite LEO_s at time t, the channel capacity that can be allocated after the handover can be calculated. To simplify the model, m_s is assumed to remain unchanged during the period T. Let r_s^t = (B/m_s)·log2(1 + CNR) denote the channel capacity that satellite LEO_s can allocate to the user at time t, and let {r_1^t, ..., r_S^t} denote the channel capacities that the S satellites able to cover the user can allocate to the user within the future period T.

Let φ_t denote the user's data transmission rate requirement at time t. When φ_t > r_s^t, the ground user's data transmission rate requirement exceeds the communication resources the current serving satellite can provide, and a handover needs to be triggered.
3. Load state of the satellite at the handover trigger time
The satellite load state is defined by the following calculation expression:

η_s^t = (1/m_s) · Σ_{i=1}^{m_s} ( φ_i^t / r_{s,i}^t )

where η_s^t is the satellite load state, φ_i^t denotes the data transmission rate requirement of user UE_i at time t (i = 1, 2, ..., m_s), and r_{s,i}^t denotes the channel capacity that satellite LEO_s provides to user UE_i at time t; here φ_i^t ≤ r_{s,i}^t is required so that the communication link remains properly connected. The ratio of each user's data transmission rate requirement at time t to the channel capacity it is offered represents the load that user places on satellite LEO_s; averaging these ratios yields the load state of LEO_s at time t.
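A minimal sketch of this load-state computation, the average over a satellite's users of the demand-to-offered-capacity ratio (names and values are illustrative):

```python
def satellite_load_state(demands_mbps, offered_mbps):
    """Load state: mean of each served user's (rate demand / offered capacity).
    Every ratio must be <= 1, i.e. each connected link's demand is met."""
    ratios = [d / c for d, c in zip(demands_mbps, offered_mbps)]
    if any(r > 1.0 for r in ratios):
        raise ValueError("a user's demand exceeds its offered capacity")
    return sum(ratios) / len(ratios)

# Two users demanding 2 and 4 Mbps on links offering 4 and 8 Mbps
load = satellite_load_state([2.0, 4.0], [4.0, 8.0])
```

Here both ratios are 0.5, so the satellite's load state is 0.5.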
To represent the load states of the S candidate satellites at time t and simplify the model, it is assumed that a satellite's load state does not change within the period T; that is, the data received by the ground station is the value each satellite in the low-orbit constellation computes from the load-state expression at time t, and this value remains unchanged during the period T of model training.
The three indexes above (the remaining service time the satellite can provide the user at the handover trigger time, the channel capacity the satellite can provide the user at the handover trigger time, and the satellite's load state at the handover trigger time) can be used to solve the handover target decision problem of reducing the number of handovers as much as possible while meeting the user communication QoS requirements.
Step S1.2: determining a switching triggering condition;
In the invention, each satellite broadcasts its number of served users, its load state and its current position to ground users and the ground station with period ε (unit: s). From the received broadcast information and GPS positioning, the user can calculate the satellite's remaining service time; at the same time, from its own environment and the satellite's number of served users, the user can calculate the channel capacity the satellite provides to it. The user can therefore judge whether an inter-satellite handover needs to be triggered according to whether the channel capacity meets its own data transmission rate requirement and whether the satellite's remaining service time approaches 0; if a handover must be triggered, the trigger time is recorded as t.
Thus, two handover trigger conditions can be summarized:
1. The coverage area of the current serving satellite is about to leave the user's location, i.e. the remaining service time the current serving satellite can provide equals 0;

2. The ground user's data transmission rate requirement exceeds the channel capacity the current serving satellite can provide, i.e. φ_t > r_s^t, where φ_t is the user's rate requirement and r_s^t is the channel capacity the serving satellite can allocate at time t.
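The two trigger conditions of step S1.2 reduce to a simple predicate, sketched here with hypothetical parameter names:

```python
def handover_triggered(remaining_service_s: float,
                       demand_mbps: float, capacity_mbps: float) -> bool:
    """True when the serving satellite's remaining service time is exhausted,
    or the user's rate demand exceeds the capacity the satellite can provide."""
    return remaining_service_s <= 0.0 or demand_mbps > capacity_mbps
```

Either condition alone suffices; a user with time left and an affordable demand keeps its current link.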
Step S1.3: modeling the problem;
To solve the handover target decision problem of reducing the number of handovers as much as possible while meeting the user communication QoS requirements, the user communication QoS indexes are first guaranteed, and the number of user handovers is then reduced as far as possible, i.e. the satellite with the largest remaining service time is selected as the handover target. This choice handles a single inter-satellite handover of a given user well, but viewed over a period T, the current choice of handover target influences the later choices, and considering the handover decisions across a whole period T of the user's communication better matches the actual situation and is therefore worth studying. The invention therefore considers the optimal decision for the user's multiple handovers within the period T, and the problem is modeled as follows:
In the case where the handover trigger condition is not met, i.e. the remaining service time is positive and the offered channel capacity meets the rate requirement, the optimal handover decision within a period T of the user's communication is considered, and the merit of each handover decision is quantified with the following formula:

R_s^t = x_s^t · ( w_1·(l_s^t/λ) + w_2·(1 − η_s^t) + w_3·min(1, r_s^t/φ_t) )

where R_s^t denotes the return obtained when the user selects satellite LEO_s as the handover target at time t; w_1, w_2 and w_3 are weight factors, with w_1 + w_2 + w_3 = 1 and 0 ≤ w_1, w_2, w_3 ≤ 1, adjustable according to user requirements. To weigh the three indexes affecting the user communication QoS equally, each is normalized: l_s^t/λ is the ratio of the remaining service time satellite LEO_s can provide user UE_n to the maximum coverage time λ, and the larger this ratio, the longer LEO_s can provide communication service, so selecting by this term correspondingly reduces the number of handovers; 1 − η_s^t is the load margin of satellite LEO_s when the handover is triggered, which is already normalized and needs no further processing; min(1, r_s^t/φ_t) indicates the degree to which the user's data transmission rate requirement φ_t is satisfied after the handover completes; x_s^t = 1 indicates that the user selects satellite LEO_s as the handover target at time t, and x_s^t = 0 that it does not; the constraint Σ_s x_s^t = 1 indicates that the user can select only one satellite as the access target at time t.
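Assuming the three normalized terms read off the description above (remaining-time ratio, load margin, capped rate-satisfaction degree), the per-decision return can be sketched as follows; the exact closed form of the patent's return is not fully recoverable from the text, so this function and its weights are an inferred illustration:

```python
def handover_return(remaining_s: float, max_cover_s: float, load: float,
                    demand_mbps: float, capacity_mbps: float,
                    w1: float = 0.4, w2: float = 0.3, w3: float = 0.3) -> float:
    """Weighted sum of three normalized QoS terms; w1 + w2 + w3 must equal 1."""
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9
    time_ratio  = remaining_s / max_cover_s            # remaining service time / max coverage time
    load_margin = 1.0 - load                           # already normalized by construction
    rate_degree = min(1.0, capacity_mbps / demand_mbps)  # demand-satisfaction degree, capped at 1
    return w1 * time_ratio + w2 * load_margin + w3 * rate_degree

# 300 s left of a 600 s pass, load 0.5, 20 Mbps offered vs 10 Mbps demanded
r = handover_return(300.0, 600.0, 0.5, 10.0, 20.0)
```

With the assumed weights this evaluates to 0.4·0.5 + 0.3·0.5 + 0.3·1.0 = 0.65.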
Suppose the user needs to perform τ handovers within the period T, and record the satellites selected for the τ handovers as s_1, s_2, ..., s_τ, selected at times t_1, t_2, ..., t_τ. The problem solved by the invention, the handover target decision problem of reducing the number of handovers as much as possible while meeting the user communication QoS requirements, can then be modeled in the following form:

maximize Σ_{i=1}^{τ} R_{s_i}^{t_i}, subject to t_i ∈ [t, t+T]

where [t, t+T] defines the user's handover decision evaluation time, and R_{s_i}^{t_i} denotes the return obtained when the user selects satellite LEO_{s_i} as the handover target at time t_i.
In the prior art, solving the above problem with graph-theoretic decision making requires a large amount of computation at the ground station. The invention instead solves it with a deep reinforcement learning algorithm: the user and the ground station only need to send the communication link information to the data processing center, which generates the corresponding handover decision from a trained deep reinforcement learning model. Model training and handover decision making are executed in different stages, which ensures the feasibility of the invention.
Step S2: the feasibility analysis of the DQN assisted low-orbit satellite switching;
At present, most existing research considering user handover decisions within a period T is based on graph theory: the target satellite of each handover serves as a vertex of the graph, a directed edge represents a handover, the overhead of executing the handover decision is the edge weight, and the optimal handover decision within T is obtained by searching for the shortest path. However, the computation required for the handover decision grows with the number of users. The principles and advantages of the deep reinforcement learning algorithm are therefore described below to provide a feasibility analysis of applying it to the low-orbit satellite handover decision problem within the period T.
Step S2.1: reinforcement learning analysis;
Reinforcement learning (reinforcement learning, RL) studies how an agent can maximize the reward it obtains in a complex, uncertain environment. As shown in fig. 3, reinforcement learning consists of two parts: the agent and the environment. During reinforcement learning, the agent interacts continuously with the environment. After the agent observes a state s ∈ S of the environment, it outputs an action a ∈ A according to its policy π. The environment executes this action and, based on the action taken by the agent, outputs the next state together with the reward r ∈ R resulting from the current action. The goal of the agent is to obtain as much reward from the environment as possible.
If the communication environment of the invention is mapped to the environment of reinforcement learning, with the satellite remaining service time, satellite load state and user transmission rate requirement constructed into the state space, the selection of the target satellite into the action space, and the decision merit into the return, then the optimization problem converts naturally into the optimal handover decision problem solved by a reinforcement learning algorithm. Meanwhile, the optimal handover decision solved by reinforcement learning corresponds to an action sequence, which matches the user's handover target decision problem within the period T considered in the invention. The optimal handover decision is obtained by maximizing cumulative return, meaning that when the user hands over using the policy obtained by the reinforcement learning algorithm, the executed policy maximizes the return obtained over the future period T, which also matches the objective of the optimization problem.
However, directly applying a reinforcement learning algorithm to the above optimization problem still poses a difficulty: conventional reinforcement learning stores the state value function or action value function in a table, which is severely limited. For example, the low-orbit satellite communication scenario constructed in the invention has an infinite number of states, in which case the value function cannot be stored in a table. The most effective current solution is value function approximation, directly fitting the state value function or action value function, which reduces the storage requirement, strengthens the generalization capability of the model, and effectively handles a continuous state space.
Reinforcement learning algorithms suit complex, uncertain environments in which conventional rules and predefined algorithms may struggle to adapt. By interacting with the environment, a reinforcement learning algorithm can autonomously learn and optimize its strategy, making decisions in a constantly changing environment. It also has learning capability: it accumulates experience from the actual trial-and-error process and improves its decision strategy through learning, which makes it excel at tasks requiring continuous optimization and adaptation.
In consideration of the two advantages of the reinforcement learning algorithm, the reinforcement learning algorithm is applied to the problem of selecting the low-orbit satellite communication switching target, so that the problems of high real-time computing resource cost and poor environment change adaptability faced by the existing related research can be well solved. Therefore, the invention proposes to use reinforcement learning algorithm, and on the basis of reducing the number of times of switching between satellites, the QoS of user communication is improved as much as possible.
Step S2.2: q-charging algorithm analysis;
Before introducing the deep Q network (Deep Q Network, DQN) with its function-fitting property, the Q-learning algorithm must be introduced. The Q-learning algorithm essentially solves the Bellman optimality equation:

Q(s, a) = E[ r_{t+1} + γ·max_{a'} Q(s_{t+1}, a') | s_t = s, a_t = a ]

where s_t = s denotes the state at time t, a_t = a denotes that action a is executed at time t, s_{t+1} denotes the state at the next time, r_{t+1} denotes the reward at the next time, γ is the discount factor, and Q(s, a) denotes the optimal action value function obtained by executing action a in state s under the optimal handover decision. Solving the Bellman optimality equation requires an iterative calculation:

Q_{k+1}(s_t, a_t) = Q_k(s_t, a_t) + α_k(s_t, a_t)·[ r_{t+1} + γ·max_a Q_k(s_{t+1}, a) − Q_k(s_t, a_t) ]

where α_k(s_t, a_t) denotes the step-size parameter of the k-th iteration, which affects the convergence of the iterative algorithm. For the iterative algorithm to converge, the parameter must satisfy:

Σ_k α_k(s_t, a_t) = ∞ and Σ_k α_k(s_t, a_t)^2 < ∞
Running the iterative algorithm also requires a set of data {(s_t, a_t, r_t, s_{t+1})}_t. The invention uses the Q-learning algorithm to obtain the optimized optimal estimate of Q(s, a): given a state s, it outputs the action a that maximizes Q(s, a), i.e. the optimal handover decision to be solved by the invention.
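The iterative calculation above is a one-line tabular Q-learning step; a dict-backed sketch (illustrative, not the patent's implementation):

```python
def q_learning_step(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    to a dictionary-backed Q table keyed by (state, action)."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)

q = {}
q_learning_step(q, "s0", 0, 1.0, "s1", actions=[0, 1])
```

On a first visit with reward 1.0 and an all-zero table, the entry for ("s0", 0) moves from 0 to alpha·1.0 = 0.1.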
In summary, solving the optimal handover decision with the Q-learning algorithm requires the state space to be finite. To handle the case where the state space is continuous, i.e. infinite, the idea of a fitting function must be combined with the Q-learning algorithm.
Specifically, combining the approximate fitting function with the Q-learning algorithm can be expressed by the following formula:

Q(S, A) ≈ q̂(S, A; w)

where Q(S, A) represents the objective function and q̂(S, A; w) represents the approximate fitting function, with S the state variable, A the action variable, and w the parameter to be optimized. There are two ways to construct the approximate fitting function q̂: one is a linear function, which places higher demands on the selection of feature vectors, meaning a deeper understanding of the optimization problem is required; the other is a nonlinear function built with a neural network, which has better properties than a linear construction, so the invention uses a neural network to construct a nonlinear function to fit the objective function.
Specifically, the optimization equation is solved with the stochastic gradient descent algorithm. The objective is

J(w) = E[ ( Q(S, A) − q̂(S, A; w) )^2 ]

and, because the unknown Q(S, A) is replaced by the sampled target r_{t+1} + γ·max_a q̂(s_{t+1}, a; w), the update becomes:

w_{k+1} = w_k + α_k·[ r_{t+1} + γ·max_a q̂(s_{t+1}, a; w_k) − q̂(s_t, a_t; w_k) ]·∇_w q̂(s_t, a_t; w_k)

where w_k and w_{k+1} are the values of the optimization parameter w at the current step k and the next step k+1, α_k denotes the step size at step k, ∇_w denotes the gradient of the function J(w) with respect to w, and q̂(s_t, a_t; w_k) denotes the value function containing the optimization parameter w_k, evaluated by executing action a_t in state s_t.
In summary, a large amount of data can be used to perform the iterative algorithm described above to fit an approximate fit function.
Step S2.3: DQN analysis of the deep reinforcement learning algorithm;
The DQN is a deep Q network with the function-fitting property; it essentially solves the following optimization problem:

J(w) = E[ ( R_t + γ·max_a q̂(S_{t+1}, a; w) − q̂(S_t, A_t; w) )^2 ]

where (S_t, A_t, R_t, S_{t+1}) are random variables, and the target is y = R_t + γ·max_a q̂(S_{t+1}, a; w).

Solving with the stochastic gradient descent algorithm poses a problem: the parameter w to be optimized appears not only in q̂(S_t, A_t; w) but also in y. The w in y is therefore assumed to be a constant w_T, which makes the problem solvable by stochastic gradient descent. To realize this operation, two neural networks are generated: one is the main network q̂(s, a; w), the other is the target network q̂(s, a; w_T). The optimization parameters of the main network are updated continuously as data arrive, while the w_T of the target network is updated to w_T = w only after a certain number of updates. After many training iterations, the target network corresponding to the optimization parameter w is obtained. This target network approximately fits the optimal action value function Q(s, a) of the Q-learning algorithm: given a state s, it outputs the action a that maximizes Q(s, a), i.e. the optimal handover decision to be solved by the invention.
Step S3: constructing a state space, an action space and a return function in a deep reinforcement learning algorithm;
First, the definitions of the state space S, the action space A and the return function in the invention are given. The user may undergo multiple handovers within the period T, and the state space changes after each handover; to evaluate the merit of the handover decisions within T, the period must be divided into multiple time slots. The invention divides T into ζ equal-length slots of 60 seconds each, written T = [(t_1, t_2), (t_2, t_3), ..., (t_{ζ-1}, t_ζ)]. The state at time t_i is then modeled as the collection of: the satellites covering the user at t_i, the respective remaining service times of the satellites able to cover the user at t_i, the user's transmission rate requirement at t_i, the load states of the satellites covering the user at t_i, and the channel capacity each satellite can allocate to the user at t_i. The action space consists of the handover target selections: an action designates one candidate satellite as the handover target at t_i, and exactly one satellite is selected per decision. The return function is defined as in step S1.3, as a weighted sum of the normalized remaining service time ratio, the load margin, and the degree to which the rate requirement is met:
where w_1, w_2 and w_3 are weight factors, with w_1 + w_2 + w_3 = 1 and 0 ≤ w_1, w_2, w_3 ≤ 1, adjusted according to user requirements.
Therefore, the optimal handover decision problem to be solved by the invention is converted into the following optimization problem: find the handover policy that maximizes the expected cumulative return over the ζ slots of the period T.
Step S4: training the constructed deep reinforcement learning algorithm model with data, and executing the low-orbit satellite handover target decision task with the trained model;
the user can input the state space into the T moment by fitting the Q (s, a) function through the deep neural networkThe deep neural network model outputs action cost functions corresponding to a plurality of actions, and then selects one action sequence capable of maximizing accumulated return as a strategy pi * To assist the user starting from time T, at [ T, t+T ]]The optimal handover decision is performed during this time period. The optimal switching decision can well ensure the QoS of the user in the communication process.
The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims (9)

1. The decision generation method for DQN assisted low orbit satellite switching is characterized by comprising the following steps:
step S1: modeling a communication system in which a communication link is positioned in a low-orbit satellite communication switching process;
step S2: analyzing the feasibility of the DQN assisted low orbit satellite switching;
step S3: constructing a state space, an action space and a return function in a deep reinforcement learning algorithm;
step S4: training the constructed deep reinforcement learning algorithm model with data, and executing the low-orbit satellite handover target decision task with the trained model.
2. The method for generating a decision for DQN auxiliary low earth orbit satellite handoff according to claim 1, wherein the specific operation flow of step S1 is as follows:
step S1.1: determining an index affecting the QoS of the user communication;
the index comprises: the satellite can provide the user with the residual service time at the switching trigger time, the channel capacity provided for the user by the satellite at the switching trigger time and the load state of the satellite at the switching trigger time;
Step S1.2: determining a switching triggering condition;
the switching triggering conditions are as follows: the residual service time which can be provided by the current service satellite is 0; or, the data transmission rate requirement of the user exceeds the channel capacity which can be provided by the current service satellite;
step S1.3: modeling the problem;
under the condition that the handover trigger condition is not met, the optimal handover decision within the period T of the user's communication is considered, and the merit of each handover decision is quantified with the following formula:

R_s^t = x_s^t · ( w_1·(l_s^t/λ) + w_2·(1 − η_s^t) + w_3·min(1, r_s^t/φ_t) )

where R_s^t denotes the return obtained when the user selects satellite LEO_s as the handover target at time t; w_1, w_2 and w_3 are weight factors, with w_1 + w_2 + w_3 = 1 and 0 ≤ w_1, w_2, w_3 ≤ 1, adjusted according to user requirements;

the indexes affecting the user communication QoS are normalized: l_s^t/λ denotes the ratio of the remaining service time satellite LEO_s can provide user UE_n to the maximum coverage time λ, and the larger the ratio, the longer LEO_s can provide communication service; 1 − η_s^t denotes the load margin of satellite LEO_s when the handover is triggered; min(1, r_s^t/φ_t) denotes the degree to which the user's data transmission rate requirement is satisfied after the handover completes; x_s^t = 1 denotes that the user selects satellite LEO_s as the handover target at time t, and x_s^t = 0 that it does not; Σ_s x_s^t = 1 denotes that the user can select only one satellite as the access target at time t;

suppose the user needs to perform τ handovers within the period T, and record the satellites selected for the τ handovers as s_1, ..., s_τ, selected at times t_1, ..., t_τ; the problem to be solved, the handover target decision problem of reducing the number of handovers as much as possible while meeting the user communication QoS requirements, is modeled in the following form:

maximize Σ_{i=1}^{τ} R_{s_i}^{t_i}, subject to t_i ∈ [t, t+T]

where [t, t+T] defines the user's handover decision evaluation time, and R_{s_i}^{t_i} denotes the return obtained when the user selects satellite LEO_{s_i} as the handover target at time t_i.
3. The method for generating a decision for DQN auxiliary low earth orbit satellite handoff according to claim 1, wherein the specific operation flow of step S2 is as follows:
step S2.1: reinforcement learning analysis;
step S2.2: q-charging algorithm analysis;
step S2.3: DQN analysis of deep reinforcement learning algorithms.
4. A decision making method for DQN auxiliary low earth orbit satellite switching according to claim 3, wherein the specific operation flow of step S2.1 is as follows:
mapping the communication environment into the reinforcement learning environment, constructing the satellite remaining service time, satellite load state and user transmission rate requirement into the state space, the selection of the target satellite into the action space, and the decision merit into the return, so that the optimization problem converts into the optimal handover decision problem solved in the reinforcement learning algorithm; meanwhile, the optimal handover decision solved by the reinforcement learning algorithm corresponds to an action sequence, which matches the user's handover target decision problem within the period T considered in the invention; the optimal handover decision solved by the reinforcement learning algorithm is obtained by maximizing cumulative return, and when the user hands over using the policy obtained by the reinforcement learning algorithm, the executed policy maximizes the return obtained over the future period T, meeting the objective of the optimization problem; meanwhile, when reinforcement learning is performed, value function approximation is used to directly fit the state value function or the action value function.
5. The method for generating a decision for DQN auxiliary low earth orbit satellite handoff according to claim 4, wherein the specific operation flow of step S2.2 is as follows:
the Q-learning algorithm essentially solves the Bellman optimality equation:

Q(s, a) = E[ r_{t+1} + γ·max_{a'} Q(s_{t+1}, a') | s_t = s, a_t = a ]

wherein s_t = s denotes that the state at time t is s, a_t = a denotes that action a is executed at time t, s_{t+1} denotes the state at the next moment, r_{t+1} denotes the return at the next moment, γ is the discount factor, and Q(s, a) denotes the optimal action value function obtained by executing action a in state s under the optimal switching decision; the Bellman optimality equation is solved with an iterative algorithm:

Q_{k+1}(s_t, a_t) = Q_k(s_t, a_t) + α_k(s_t, a_t)·[ r_{t+1} + γ·max_a Q_k(s_{t+1}, a) − Q_k(s_t, a_t) ]

wherein α_k(s_t, a_t) denotes the step-size parameter corresponding to the k-th iteration, which satisfies the following conditions:

Σ_k α_k(s_t, a_t) = ∞ and Σ_k α_k²(s_t, a_t) < ∞

the iterative algorithm requires a set of experience data {(s_t, a_t, r_{t+1}, s_{t+1})}_t; performing the iterative algorithm with a large amount of such data yields the optimal estimate of Q(s, a), i.e., the optimal switching decision to be solved by the present invention.
6. The decision generation method for DQN-assisted low orbit satellite switching according to claim 5, wherein in step S2.2, the tabular Q-learning algorithm is limited to a finite state space when solving the optimal switching decision, so a form combining an approximate fitting function with the Q-learning algorithm is adopted; the idea of combining the approximate fitting function with the Q-learning algorithm is expressed by the following formula:

Q(S, A) ≈ q̂(S, A, w)

wherein Q(S, A) denotes the objective function, q̂(S, A, w) denotes the approximate fitting function, S is the state variable, A is the action variable, and w is the parameter to be optimized;
the approximate fitting function q̂(S, A, w) is constructed, and the following optimization objective is solved using a stochastic gradient descent algorithm:

J(w) = E[ ( R + γ·max_{a∈A(S')} q̂(S', a, w) − q̂(S, A, w) )² ]

again, because:

∇_w J(w) = −2·E[ ( R + γ·max_{a∈A(S')} q̂(S', a, w) − q̂(S, A, w) )·∇_w q̂(S, A, w) ]

thus:

w_{k+1} = w_k + α_k·( r_{t+1} + γ·max_a q̂(s_{t+1}, a, w_k) − q̂(s_t, a_t, w_k) )·∇_w q̂(s_t, a_t, w_k)

wherein w_k and w_{k+1} are the values of the optimization parameter w at the current step k and the next step k+1, α_k denotes the step size at step k, ∇_w denotes the gradient operation with respect to w, and q̂(s_t, a_t, w_k) denotes the value function, containing the optimization parameter w_k, of executing action a_t in state s_t.
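A minimal sketch of the stochastic-gradient update above, using a linear approximator q̂(s, a, w) = w·φ(s, a) so that ∇_w q̂ is simply the feature vector φ(s, a); the one-hot feature encoding and all sizes are illustrative assumptions:

```python
import numpy as np

def phi(s, a, n_states=3, n_actions=2):
    """One-hot feature vector for the (state, action) pair."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def sgd_q_update(w, s, a, r, s_next, alpha=0.1, gamma=0.9,
                 n_states=3, n_actions=2):
    """w_{k+1} = w_k + alpha * (r + gamma * max_a' q(s',a',w) - q(s,a,w)) * grad_w q."""
    q_next = max(w @ phi(s_next, a2, n_states, n_actions)
                 for a2 in range(n_actions))
    td_error = r + gamma * q_next - w @ phi(s, a, n_states, n_actions)
    return w + alpha * td_error * phi(s, a, n_states, n_actions)

w = np.zeros(6)
w = sgd_q_update(w, s=0, a=1, r=1.0, s_next=2)
```

With a deep network in place of the linear model, the same update becomes the semi-gradient step that step S2.3 builds on.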
7. The decision generation method for DQN-assisted low orbit satellite switching according to claim 6, wherein the specific operation flow of step S2.3 is as follows:
the DQN essentially solves the following optimization problem:

J(w) = E[ ( R_t + γ·max_{a∈A(S_{t+1})} q̂(S_{t+1}, a, w) − q̂(S_t, A_t, w) )² ]

wherein (S_t, A_t, R_t, S_{t+1}) are random variables; meanwhile, let:

y = R_t + γ·max_{a∈A(S_{t+1})} q̂(S_{t+1}, a, w_T)

treating w in y as a constant w_T, two neural networks are generated simultaneously: a main network q̂(s, a, w) and a target network q̂(s, a, w_T); the optimization parameter w of the main network is updated continuously as data are input, while the parameter w_T of the target network is updated to the current w of the main network only after a certain number of updates; after a number of training iterations, the target network corresponding to the optimized parameter w is obtained; the target network approximately fits the optimal action value function Q(s, a) of the Q-learning algorithm, and, given a state s, outputs the action a that maximizes the value Q(s, a), i.e., the optimal switching decision to be solved by the present invention.
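The main/target-network mechanism of step S2.3 can be sketched as follows; to stay self-contained, the "networks" are plain weight tables rather than deep networks, and the sizes, learning rate and sync period are illustrative assumptions:

```python
import numpy as np

class TinyDQN:
    """Minimal main/target-network pair illustrating the DQN target."""
    def __init__(self, n_states, n_actions, sync_every=100, gamma=0.9, lr=0.01):
        rng = np.random.default_rng(0)
        self.w = rng.normal(size=(n_states, n_actions))  # main network
        self.w_T = self.w.copy()                         # frozen target network
        self.sync_every, self.gamma, self.lr = sync_every, gamma, lr
        self.steps = 0

    def train_step(self, s, a, r, s_next):
        # The target y uses the frozen parameters w_T, per the DQN objective.
        y = r + self.gamma * np.max(self.w_T[s_next])
        self.w[s, a] += self.lr * (y - self.w[s, a])     # SGD on (y - q)^2
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.w_T = self.w.copy()                     # periodic sync w_T <- w

    def act(self, s):
        return int(np.argmax(self.w[s]))                 # greedy action for state s
```

The periodic copy `w_T <- w` is what lets y be treated as a constant between syncs, stabilizing the regression target.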
8. The decision generation method for DQN-assisted low orbit satellite switching according to claim 7, wherein the specific operation flow of step S3 is as follows:
the user may undergo multiple handovers within the time period T, and the state space changes after each handover; the time period T is divided into ζ time slots of equal length, each 60 seconds long, expressed as T = [(t_1, t_2), (t_2, t_3), ..., (t_{ζ−1}, t_ζ)]; the state space is modeled as S_{t_i} = {N_{t_i}, T_{t_i}, R_{t_i}, L_{t_i}, C_{t_i}}, wherein N_{t_i} denotes the set of satellites covering the user at time t_i, T_{t_i} denotes the remaining service time of each of the satellites that can cover the user at time t_i, R_{t_i} denotes the transmission rate requirement of the user at time t_i, L_{t_i} denotes the load status of the satellites covering the user at time t_i, and C_{t_i} denotes the channel capacity that each satellite can allocate to the user at time t_i; the action space is A = {1, 2, ..., S}, and a_{t_i} = s indicates that at time t_i the user selects satellite s, s ∈ A, as the handover target; the return function is defined as a weighted combination of the remaining service time, the load status and the allocable channel capacity of the selected target satellite, wherein w_1, w_2 and w_3 are all weight factors, w_1 + w_2 + w_3 = 1 and 0 ≤ w_1, w_2, w_3 ≤ 1, and the weight factors are adjusted according to the user demand;
therefore, the optimal switching decision problem to be solved by the present invention is converted into the following optimization problem: finding the switching strategy π* that maximizes the expected accumulated return E[ Σ_{i=1}^{ζ} γ^{i−1}·r_{t_i} ] over the time period T.
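The slotted decision process of step S3 can be sketched as below; the concrete reward shape (a weighted sum of normalized remaining service time, free load fraction and allocable capacity) and all numeric values are illustrative assumptions, since the claim fixes only the weight constraints w_1 + w_2 + w_3 = 1:

```python
import numpy as np

def reward(t_remain, load, capacity, w=(0.4, 0.3, 0.3)):
    """Illustrative per-slot return: weighted sum of the chosen satellite's
    normalized remaining service time, free load fraction, and allocable
    channel capacity. The exact normalization is an assumption."""
    w1, w2, w3 = w
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9   # claim constraint on the weights
    return w1 * t_remain + w2 * (1.0 - load) + w3 * capacity

# One period T split into zeta = 3 slots of 60 s: [(t1,t2), (t2,t3), (t3,t4)]
slots = [(0, 60), (60, 120), (120, 180)]
# Per-slot state of the chosen satellite: (remaining time, load, capacity) in [0, 1]
states = [(0.9, 0.2, 0.7), (0.6, 0.5, 0.8), (0.3, 0.1, 0.6)]
total = sum(reward(*st) for st in states)   # accumulated return over T
```

Shifting weight toward w_1 favors long-dwell satellites (fewer handovers), while weight on w_3 favors capacity, matching the "adjust according to user demand" provision.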
9. The decision generation method for DQN-assisted low orbit satellite switching according to claim 8, wherein the specific operation flow of step S4 is as follows:
the user fits the Q(s, a) function through the deep neural network: at time t, the state is input into the deep neural network model, the model outputs the action value function corresponding to each action, and the action sequence that maximizes the accumulated return is selected as the strategy π* to assist the user, starting from time t, in executing the optimal handover decision during the time period [t, t+T].
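The inference flow of step S4, acting greedily slot by slot on the fitted action values, can be sketched as follows; the trained network is stubbed by a hypothetical table of Q-values per slot:

```python
import numpy as np

def rollout(q_values_per_slot):
    """Greedy strategy pi*: in each slot, pick the candidate satellite
    (action) whose fitted action value Q(s, a) is largest."""
    return [int(np.argmax(q)) for q in q_values_per_slot]

# Hypothetical fitted Q-values for 3 slots over 4 candidate satellites
q_per_slot = [np.array([0.2, 0.9, 0.1, 0.4]),
              np.array([0.5, 0.3, 0.8, 0.1]),
              np.array([0.7, 0.2, 0.6, 0.3])]
actions = rollout(q_per_slot)  # handover-target sequence for [t, t+T]
```

Each entry of `actions` is the index of the satellite the user would hand over to in that slot.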
CN202311053153.8A 2023-08-21 2023-08-21 Decision generation method for DQN-assisted low-orbit satellite switching Pending CN116916409A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311053153.8A CN116916409A (en) 2023-08-21 2023-08-21 Decision generation method for DQN-assisted low-orbit satellite switching


Publications (1)

Publication Number Publication Date
CN116916409A true CN116916409A (en) 2023-10-20

Family

ID=88353263



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117278112A (en) * 2023-11-22 2023-12-22 银河航天(西安)科技有限公司 Satellite communication scheduling method and device for unmanned aerial vehicle and storage medium
CN117278112B (en) * 2023-11-22 2024-03-22 银河航天(西安)科技有限公司 Satellite communication scheduling method and device for unmanned aerial vehicle and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination