CN114499629A

CN114499629A - Dynamic resource allocation method for beam-hopping satellite system based on deep reinforcement learning

Info

Publication number: CN114499629A
Application number: CN202111609439.0A
Authority: CN
Inventors: 张晨; 韩永锋; 张更新
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2021-12-24
Filing date: 2021-12-24
Publication date: 2022-05-13
Anticipated expiration: 2041-12-24
Also published as: CN114499629B

Abstract

The invention discloses a dynamic resource allocation method for a beam hopping satellite system based on deep reinforcement learning, which comprises the following steps: step 1, establishing a service model of a forward link of a beam hopping GEO satellite system; step 2, storing the data packet of the service reaching the ground wave position in each time slot in a data packet buffer queue; step 3, a utilization degree reinforcement learning algorithm is adopted, a resource distribution module of the satellite is modeled into an intelligent agent, and state input of the intelligent agent, output decision-making action of the intelligent agent and reward of evaluation action are designed; step 4, simulating the deep reinforcement learning algorithm in the step 3, and continuously training decision neural network weight parameters of the deep reinforcement learning algorithm; and 5, completing dynamic resource allocation of the beam hopping satellite system by the decision neural network obtained by training in the step 4, and solving an optimal scheme of resource allocation of the beam hopping satellite system. The invention reduces the transmission delay of the data packet and improves the throughput of the beam hopping satellite system.

Description

Dynamic resource allocation method for beam-hopping satellite system based on deep reinforcement learning

Technical Field

The invention relates to the field of satellite communication, in particular to a dynamic resource allocation method for a beam hopping satellite system based on deep reinforcement learning.

Background

In conventional multi-beam satellite systems, the power and frequency resources allocated to each beam are relatively fixed. However, since the service requests between beams are non-uniform and time-varying, the conventional allocation algorithm cannot satisfy the service requests. The Beam Hopping (BH) technique is based on time slicing: only a portion of the beams are activated to operate in the same time slot. The beam hopping technology is driven by a service request, and the utilization rate of system resources can be greatly improved. The resource allocation algorithm of the forward link of the current beam hopping satellite system mainly has an heuristic algorithm, an iterative algorithm and a convex optimization algorithm. Both heuristic algorithms and iterative algorithms have large calculation amount and are not suitable for matching with the scene of ground service dynamic change in real time regions. The convex optimization calculation is suitable for a scene with smaller influence degree of co-channel interference between beams in a beam hopping satellite system.

On the other hand, Deep Reinforcement Learning (DRL) is one of the directions that have been most spotlighted in the field of artificial intelligence in recent years. The method combines the perception of deep learning and the decision of reinforcement learning, directly controls the behavior of an intelligent agent through the learning of high-dimensional perception input, and provides a way for solving the perception decision problem of a complex system. Some researches show that the deep reinforcement learning algorithm can obtain better performance in a satellite dynamic resource allocation system, and mainly relates to the aspects of inter-beam channel allocation of a multi-beam satellite system, multi-target optimization resource allocation of the multi-beam satellite and optimized transmission delay of a beam hopping satellite.

However, the existing beam hopping resource allocation algorithm based on deep reinforcement learning does not consider the problem of co-channel interference between beams. When the working beams are adjacent, interference is inevitable. In order to alleviate the problem of co-channel interference between beams, a dynamic resource allocation method for a beam hopping satellite system based on deep reinforcement learning needs to be designed on the basis of a criterion of considering interference avoidance.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a dynamic resource allocation method of a beam-hopping satellite system based on deep reinforcement learning.

The invention adopts the following technical scheme for solving the technical problems:

the invention provides a dynamic resource allocation method of a beam hopping satellite system based on deep reinforcement learning, which comprises the following steps:

step 1, establishing a service model of a forward link of a beam-hopping satellite system according to the characteristic of uneven time-space distribution of services of the beam-hopping GEO satellite system;

step 2, storing a data packet of a service reaching a ground wave position in each time slot in a data packet buffer queue according to the service model of the forward link of the beam hopping satellite system established in the step 1, wherein the data packet obeys the principle of first-come first-serve and establishes an optimization problem of minimizing the transmission delay of the data packet by combining the capacity which can be provided by a satellite;

step 3, introducing a deep reinforcement learning algorithm, modeling a resource allocation module of the satellite into an intelligent agent, and designing state input of the intelligent agent, output decision-making action of the intelligent agent and reward of evaluation action;

step 4, simulating the deep reinforcement learning algorithm in the step 3, initializing a satellite scene, setting parameters of the deep reinforcement learning algorithm, and continuously training decision neural network weight parameters of the deep reinforcement learning algorithm;

and 5, completing dynamic resource allocation of the beam hopping satellite system by the decision neural network obtained by training in the step 4, and solving an optimal scheme of resource allocation of the beam hopping satellite system.

As a further optimization scheme of the dynamic resource allocation method of the beam hopping satellite system based on deep reinforcement learning, a service model of a forward link of the beam hopping satellite system is established in step 1, and the method specifically comprises the following steps:

in a beam-hopping satellite system, the ground wave position psi is defined as psi ═

c

_n1,2,3, N, where N represents the total number of ground wave bits, c_nThe maximum working wave beam number is K which is less than or equal to N, and the wave beam jumping period is defined as T which is { T }₁,t₂,...,t_j,...,t_JWhere t is_jJ is more than or equal to 1 and less than or equal to J, wherein J is the total number of the beam hopping time slots;

t_jtime hopping beamsPattern(s)

Wherein the content of the first and second substances,

represents t_jWhen c is_nWhether it is illuminated by the operating beam or not,

represents t_jTime working beam ignition c_n，

Represents t_jNo working beam is on during the time c_n；

According to a hopping beam pattern

Calculating t_jWhen, c_nSignal to interference plus noise ratio of

Wherein, c_iIs the ith ground wave position,

represents t_jTime working beam ignition c_iTime pair c_nThe power gain includes satellite antenna transmit gain, free space loss, rain attenuation, and antenna receive gain;

represents t_jTime working beam ignition c_nTime pair c_nThe power gain of (a) is determined,

representt_jTime working beam to c_nThe power of the transmitted power of the satellite,

represents t_jTime working beam to c_iSatellite transmission power of, N₀Is the noise power spectral density, W is the satellite spectrum bandwidth,

represents t_jTime working beam ignition c_i；

t_jTime working beam ignition c_nSatellite beam transmission capacity of

Wherein f is_DVB-S2() is a piecewise function of the european telecommunications standards institute standard on signal to interference plus noise ratio and spectral efficiency.

As a further optimization scheme of the dynamic resource allocation method of the beam hopping satellite system based on deep reinforcement learning, the specific process of establishing the optimization problem of minimizing the transmission delay of the data packet in the step 2 is as follows:

t_jnew arrival of time c_nIs defined as

Storing data packets in a packet buffer queue

Wherein

Represents t_jWhen c is_nThe data packet buffer queue of (a) is,

representing the j-q th beam hopping time slot t_j-qTime of arrival c_nQ is not less than 0 and not more than T_th，T_thIs the maximum transmission delay of the data packet;

if the transmission delay of the data packet

Exceeding T_thThen the packet is discarded; wherein, the first and the second end of the pipe are connected with each other,

wherein, t_jIs the time slot, t, in which the data packet is transmitted_kIs the time slot for the data packet to arrive at the ground wave position;

in summary, the following optimization problem P for minimizing the packet transmission delay is established:

wherein, the first and the second end of the pipe are connected with each other,

represents t_kTo c_nThe data packet of (1). Equation (5) indicates that the largest operating beam in a single slot cannot exceed K,

represents t_jTime working beam ignition c_nEquation (6) shows that the sum of the powers of the operating beams in any time slot of the beam hopping period cannot exceed the total power P of the satellite_totThe power of a single working beam in any time slot of the beam hopping period shown in the formula (7) can not exceed the maximum power P of the single beam_bEquation (8) shows that the maximum transmission delay of the data packet cannot exceed T_th。

As a further optimization scheme of the dynamic resource allocation method of the beam hopping satellite system based on deep reinforcement learning, the step 3 is as follows:

step 301, state design of a deep reinforcement learning algorithm:

state s_tDefined as two attributes of the number of data packets and the average transmission delay of the data packets, and expressed by formula (9):

wherein the content of the first and second substances,

represents t_jTime of flight

The total number of data packets in the packet is defined as

Represents t_jTime of flight

The number of data packets in the packet stream,

to represent

The average transmission delay of the data packet in the inner is defined as

Represents t_jTime of flight

Average transmission delay of the data packets in the packet queue;

step 302, action design of a deep reinforcement learning algorithm:

hopping beam pattern

As agent at t_jOutput action of time

Obtaining a group of beam hopping pattern sets through an iterative algorithm as actions of a deep reinforcement learning algorithm

Wherein, X_kThe number of the k-th hopping beam pattern in the hopping beam pattern set is more than or equal to 1, k is less than or equal to num, and num is the number of the hopping beam patterns in the group of hopping beam pattern sets;

the iterative algorithm specifically comprises the following steps:

initializing ground wave position numbers, dividing the ground wave positions into M clusters, and setting a same frequency interference threshold value, wherein i is 1, and k is 1;

secondly, if all the ground wave positions in the ith cluster are contained in the Set, selecting one ground wave position from the ith cluster and lighting the ground wave position; otherwise, selecting one ground wave position which does not contain the Set from the ith cluster, and lightening the ground wave position;

step three, calculating the same frequency multiplexing distance of the working wave beam, if the same frequency multiplexing distance is larger than the same frequency interference threshold value, adding the ground wave position selected in the step two into the wave beam hopping pattern X_kOtherwise, the ground wave position which enables the same-frequency multiplexing distance to be maximum in the ith cluster is selected to be added into the hopping beam pattern X_k；

If i is not equal to M, i is equal to i +1, and the step (II) is returned; if i is equal to M, X is_kAdding into Set;

step five, if all elements in Set meet X₁∪X₂∪...∪X_kIf psi, the iteration is terminated; otherwise k equals k +1, i equals 1, and the step is returned;

step 303, reward design of the deep reinforcement learning algorithm:

deep reinforcement learning algorithm t_jReward of time

The following formula is set:

As a further optimization scheme of the dynamic resource allocation method of the beam hopping satellite system based on deep reinforcement learning, the step 4 is specifically as follows:

step I, initializing a satellite scene, and initializing a data packet buffer queue;

step II, initializing a satellite agent, initializing weight parameters of a decision neural network and a target network, initializing the training step number step of the decision neural network to be 0, and setting the updating step length of the target network to be G;

step III, initializing the capacity of the experience pool, setting a training period number E and a beam hopping time slot number J of each period, wherein the initialization training period E is 1, and the initialization time slot number J is 1;

step IV, t_jThe time data packet arrives at the ground wave position, and the satellite environment state information at the time is observed and extracted as

Step V, randomly selecting one beam hopping pattern X in the Set obtained in step 302 according to probability epsilon_kAs

Or selecting the action corresponding to the maximum action value output by the decision neural network according to the probability 1-epsilon as the action

Step VI, selected by performing step V

At which time the environment transitions to the next state

And obtain the prize at that time

Step VII, storing experience information

To an experience pool;

step VIII, randomly sampling a plurality of experience pieces of information from an experience pool, calculating a loss function, and training a decision neural network by using an Adam algorithm, wherein step is step + 1;

step IX, if step number step of training the decision neural network is a multiple of G, updating the weight parameter of the target network to be the weight parameter of the decision neural network, and executing step X;

if step number step of training decision neural network is not multiple of G, executing step X;

step X, firstly, judging whether J is equal to J or not, if J is not equal to J, then J is J +1, and returning to the step IV;

if J is equal to J, continuing to judge whether E is equal to E: if E is not equal to E, E +1, and the probability epsilon is reduced, the process returns to step IV, and if E is equal to E, the process is terminated.

Compared with the prior art, the invention adopting the technical scheme has the following technical effects:

the invention provides a feasible scheme for resource allocation of a satellite communication system based on beam hopping, a deep reinforcement learning algorithm is introduced, a satellite is modeled into an intelligent body, a decision neural network of the algorithm is continuously trained by designing the state, action and reward of the deep reinforcement learning algorithm, and finally the decision neural network obtained by training is used for completing the resource allocation of the beam hopping satellite system.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention;

FIG. 2 is a schematic diagram of a forward link traffic model of a beam-hopping satellite system;

FIG. 3 is a deep reinforcement learning algorithm framework diagram;

FIG. 4 is a flow chart of an iterative algorithm for beam hopping pattern design;

FIG. 5 is a graph comparing the transmission delays of data packets for a deep reinforcement learning algorithm, a random distribution algorithm, and a fixed distribution algorithm;

FIG. 6 is a graph comparing average throughput for a deep reinforcement learning algorithm, a random distribution algorithm, and a fixed distribution algorithm system.

Detailed Description

The technical scheme of the invention is further explained in detail by combining the attached drawings:

referring to fig. 1, the specific implementation steps of the present invention are as follows:

step 1, establishing a service model of a forward link of a beam-hopping satellite system according to the characteristic of uneven time-space distribution of services of the beam-hopping GEO satellite system.

The forward link traffic model of a beam-hopping satellite system is shown in fig. 2. In a beam-hopping satellite system, the ground wave position psi is defined as psi ═

c

t_jtime hopping beam pattern

Wherein the content of the first and second substances,

represents t_jTime working beam ignition c_n，

Represents t_jNo working beam is on during the time c_n；

According to a hopping beam pattern

Calculating t_jWhen, c_nSignal to interference plus noise ratio of

Wherein, c_iIs the ith ground wave position,

represents t_jTime working beam to c_nThe power of the transmitted power of the satellite,

represents t_jTime working beam ignition c_i；

t_jTime working beam ignition c_nSatellite beam transmission capacity of

And 2, establishing an optimization problem of minimizing transmission delay, and solving an optimal scheme of hopping beam resource allocation.

t_jNew arrival of time c_nIs defined as

Storing data packets in a packet buffer queue

Wherein

Represents t_jWhen c is turned on_nThe data packet buffer queue of (a) is,

if transmission delay of data packet

Exceeding T_thThen the packet is discarded; wherein the content of the first and second substances,

wherein t is_jIs the time slot, t, in which the data packet is transmitted_kIs the time slot in which the packet arrived.

wherein the content of the first and second substances,

And 3, modeling the satellite into an intelligent agent by utilizing a deep reinforcement learning algorithm, and designing reward design of state input of the intelligent agent, output decision-making action of the intelligent agent and quality of action behaviors.

The deep reinforcement learning algorithm framework is shown in fig. 3. The decision neural network is a mapping function of action values and decides the behavior and the action of the intelligent agent. In addition, in order to improve the performance of the decision neural network, an algorithm framework based on deep reinforcement learning is added into a target network and an experience pool. The specific design method of the algorithm comprises the following steps:

step 301, state design of a deep reinforcement learning algorithm:

wherein the content of the first and second substances,

represents t_jTime of flight

The total number of data packets in the packet is defined as

Represents t_jTime of flight

The number of data packets in the packet stream,

to represent

The average transmission delay of the data packet in the inner is defined as

Represents t_jTime of flight

Average transmission delay of the data packets in the packet;

step 302, designing the action of the deep reinforcement learning algorithm:

hopping beam pattern

As agent at t_jOutput action of time

the beam hopping pattern design flow is shown in fig. 4, and the iterative algorithm specifically includes the following processes:

If i is not equal to M, if i is equal to i +1, returning to the step (II); if i is equal to M, X is_kAdding into Set;

step five, if all elements in Set meet X₁∪X₂∪...∪X_kIf psi, then iterateTerminating; otherwise k equals k +1, i equals 1, and the step is returned;

step 303, reward design of the deep reinforcement learning algorithm:

the optimization problem (4) aims at minimizing the transmission delay of the data packet, and is known from state design

Representing a time slot t_jInner wave position c_nInner packet transmission delay summation. Therefore, the smaller the total transmission delay of the data packet is, the larger the reward is set, and the reward of the deep reinforcement learning algorithm can be set as follows

And 4, setting parameters of a deep reinforcement learning algorithm, and continuously training weight parameters of an optimization decision neural network.

The algorithm comprises the following specific steps:

Or selecting the action corresponding to the maximum action value output by the decision neural network as the action with the probability of 1-epsilon

Step VI, selected by performing step V

At which time the environment transitions to the next state

And obtain the prize at that time

Step VII, storing experience information

To an experience pool;

And 5, finally, completing dynamic allocation of beam hopping resources of the decision neural network obtained by training.

The normalized traffic is first defined as the total packet traffic of the ground wave bits divided by the maximum available capacity of the satellite. And secondly, using the decision neural network obtained by training in the step 4 for dynamic resource allocation of the beam-hopping satellite system. And finally, comparing the performances of the data packet transmission delay and the system throughput under different normalized traffic conditions by using three algorithms of resource allocation, random allocation algorithm and fixed allocation algorithm based on deep reinforcement learning. Wherein the random allocation algorithm indicates that the operating beams are randomly selected per time slot and the fixed allocation algorithm indicates that each beam is allocated a fixed number of time slots. The simulation results are shown in fig. 5 and fig. 6.

The effects of the present invention can be further verified by the following simulation.

1. An experimental scene is as follows:

in order to illustrate the effect of the method, the result of a comparison experiment is given by adopting 36 ground wave position 6-beam GEO satellite system model simulation.

2. Experimental contents and results:

in order to verify the performance of the method, a 36-ground wave position 6-beam GEO beam hopping system model is adopted, the maximum transmission delay threshold of a data packet in satellite scene parameters is set to be 4s, the beam hopping time slot is set to be 100ms, and the number of beam hopping period time slots is set to be 256. The training period in the deep reinforcement learning algorithm is set to be 1000, the size of an experience pool is 5000, the activation function of the decision neural network is Relu, the initial exploration probability is 0.5, and the termination exploration probability is 0.01.

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims

1. A dynamic resource allocation method for a beam-hopping satellite system based on deep reinforcement learning is characterized by comprising the following steps:

2. The method for dynamically allocating resources of a beam-hopping satellite system based on deep reinforcement learning according to claim 1, wherein a service model of a forward link of the beam-hopping satellite system is established in step 1, and specifically the following steps are performed:

in a beam-hopping satellite system, the ground wave position psi is defined as psi ═ c_n1,2,3, N, where N represents the total number of ground wave bits, c_nThe maximum working wave beam number is K which is less than or equal to N, and the wave beam jumping period is defined as T which is { T }₁,t₂,...,t_j,...,t_JWhere t is_jJ is more than or equal to 1 and less than or equal to J, wherein J is the total number of the beam hopping time slots;

t_jtime hopping beam pattern

Wherein the content of the first and second substances,

represents t_jTime working beam ignition c_n，

Represents t_jNo working beam is on during the time c_n；

According to a hopping beam pattern

Calculating t_jWhen, c_nSignal to interference plus noise ratio of

Wherein, c_iIs the ith ground wave position and is a reference ground wave position,

represents t_jTime working beam ignition c_i；

t_jTime working beam ignition c_nSatellite beam transmission capacity of

3. The method for dynamically allocating resources of a beam-hopping satellite system based on deep reinforcement learning according to claim 1, wherein the specific process of establishing the optimization problem for minimizing the transmission delay of the data packet in the step 2 is as follows:

t_jnew arrival of time c_nIs defined as

Storing data packets in a packet buffer queue

Wherein

Represents t_jWhen c is_nThe data packet buffer queue of (a) is,

if transmission delay of data packet

wherein the content of the first and second substances,

4. The method for dynamically allocating resources of a beam-hopping satellite system based on deep reinforcement learning according to claim 3, wherein the step 3 is as follows:

step 301, state design of a deep reinforcement learning algorithm:

represents t_jTime of flight

The total number of data packets in the packet is defined as

Represents t_jTime of flight

The number of data packets in the packet stream,

to represent

The average transmission delay of the data packet in the inner layer is defined as

Represents t_jTime of flight

Average transmission delay of the data packets in the packet;

step 302, designing the action of the deep reinforcement learning algorithm:

hopping beam pattern

As agent at t_jOutput action of time

the iterative algorithm specifically comprises the following steps:

step 303, reward design of the deep reinforcement learning algorithm:

deep reinforcement learning algorithm t_jReward of time

The following formula is set:

5. The method for dynamically allocating resources of a beam-hopping satellite system based on deep reinforcement learning according to claim 4, wherein the step 4 is as follows:

step III, initializing the capacity of the experience pool, setting the number E of training cycles and the number J of beam hopping time slots of each cycle, wherein the initialized training cycle E is 1, and the initialized time slot number J is 1;

Step VI, selected by performing step V

At which time the environment transitions to the next state

And obtain the prize at that time

Step VII, storing experience information

To an experience pool;