CN116505998A - Multi-beam satellite communication resource distribution system and method based on deep reinforcement learning - Google Patents

Multi-beam satellite communication resource distribution system and method based on deep reinforcement learning Download PDF

Info

Publication number
CN116505998A
CN116505998A (application CN202310363998.0A)
Authority
CN
China
Prior art keywords
satellite
user
satellite communication
equipment
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310363998.0A
Other languages
Chinese (zh)
Inventor
王燕妮
卫江涛
徐丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202310363998.0A priority Critical patent/CN116505998A/en
Publication of CN116505998A publication Critical patent/CN116505998A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/14Relay systems
    • H04B7/15Active relay systems
    • H04B7/185Space-based or airborne stations; Stations for satellite systems
    • H04B7/1851Systems using a satellite or space-based relay
    • H04B7/18519Operations control, administration or maintenance
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/14Relay systems
    • H04B7/15Active relay systems
    • H04B7/185Space-based or airborne stations; Stations for satellite systems
    • H04B7/1851Systems using a satellite or space-based relay
    • H04B7/18513Transmission in a satellite or space-based system
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/02Resource partitioning among network components, e.g. reuse partitioning
    • H04W16/10Dynamic resource partitioning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/22Traffic simulation tools or models
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Astronomy & Astrophysics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Physics & Mathematics (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Radio Relay Systems (AREA)

Abstract

The invention provides a multi-beam satellite communication resource allocation system and method based on deep reinforcement learning. The system comprises satellite equipment and ground equipment: the satellite equipment comprises a satellite communication module, a satellite available resource discrimination device and a satellite resource allocation device; the ground equipment comprises the set of users within the ground range covered by the satellite beams, a user service performance discrimination device and a ground user communication module. The satellite communication module is connected with the satellite available resource discrimination device, the satellite available resource discrimination device is connected with the satellite resource allocation device, and the satellite resource allocation device is connected with the satellite communication module. The invention has strong information perception and allocation decision capability, can adaptively adjust the satellite communication resources under its coverage, fully accounts for time-domain correlation, and improves the stability of the system's allocation results.

Description

Multi-beam satellite communication resource distribution system and method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of satellite communication resource allocation, in particular to a multi-beam satellite communication resource allocation system and method based on deep reinforcement learning.
Background
Satellite mobile communication systems have become an important component of global mobile communication owing to their high reliability, wide coverage and large capacity. Their advantages are that communication quality is not limited by geography; mobile communication services can be provided to users worldwide over large spans and wide areas; and under a 5G architecture the expected traffic capacity can reach 10 Gbps. Mobile communication today spans many fields, including civilian and military use. On the one hand, growing communication traffic requires continuous improvement of satellite communication service modes and quality; on the other hand, on-board power and channel resources are limited, so developing efficient on-board resource allocation algorithms is essential.
Current resource allocation algorithms mainly aim to minimize transmission power and system interference while satisfying user communication requests. Among satellite antenna technologies, multi-beam antennas are one of the effective approaches to this band-limited problem. Compared with single-beam antennas, a multi-beam antenna uses multiple high-gain narrow beams to achieve larger area coverage, supports frequency reuse through spatial separation of beams, and offers high signal-to-noise ratio and flexible coverage adjustment. Current research on multi-beam satellite communication resource allocation mainly considers fixed channel allocation (FCA) and dynamic channel allocation (DCA); the difference between the two is whether the bandwidth and power allocated to a specific channel are fixed.
In fixed channel allocation, the communication resources allocated to a single channel do not change with the actual traffic; because it cannot adapt, this scheme is of limited practical use. In contrast, dynamic channel allocation can allocate channel resources according to the real-time traffic of the system. Allocation typically takes multiple factors into account, such as total satellite power, single-beam power and co-channel interference between beams, so that ongoing user communication is not disturbed and the utilization of system resources is improved.
Part of the current DCA research targets channel resource allocation, jointly considering channel interference and the demands of communication service nodes and allocating channel resources proportionally. Other studies focus on power allocation, for example using the water-filling algorithm and its variants to achieve dynamic allocation by linking power allocation to channel quality. As traffic categories increase, some studies have moved from allocating frequency or power alone to joint allocation of resources. One work proposes an algorithm aimed at improving the quality of service of users at the beam edge, but the number of carriers supported within the beam coverage is limited and the algorithm's performance depends heavily on the beam scenario. Another work proposes a joint power-channel allocation algorithm that adopts a parallel multi-beam scheme to obtain power gain and guarantees fairness of the power and channel allocation proportions, but it ignores the co-channel interference between beams. Traditional hand-designed DCA algorithms inevitably suffer from limited adaptability: their applicable scenarios are relatively fixed and their performance is strongly scenario-dependent. Moreover, when making decisions these algorithms only consider the allocation effect at the current moment and cannot account for the long-term influence of the current decision on future decisions.
With the rapid development of artificial intelligence, combining deep learning techniques with mobile communication has become a trend and offers a promising approach to satellite resource allocation. Reinforcement learning is the technique closest to general artificial intelligence: through an interactive learning mechanism, an agent can exploit its existing policy while continuously exploring and learning new policies from environmental feedback. Reinforcement learning is also widely studied and applied to radio resource allocation. One study uses reinforcement learning to learn, from historical power allocation data, the best allocation strategy in a scenario where device-to-device and cellular users coexist. Another applies a multi-agent reinforcement learning algorithm with centralized training and distributed exploration, effectively improving the efficiency of dynamic channel allocation for unmanned aerial vehicles. A further study investigates the application of reinforcement learning in distributed cognitive radio and achieves context-aware intelligence.
Combining the problems of current satellite communication systems with the advantages of deep reinforcement learning, the invention studies a dynamic channel allocation algorithm for the multi-beam satellite communication scenario designed with deep reinforcement learning. The satellite communication system and the ground users are modelled as the agent and the environment respectively, their interaction is modelled as a Markov decision process, and the system's dynamic channel allocation policy is improved through interactive learning. Compared with hand-designed allocation algorithms, an algorithm designed this way can fully consider the time-domain correlation of channel and power allocation while improving applicability, so that it can be applied effectively in a variety of channel allocation settings.
Disclosure of Invention
The invention aims to provide a multi-beam satellite communication resource allocation system and method based on deep reinforcement learning, to address the problems that manually designed dynamic resource allocation systems do not consider time-domain correlation, and that existing deep-reinforcement-learning-based allocation systems produce high-variance, unstable allocation results.
To achieve the above purpose, the invention provides the following technical solution. The multi-beam satellite communication resource allocation system and method based on deep reinforcement learning comprise a plurality of satellite communication devices arranged in a preset satellite communication area, each comprising a satellite modem device and a satellite communication antenna connected in sequence; the satellite communication module is used for receiving, via the antenna and the satellite sensing device, the data to be communicated within the preset satellite communication area.
The satellite resource allocation device, based on deep reinforcement learning, continuously self-learns and iteratively updates during resource allocation, optimizing the allocation result.
The satellite available resource discriminator counts the currently available channel resources according to the current online user information and the channel allocation history, subject to the satellite single-beam transmit power upper limit, the per-channel transmit power upper limit and the total satellite transmit power upper limit.
The user service performance discrimination device computes allocation feedback based on the Shannon formula and transmits this feedback to the resource allocation device, improving allocation performance and stability.
The ground users are located within the ground range covered by the satellite beams.
The ground communication module comprises a modem device and a communication antenna connected in sequence, and is used for receiving the communication data transmitted by the satellite equipment.
A multi-beam satellite communication resource allocation method based on deep reinforcement learning comprises the following steps:
s1, initializing:
s101, initializing relevant parameters of a satellite simulation platform;
s102, initializing deep reinforcement learning parameters;
S103, initializing the Q-network parameter θ, the target network 1 parameter θ_1 = θ and the target network 2 parameter θ_2 = θ;
S104, initializing scene parameters: W = ∅, N_block = 0, N_arrival = 0, Term = 0, step count i = 0, where N_block is the cumulative number of blocked users within a given period and N_arrival is the total number of user requests;
s2, training repetition stage:
s201, generating a new service distribution scene under the current iteration;
S202, generating a new service request u_t and combining it with the current online service information U_t and the channel allocation matrix W_t at the current moment;
S203, constructing the state s_t according to the state definition in the MDP;
S204, calculating the currently available channel resources and updating the action space A(s_t);
S205, if A(s_t) is an empty set, updating the blocked traffic count N_block = N_block + 1, setting Term = 1, and obtaining the immediate reward value r_t;
S206, if A(s_t) is non-empty, selecting the action a_t = argmax_{a ∈ A(s_t)} Q(s_t, a) according to the ε-greedy strategy and obtaining the immediate reward value r_t;
S207, obtaining the next state s_{t+1} and storing the tuple (s_t, a_t, r_t, s_{t+1}, Term) in the experience pool;
S208, updating the parameters W, U and updating the service arrival count N_arrival = N_arrival + 1;
S209, updating the parameter ε according to the exponential update strategy; if the network training threshold is reached, selecting a batch of data from the experience pool and updating the network parameters according to the formula;
S210, every I steps, copying the Q-network parameter θ to the two target Q networks θ_1 and θ_2, and obtaining the blocking rate P from N_block and N_arrival;
S211, finishing policy training and outputting the channel allocation result.
Compared with the prior art, the invention has the beneficial effects that:
the invention models the dynamic channel allocation problem in a multi-beam satellite communication system as an optimization problem: maximizing channel utilization under a limited number of available channels, so that the system blocking probability is minimized. The modelled optimization problem is a time-domain-correlated sequential decision problem subject to many constraints, and deep reinforcement learning can effectively solve such sequential decision problems in complex environments.
Drawings
FIG. 1 is a schematic view of an application scenario of a satellite communication resource allocation system according to the present invention;
FIG. 2 is a schematic diagram of a satellite resource allocation apparatus according to the present invention;
FIG. 3 is a schematic diagram of a satellite available resource discriminating apparatus according to the present invention;
fig. 4 is a schematic diagram of the service arrival rates of the different beams when the ground traffic is uniformly distributed according to the present invention;
fig. 5 is a schematic diagram of the service arrival rates of the different beams when the ground traffic is non-uniformly distributed according to the present invention;
FIG. 6 is the blocking-rate curve of the DDQN-DCA training process of the present invention;
FIG. 7 is a graph of system blocking rate versus traffic arrival rate for a uniform distribution according to the present invention;
FIG. 8 is a graph of system blocking rate versus traffic arrival rate for non-uniform distribution in accordance with the present invention;
FIG. 9 is a table of simulation parameters of the satellite communication system of the present invention;
fig. 10 is a diagram of the beam user coordinate mapping relationship of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-10, the present invention provides a technical solution: a multi-beam satellite communication resource allocation system and method based on deep reinforcement learning, described in detail as follows:
1.1 deep reinforcement learning
Driven by advances in deep learning, reinforcement learning and deep learning have been combined and deep reinforcement learning has gradually developed. Using the strong representational capacity of neural networks, deep reinforcement learning can handle problems whose state and action spaces are high-dimensional. In deep reinforcement learning, deep learning is mainly used to represent the value function or the policy function of reinforcement learning. Depending on whether the environment is modelled, deep reinforcement learning can further be divided into model-based and model-free methods.
The design objective in the satellite communication context is a dynamic channel allocation optimization algorithm. Unlike supervised learning, reinforcement learning does not immediately evaluate the quality of the current decision but considers its long-term impact. This matches the fact that, in a multi-beam scenario, the system's channel allocation also exhibits time-domain correlation. On this basis, a model-free deep reinforcement learning algorithm is used to explore the optimal allocation strategy and solve this sequential decision problem with time-domain correlation.
1.2 satellite communication System modeling
The invention models the dynamic channel allocation problem in the multi-beam satellite communication system as an optimization problem whose objective is to maximize channel utilization and minimize the system blocking probability under a limited number of available channels. Before solving this optimization problem with deep reinforcement learning, the communication scenario is first modelled.
In a multi-beam satellite communication system, a single satellite can transmit multiple beams to serve ground users. Assume the on-board multi-beam transmitter transmits N beams in total; a beam is denoted n ∈ {n | n = 1, 2, ..., N}, and the areas covered by the N beams together constitute the satellite coverage area. Let the total satellite communication bandwidth be B_total and the total number of channels be M. Bandwidth resources are allocated evenly and individual channels do not overlap, so each channel occupies B_single = B_total / M, and an available channel is denoted m ∈ {m | m = 1, 2, ..., M}. The total number of user terminals served by the system is K, a single user terminal being denoted k ∈ {k | k = 1, 2, ..., K}. When the K users access beams, each user accesses exactly one beam; user distributions differ in practice, so the number of users accessing each beam also differs. During communication the satellite transmits on the channel of the beam to which user k is attached, and each user only receives the signal of its own access channel. A beam-based channel allocation vector is therefore defined as w_n = [w_{n,1}, w_{n,2}, ..., w_{n,M}]^T, where each entry w_{n,m} ∈ {0,1} indicates whether channel m in beam n is occupied, and the channel allocation matrix of the whole system is the two-dimensional matrix W = [w_1, w_2, ..., w_N], W ∈ R^{M×N}. Similarly, a beam-based power allocation vector is defined as p_n = [p_{n,1}, p_{n,2}, ..., p_{n,M}]^T, where each entry p_{n,m} is the power allocated by beam n to channel m, and the power allocation matrix of the whole system is P = [p_1, p_2, ..., p_N], P ∈ R^{M×N}. The satellite transmit signal is normalized, i.e. the power of each channel within each beam is allocated evenly.
When users in different beams communicate, co-channel interference exists among users occupying the same channel; its magnitude depends on factors such as the inter-user distance, the antenna radiation gain, the multi-beam antenna gain, the user receive gain, free-space attenuation and propagation loss. Ignoring natural cloud, rain and fog, the free-space attenuation loss L depends on the satellite operating frequency f and the satellite-to-ground-user distance d and follows the standard free-space loss expression
L(dB) = 20 lg f + 20 lg d − 27.55,
where f is in kHz and d is in km. The transmit antenna gain is defined as G_s; since the satellite antenna is directional, the transmit antenna gain is affected by the beam bandwidth, the antenna radiation pattern and the antenna off-axis angle. The multi-beam antenna radiation pattern of the invention is designed with reference to the ITU-S.672 antenna pattern in STK. The design assumes that the channel gains of the access users within the same beam are identical and equal to the corresponding beam-center gain value. As the antenna off-axis angle θ increases, the beam antenna gain gradually decreases, so the channel gains between beams can be obtained from the distances between beam centers. The receive gain G_r of user k is further assumed to be a fixed, temperature-dependent value. The overall channel gain matrix H of the system is therefore written H = {H_{i,j} | 1 ≤ i ≤ N, 1 ≤ j ≤ N}, H ∈ R^{N×N}, where each entry is computed from L, G_s and G_r as
H_{i,j} = −L(dB) + G_s(dBi) + G_r(dBi),
and H_{i,j} denotes the average channel gain from beam i to the users within beam j.
When a user receives a signal, both the co-channel interference from the side lobes of co-frequency beam antennas caused by frequency reuse and the downlink white Gaussian noise are considered. The total interference power experienced by user k occupying channel m in beam n is denoted I_{n,m}^k and consists of two parts, the co-channel interference power and the noise power A_k:
I_{n,m}^k = Σ_{j≠n} w_{j,m} p_{j,m} h_{j,n} + A_k,
A_k = B_single N_0,
where p_{j,m} is the power the satellite transmits on channel m via beam j, h_{j,n} is the linear channel gain corresponding to H_{j,n}, and N_0 is the noise power spectral density.
Furthermore, the useful signal power of user k on channel m in beam n is defined as
S_{n,m}^k = p_{n,m} h_{n,n},
i.e. the transmit power allocated on that channel weighted by the channel gain within its own beam.
To measure the quality of the received signal, the signal-to-interference-plus-noise ratio at the receiving end is computed; for user k it is
SINR_k = S_{n,m}^k / I_{n,m}^k.
the channel capacity of user k occupying channel m in beam n can be calculated according to shannon's formula The calculation formula is as follows:
To guarantee user service quality, the ideal achievable rate C_k that user k obtains under the corresponding channel allocation should not fall below the set threshold C_th, i.e. C_k ≥ C_th; in that case the traffic is transmitted normally, otherwise the traffic is blocked. γ_t indicates whether the new service request at time t is blocked.
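The link budget, SINR and Shannon-capacity chain described above can be summarised in a short Python sketch. All numerical values below (frequency, distance, antenna gains, interference, noise density, bandwidth and the rate threshold) are illustrative assumptions rather than values specified above, and the free-space constant is the standard one for the stated units.

```python
import numpy as np

def free_space_loss_db(f_khz: float, d_km: float) -> float:
    # Standard free-space loss with f in kHz and d in km (assumed form).
    return 20 * np.log10(f_khz) + 20 * np.log10(d_km) - 27.55

def channel_gain_db(loss_db: float, g_tx_dbi: float, g_rx_dbi: float) -> float:
    # H_{i,j} = -L + G_s + G_r (all in dB / dBi).
    return -loss_db + g_tx_dbi + g_rx_dbi

def capacity_bps(p_tx_w, gain_db, interference_w, b_single_hz, n0_w_per_hz):
    # SINR and Shannon capacity for one user on one channel.
    signal_w = p_tx_w * 10 ** (gain_db / 10)
    noise_w = b_single_hz * n0_w_per_hz
    sinr = signal_w / (interference_w + noise_w)
    return b_single_hz * np.log2(1 + sinr)

# Illustrative numbers (not from the text): 20 GHz downlink, GEO-like distance.
loss = free_space_loss_db(f_khz=20e6, d_km=36_000)
gain = channel_gain_db(loss, g_tx_dbi=45.0, g_rx_dbi=40.0)
c_k = capacity_bps(p_tx_w=10.0, gain_db=gain, interference_w=1e-13,
                   b_single_hz=25e6, n0_w_per_hz=4e-21)
blocked = c_k < 2e6   # compare with an assumed rate threshold C_th of 2 Mbps
print(f"C_k = {c_k / 1e6:.1f} Mbps, blocked = {blocked}")
```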
In addition, when a new service request arrives, it must be determined whether the beam accessed by the user has available channel resources. The criteria for whether a channel is available include: whether the power of the individual beam is saturated, whether the total on-board power is saturated, and whether the new allocation would harm the service quality of already allocated users. Expressed on the resource allocation matrices, the constraints are
Σ_m w_{n,m} p_{n,m} ≤ P_beam for every beam n,
Σ_n Σ_m w_{n,m} p_{n,m} ≤ P_total,
C_k ≥ C_th for every user k in U_t,
where P_total is the total on-board power, P_beam is the total power of a single beam, and U_t is the current user request information.
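A minimal sketch of the availability check performed by the satellite available resource discriminator follows, using the allocation matrix W and power matrix P defined above; the function name, the separate handling of the allocated-user QoS check and all numerical limits are assumptions made for illustration.

```python
import numpy as np

def available_channels(W, P, n, p_beam_max, p_total_max, p_channel_max, p_new):
    """Channels in beam n that can host a new request without violating the
    per-channel, per-beam and total on-board power limits (the QoS of already
    allocated users would be checked separately)."""
    M = W.shape[0]
    used_beam = (W[:, n] * P[:, n]).sum()   # power already in use in beam n
    used_total = (W * P).sum()              # total on-board power in use
    avail = []
    for m in range(M):
        if W[m, n] == 0 \
           and p_new <= p_channel_max \
           and used_beam + p_new <= p_beam_max \
           and used_total + p_new <= p_total_max:
            avail.append(m)
    return avail

# Example with M = 4 channels and N = 3 beams (all numbers illustrative).
W = np.array([[1, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 0]])
P = np.full((4, 3), 5.0)
print(available_channels(W, P, n=0, p_beam_max=20.0,
                         p_total_max=60.0, p_channel_max=10.0, p_new=5.0))
```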
The system blocking rate is defined to describe the quality of the current channel allocation algorithm. The blocking probability is computed as
P_block = N_block / N_arrival,
where N_block is the cumulative number of blocked users within a given period and N_arrival is the total number of user requests. The optimization goal is therefore to minimize the system blocking rate.
1.3 dynamic channel optimization algorithm based on deep reinforcement learning
Algorithm architecture
When solving the problem with reinforcement learning, in order to find an optimization algorithm by which the satellite communication system allocates channel resources to ground users, the satellite communication system is modelled as the agent, the ground users are regarded as the environment, and their interaction is modelled as a Markov decision process, which proceeds as follows. Starting from the initial time t_0, the agent represented by the satellite perceives the observation s_0 of the current environment and, based on the current policy, outputs the action a_0 in this state; this action may change the state of the environment. According to the state transition rule of the environment, the environment observed by the satellite becomes s_1, and at the same time the environment returns a reward r_1 for this action. This process is repeated until the set termination condition is reached, ending a single interaction round.
As the above description shows, the channel corresponding to an action is a discrete value, so a model-free reinforcement learning algorithm with a discrete action space is selected. The Deep Q Network (DQN) algorithm proposed by Mnih et al. combined convolutional neural networks with reinforcement learning for the first time and originated deep reinforcement learning. The algorithm uses a Q network to estimate Q values, replacing the Q table of the Q-learning algorithm, and can thus represent a continuous state space. The Q network estimates the Q value of every discrete action, the action with the highest Q value is executed, and an ε-greedy strategy is used to collect diverse training data. DDQN (Deep Double Q-Network) is an improvement of DQN that trains two target Q networks simultaneously and uses the smaller of their Q values to assist the policy output, aiming to alleviate the overestimation problem during DQN training and to improve training stability and speed. Considering the research background and the application environment, the invention adopts the DDQN algorithm, and the deep-reinforcement-learning dynamic channel allocation algorithm is named DDQN-DCA.
The DDQN-DCA algorithm is shown schematically in fig. 2. After a new service request arrives, the agent outputs an action according to the current environment state provided by the satellite, then obtains a reward and transitions to the next state. The experience tuples of the interaction process are stored in an experience pool. When the set threshold is reached, part of the data is randomly sampled to update the network. As the interaction proceeds, the data in the experience pool is continuously refreshed as the policy improves.
Markov decision process
The Markov decision process satisfies the Markov property, so the state transition probability p(s_{t+1} | s_t, a_t) can describe the complete internal model of the system.
In reinforcement learning, the goal of the agent's interactive learning is to maximize the discounted cumulative return
G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},
where the discount factor γ attenuates the contribution of future rewards to the current state value: a γ close to 1 emphasizes long-term rewards, while a γ close to 0 emphasizes short-term rewards.
The invention designs each part of the reinforcement learning correspondingly as follows:
state space design
The state expresses all useful information about the environment that the sensor can observe at the current moment, to support the agent's decision. The main factors influencing the policy output are the current online service information U_t, the channel allocation matrix W_t at the current moment and the current new service request information u_t. The observation at the current moment is therefore
o_t = (U_t, W_t, u_t)
The policy network of the invention consists of a convolutional neural network followed by a fully-connected network. Since the state s_t must be read into the neural network to produce the policy output, its representation needs to be restructured to facilitate policy learning. Before the state is fed into the network, manual feature extraction is performed first. Because the co-frequency signals in the beams adjacent to the beam of the new service request are the main source of channel interference at the current moment, the online service information is simplified: only the online traffic near the beam of the new service request is considered. Combined with the channel allocation, the state s_t is expressed as
s_t = {s_{i,j,k} | 1 ≤ i ≤ 10, 1 ≤ j ≤ 10, 1 ≤ k ≤ M+1},
s_t ∈ R^{10×10×(M+1)}.
Specifically, an all-zero three-dimensional state tensor s_t is first initialized, and all online services of the beam containing the new service request and of its surrounding two layers of beams are extracted. Taking service k as an example, its two-dimensional coordinates (i, j) in the 10×10 grid are obtained from the beam-grid mapping of fig. 10. If user k is the new service request, (i, j, M+1) is set to 1; if user k is a historical online service, its channel m is determined and (i, j, m) is set to 1. Looping over all online services near the beam completes the representation of s_t.
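The construction of the 10 × 10 × (M+1) state tensor can be sketched as follows; the grid_map argument stands in for the beam-grid mapping of fig. 10, which is not reproduced here, so its exact contents are assumed.

```python
import numpy as np

def build_state(online_services, new_request, M, grid_map):
    """Construct the 10 x 10 x (M+1) state tensor described above.

    online_services: list of (user_id, beam, channel) for historical online traffic
                     in the request's beam and its neighbouring beams.
    new_request:     (user_id, beam) of the newly arrived request.
    grid_map:        dict mapping (user_id, beam) -> (i, j) grid coordinates,
                     standing in for the beam-grid mapping of fig. 10 (assumed).
    """
    s = np.zeros((10, 10, M + 1), dtype=np.float32)
    i, j = grid_map[new_request]
    s[i, j, M] = 1.0                     # last plane marks the new request
    for user_id, beam, m in online_services:
        i, j = grid_map[(user_id, beam)]
        s[i, j, m] = 1.0                 # plane m marks channel-m occupancy
    return s
```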
When the system has no available channel resources or all service requests have been allocated, the system reaches the termination state and the single round ends.
Action space design
The action space A_t is defined as the set of channels available in the current state, and the action a_t is defined as the channel the satellite communication system allocates to the service request in state s_t, which belongs to the action space. Based on the current environment observation and the Q-network policy, the satellite follows an ε-greedy strategy: it either samples an action from the action space at random or selects the action with the largest state-action value as the policy output. If the current action space is an empty set, no channel resource can be allocated to the service request and the system is blocked.
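The ε-greedy selection restricted to the available-channel action space can be sketched as below; the function name and the masking approach are illustrative, and the Q network is assumed to map a state tensor to an M-dimensional vector of Q values as described later.

```python
import random
import torch

def select_action(q_net, state, available_channels, epsilon):
    """epsilon-greedy selection restricted to the available-channel action space."""
    if not available_channels:
        return None                                  # empty action space: blocked
    if random.random() < epsilon:
        return random.choice(available_channels)     # explore
    with torch.no_grad():
        q = q_net(state.unsqueeze(0)).squeeze(0)     # shape (M,)
    mask = torch.full_like(q, float("-inf"))
    mask[available_channels] = 0.0                   # keep only channels in A(s_t)
    return int((q + mask).argmax().item())           # exploit within A(s_t)
```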
Reward design
The reward value r_t is the feedback evaluation of the system's current policy output, and its definition steers the direction of policy learning. The reward consists of two terms, where R_1 and R_2 are two hyper-parameters, m_t is the number of online users in the system, P_block^{t-1} is the blocking probability at the previous moment and P_block^t is the blocking probability at the current moment. The first term follows the real-time change of the system blocking rate: when the blocking rate decreases the term is positive, encouraging the agent to keep exploring in the direction of the current policy. The second term relates to the complexity of the allocation task: as the number of online users grows, allocation becomes harder, so if the blocking rate can still be reduced an additional reward that accounts for the number of online users is granted, and otherwise the second term is zero. Because the system blocking rate can make the reward both positive and negative, this design avoids the situation where an always-positive reward makes the policy learning effect heavily dependent on sampling.
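One possible realisation of this two-term reward is sketched below; since the exact functional form is given only in the figures, the particular combination of the two terms and the default values of R_1 and R_2 are assumptions.

```python
def reward(p_block_prev, p_block_curr, n_online, R1=1.0, R2=0.01):
    """Two-term reward sketch: the first term is positive when the blocking rate
    drops, the second adds a bonus scaled by the number of online users when the
    blocking rate still decreases (R1, R2 are hyper-parameters; values assumed)."""
    term1 = R1 * (p_block_prev - p_block_curr)
    term2 = R2 * n_online if p_block_curr < p_block_prev else 0.0
    return term1 + term2
```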
1.4 Algorithm implementation
Network architecture
The invention adopts the DDQN algorithm, comprising three neural networks: one Q network and two target Q networks. The Q network consists of 2 convolutional layers and 2 fully-connected layers: the first convolutional layer conv1 consists of 32 convolution kernels of size 5 × 5 with a ReLU activation, and the second convolutional layer conv2 consists of 64 convolution kernels of size 7 × 7 with a ReLU activation. The third layer is a fully-connected layer with a ReLU activation, and the last fully-connected layer outputs an M-dimensional vector, one estimated Q value per action. The network structure of the target Q networks is identical to that of the Q network.
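A PyTorch sketch of this Q-network layout is given below; the convolution padding and the width of the hidden fully-connected layer are assumptions needed to make the 10 × 10 input dimensions work out and are not specified above.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q network following the layout in the text: two conv layers (32 x 5x5,
    64 x 7x7, each with ReLU) followed by two fully-connected layers, the last
    one outputting one value per channel. Padding and the hidden width (256)
    are assumptions needed to make the 10x10 input work."""
    def __init__(self, num_channels_M: int, hidden: int = 256):
        super().__init__()
        in_planes = num_channels_M + 1            # state is 10 x 10 x (M+1)
        self.features = nn.Sequential(
            nn.Conv2d(in_planes, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=7, padding=3), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 10 * 10, hidden), nn.ReLU(),
            nn.Linear(hidden, num_channels_M),    # one Q value per channel/action
        )

    def forward(self, x):                         # x: (batch, M+1, 10, 10)
        return self.head(self.features(x))

q_net = QNetwork(num_channels_M=30)
print(q_net(torch.zeros(1, 31, 10, 10)).shape)    # torch.Size([1, 30])
```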
Network update
In updating the Q network parameter θ, the following loss function is optimized using an Adam optimizer.
loss = E[(y_j − Q(s_j, a_j; θ))^2]
where the target value y_j is given by [17]
y_j = r_j + (1 − Term_{j+1}) × γ × min_{i∈{1,2}} max_{a'} Q(s_{j+1}, a'; θ_i).
When the system reaches the termination state, Term_{j+1} is set to 1; otherwise it is set to 0.
The target network parameters θ_1 and θ_2 are updated by copying the Q-network parameters every fixed number of steps.
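The update rule can be sketched as a single PyTorch training step as follows; the use of the element-wise minimum of the two target networks follows the double-target description above, and the function names and batch layout are illustrative.

```python
import torch
import torch.nn.functional as F

def ddqn_update(q_net, target1, target2, optimizer, batch, gamma=0.99):
    """One gradient step on E[(y - Q(s,a;theta))^2], with the target built from
    the smaller of the two target networks' best action values (a sketch of the
    double-target scheme described above)."""
    s, a, r, s_next, term = batch          # states, long actions, rewards, next states, done flags
    with torch.no_grad():
        q1 = target1(s_next).max(dim=1).values
        q2 = target2(s_next).max(dim=1).values
        y = r + (1.0 - term) * gamma * torch.minimum(q1, q2)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_targets(q_net, target1, target2):
    # Every I steps, copy theta into both target networks.
    target1.load_state_dict(q_net.state_dict())
    target2.load_state_dict(q_net.state_dict())
```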
Algorithm implementation
According to the design of each part of the Markov decision process, the implementation flow of the satellite dynamic channel resource allocation algorithm based on deep reinforcement learning is as follows:
s1, initializing:
s101, initializing relevant parameters of a satellite simulation platform;
s102, initializing deep reinforcement learning parameters;
S103, initializing the Q-network parameter θ and setting the target network 1 and target network 2 parameters θ_1 = θ, θ_2 = θ;
S104, initializing scene parameters: W = ∅, N_block = 0, N_arrival = 0, Term = 0, step count i = 0, where N_block is the cumulative number of blocked users within a given period and N_arrival is the total number of user requests;
s2, training repetition stage:
s201, generating a new service distribution scene under the current iteration;
S202, generating a new service request u_t and combining it with the current online service information U_t and the channel allocation matrix W_t at the current moment;
S203, constructing the state s_t according to the state definition in the MDP;
S204, calculating the currently available channel resources and updating the action space A(s_t);
S205, if A(s_t) is an empty set, updating the blocked traffic count N_block = N_block + 1, setting Term = 1, and obtaining the immediate reward value r_t;
S206, if A(s_t) is non-empty, selecting the action a_t = argmax_{a ∈ A(s_t)} Q(s_t, a) according to the ε-greedy strategy and obtaining the immediate reward value r_t;
S207, obtaining the next state s_{t+1} and storing the tuple (s_t, a_t, r_t, s_{t+1}, Term) in the experience pool;
S208, updating the parameters W, U and updating the service arrival count N_arrival = N_arrival + 1;
S209, updating the parameter ε according to the exponential update strategy; if the network training threshold is reached, selecting a batch of data from the experience pool and updating the network parameters according to the formula;
S210, every I steps, copying the Q-network parameter θ to the two target Q networks θ_1 and θ_2, and obtaining the blocking rate P from N_block and N_arrival;
S211, finishing policy training and outputting the channel allocation result.
When the ε-greedy strategy is used for policy output, ε follows a linear descent scheme: as the policy is learned, the agent's exploitation weight gradually increases and its exploration weight decreases. When ε reaches the terminal value ε_final, the algorithm ends.
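The overall training flow of steps S1-S2 can be condensed into the following sketch. It reuses the select_action, ddqn_update and sync_targets sketches above; the env interface (reset / available_actions / step), the sample_batch helper and the multiplicative ε decay are assumptions standing in for the satellite simulation platform and the exact ε schedule.

```python
import random
from collections import deque

def train_ddqn_dca(env, q_net, target1, target2, optimizer,
                   episodes=300, eps_start=1.0, eps_final=0.05, eps_decay=0.999,
                   buffer_size=50_000, batch_size=64, sync_every=500):
    # env.reset() -> initial state; env.available_actions(s) -> list of free channels;
    # env.step(a) -> (next_state, reward, done). These are assumed wrappers around
    # the satellite simulation platform, not interfaces defined above.
    replay, eps, step = deque(maxlen=buffer_size), eps_start, 0
    n_block = n_arrival = 0
    for _ in range(episodes):                                # S201: new traffic scene
        state, done = env.reset(), False
        while not done:
            actions = env.available_actions(state)           # S204: action space A(s_t)
            if not actions:                                  # S205: empty set -> blocked
                n_block += 1
            action = select_action(q_net, state, actions, eps)   # S206: epsilon-greedy
            next_state, reward, done = env.step(action)
            replay.append((state, action, reward, next_state, done))   # S207
            n_arrival += 1                                   # S208
            eps = max(eps_final, eps * eps_decay)            # S209 (decay scheme assumed)
            if len(replay) >= batch_size:
                # sample_batch (assumed helper) stacks a random minibatch into tensors;
                # blocked transitions with action None would need special handling.
                ddqn_update(q_net, target1, target2, optimizer,
                            sample_batch(replay, batch_size))
            step += 1
            if step % sync_every == 0:                       # S210: copy theta to targets
                sync_targets(q_net, target1, target2)
            state = next_state
    return n_block / max(n_arrival, 1)                       # blocking rate P
```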
1.5 experimental simulation and analysis
The simulation experiments are based on a Python experimental platform; the network model is implemented with the deep learning framework PyTorch, and training and testing are performed on an NVIDIA GeForce RTX 3090 GPU. The satellite communication system simulation parameters and the reinforcement learning simulation parameters used in the experiments are listed in fig. 9. The service arrival model follows a Poisson distribution, the service inter-arrival times follow an exponential distribution, and the service duration is a random value between 3 and 6 minutes. The experimental results show that the proposed algorithm achieves a lower service blocking rate and better performance in different satellite communication scenarios.
To approximate a real satellite communication scenario, the number of users to be allocated in a single iteration is set to 1850 during training, and the users are distributed uniformly across the beams. The traffic arrival rate is set to 70 and the duration of a user communication is a random value between 3 and 6 minutes. A request is declared blocked when the communication rate allocated to the applying user does not reach the set threshold.
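The traffic model used in the simulation (Poisson arrivals with exponential inter-arrival times, 3-6 minute durations, uniform beam assignment) can be generated as in the following sketch; the function name and the seed are assumptions, while the 19-beam layout follows the state-construction description.

```python
import numpy as np

def generate_traffic(arrival_rate_per_min, n_users, n_beams=19, seed=0):
    """Poisson arrival process: exponential inter-arrival times at the given rate,
    service duration uniform in [3, 6] minutes, beams drawn uniformly."""
    rng = np.random.default_rng(seed)
    inter_arrivals = rng.exponential(1.0 / arrival_rate_per_min, size=n_users)
    arrival_times = np.cumsum(inter_arrivals)               # minutes
    durations = rng.uniform(3.0, 6.0, size=n_users)         # minutes
    beams = rng.integers(0, n_beams, size=n_users)          # uniform beam choice
    return list(zip(arrival_times, durations, beams))

requests = generate_traffic(arrival_rate_per_min=70, n_users=1850)
print(requests[0])
```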
Fig. 6 is the blocking-rate curve of the DDQN-DCA training process. The curve essentially matches the expected behaviour: the system blocking rate decreases gradually as the number of training iterations increases. The blocking rate drops quickly in the early iterations and converges to a low value after roughly 300 iterations. In subsequent iterations, because the randomness of the scenario model is always present, the reward value fluctuates within a small range, but the fluctuation is small, which further indicates that the double-target-network construction effectively ensures the stability of the system. This result directly demonstrates the effectiveness of the scenario modelling and the algorithm design.
To compare experimental results further, the optimized policy obtained by reinforcement learning training is compared with fixed channel allocation (FCA) and hybrid channel allocation (HCA). Two cases are considered: uniformly and non-uniformly distributed traffic. Fig. 4 and fig. 5 show the per-beam traffic arrival rates for the uniform and non-uniform distributions respectively. The service duration for both distributions is set to 3 minutes, and fig. 7 and fig. 8 compare the performance of the three algorithms at different traffic arrival rates under the uniform and non-uniform distributions respectively.
As shown in fig. 7, the system blocking rate of all three algorithms increases as the traffic arrival rate grows. At low traffic arrival rates, newly arriving requests can be satisfied through inter-beam channel allocation, so the system blocking rate is low; once the traffic load becomes large, the performance gains of the algorithms become less pronounced, mainly because the number of available channels shrinks and even dynamic scheduling can hardly satisfy more new requests. Compared with the other two algorithms, the DDQN-DCA algorithm of the invention is superior at all traffic arrival rates and is especially strong at high arrival rates, because when selecting the channel with the best signal quality it considers not only the current state but also the time-domain correlation, giving it a larger traffic-carrying capacity.
Fig. 8 shows the performance simulation analysis for the non-uniform traffic distribution. The overall trend of the three algorithms remains similar to fig. 7. Compared with fig. 7, the FCA and HCA algorithms are strongly affected by the traffic distribution, whereas the performance of the proposed algorithm changes little between the two distributions, which further demonstrates its robustness.

Claims (8)

1. The multi-beam satellite communication resource distribution system based on deep reinforcement learning comprises satellite equipment and ground equipment, and is characterized in that: the satellite equipment comprises a satellite communication module, satellite available resource distinguishing equipment and satellite resource distribution equipment, wherein the ground equipment comprises a satellite beam coverage ground range user set, user service performance distinguishing equipment and a ground user communication module, the satellite communication module is connected with the satellite available resource distinguishing equipment, the satellite available resource distinguishing equipment is connected with the satellite resource distribution equipment, the satellite resource distribution equipment is connected with the satellite communication module, the satellite beam coverage ground range user set is connected with the user service performance distinguishing equipment, the user service performance distinguishing equipment is connected with the ground user communication module, the satellite beam coverage ground range user set is connected with the ground user communication module, and the ground user communication module is connected with the satellite communication module.
2. The deep reinforcement learning based multi-beam satellite communication resource allocation system of claim 1, wherein: the satellite communication module is arranged in a preset satellite communication area and used for receiving data to be communicated sent by an antenna and a satellite sensing device in the preset satellite communication area, the ground user communication module is arranged in a satellite beam coverage ground range and used for receiving communication data transmitted by the antenna in the communication area, and the satellite beam coverage ground range user set is composed of all online users under the current satellite wave number.
3. The deep reinforcement learning based multi-beam satellite communication resource allocation system of claim 1, wherein: the satellite communication module and the ground user communication module both comprise a modulation-demodulation device and a communication antenna which are sequentially connected, the satellite available resource discriminator is also provided with a memory, the memory is used for storing historical resource allocation information, historical user information, satellite single-beam transmitting power upper limit, satellite per-channel transmitting power upper limit and satellite total transmitting power upper limit which are sent to the ground user by the satellite communication resource allocation system, the satellite available resource discriminator outputs available channel resources of current satellite equipment based on the stored historical information and a discrimination algorithm, the satellite communication equipment and the ground user adopt wireless channels to transmit data, and white noise interference and wireless channel natural signal attenuation inside the communication equipment are considered in the transmission process.
4. The deep reinforcement learning based multi-beam satellite communication resource allocation system of claim 1, wherein: the user service performance judging device is provided with a memory, the memory is used for storing the lower limit of the user communication speed, the user service performance judging device calculates the signal-to-interference-and-noise ratio by using the shannon theorem, and the resource allocation effect is estimated.
5. The multi-beam satellite communication resource allocation method based on deep reinforcement learning, which is applied to the multi-beam satellite communication resource allocation system based on deep reinforcement learning as set forth in any one of claims 1 to 4, is characterized in that: the method comprises the following steps:
s1, initializing:
s101, initializing relevant parameters of a satellite simulation platform;
s102, initializing deep reinforcement learning parameters;
S103, initializing the Q-network parameter θ, the target network 1 parameter θ_1 = θ and the target network 2 parameter θ_2 = θ;
S104, initializing scene parameters: W = ∅, N_block = 0, N_arrival = 0, Term = 0, step count i = 0, where N_block is the cumulative number of blocked users within a given period and N_arrival is the total number of user requests;
s2, training repetition stage:
s201, generating a new service distribution scene under the current iteration;
S202, generating a new service request u_t and combining it with the current online service information U_t and the channel allocation matrix W_t at the current moment;
S203, constructing the state s_t according to the state definition in the MDP;
S204, calculating the currently available channel resources and updating the action space A(s_t);
S205, if A(s_t) is an empty set, updating the blocked traffic count N_block = N_block + 1, setting Term = 1, and obtaining the immediate reward value r_t;
S206, if A(s_t) is non-empty, selecting the action a_t = argmax_{a ∈ A(s_t)} Q(s_t, a) according to the ε-greedy strategy and obtaining the immediate reward value r_t;
S207, obtaining the next state s_{t+1} and storing the tuple (s_t, a_t, r_t, s_{t+1}, Term) in the experience pool;
S208, updating the parameters W, U and updating the service arrival count N_arrival = N_arrival + 1;
S209, updating the parameter ε according to the exponential update strategy; if the network training threshold is reached, selecting a batch of data from the experience pool and updating the network parameters according to the formula;
S210, every I steps, copying the Q-network parameter θ to the two target Q networks θ_1 and θ_2, and obtaining the blocking rate P from N_block and N_arrival;
S211, finishing policy training and outputting the channel allocation result.
6. The method for allocating multi-beam satellite communication resources based on deep reinforcement learning according to claim 5, wherein: in updating the Q network parameter θ, the Adam optimizer is used to optimize the following loss function:
loss = E[(y_j − Q(s_j, a_j; θ))^2]
where the target value y_j is given by
y_j = r_j + (1 − Term_{j+1}) × γ × min_{i∈{1,2}} max_{a'} Q(s_{j+1}, a'; θ_i);
when the system reaches the termination state, Term_{j+1} is set to 1, otherwise it is set to 0;
the parameter θ_1 of target network Q_1 and the parameter θ_2 of target network Q_2 are updated by copying the Q-network parameter θ every fixed number of steps.
7. The method for allocating multi-beam satellite communication resources based on deep reinforcement learning according to claim 5, wherein: in step S203, the state expresses all useful information about the environment observable by the sensor at the current moment, to support the agent's decision; the main factors influencing the policy output are the current online service information U_t, the channel allocation matrix W_t at the current moment and the current new service request information u_t, and the observation at the current moment is expressed as
o_t = (U_t, W_t, u_t)
the policy network adopts a convolutional neural network followed by a fully-connected network; since the state s_t must be read into the neural network to produce the policy output, its representation is reconstructed to facilitate network policy learning; before the state is input to the network, manual feature extraction is performed first; because the co-frequency signals in the beams adjacent to the beam of the new service request are the main factor of channel interference at the current moment, the online service information is simplified and only the online services near the beam of the new service request are considered; combined with the channel allocation, the state s_t is expressed as
s_t = {s_{i,j,k} | 1 ≤ i ≤ 10, 1 ≤ j ≤ 10, 1 ≤ k ≤ M+1},
s_t ∈ R^{10×10×(M+1)};
specifically, an all-zero three-dimensional state tensor s_t is first initialized; next, all online services of the beam containing the new service request and of its surrounding two layers of beams are extracted, the 19 beams being mapped into a 10 × 10 grid; taking service k as an example, its two-dimensional coordinates (i, j) in the grid are obtained; if user k is the new service request, (i, j, M+1) is set to 1; if user k is a historical online service, its channel m is determined and (i, j, m) is set to 1; looping over all online services around the beam completes the representation of s_t;
when the system has no available channel resources or all service requests have been allocated, the system reaches the termination state and the single round ends.
8. The method for allocating multi-beam satellite communication resources based on deep reinforcement learning according to claim 5, wherein: the action space A_t is defined as the set of channels available in the current state, and the action a_t is defined as the channel the satellite communication system allocates to the service request in state s_t, which belongs to the action space; based on the current environment observation and the Q-network policy, the satellite follows an ε-greedy strategy, either randomly sampling an action from the action space or selecting the action with the largest state-action value as the policy output; if the current action space is an empty set, no channel resource can be allocated to the service request and the system is blocked;
the reward value r_t is the feedback evaluation of the system's current policy output, and its definition guides the direction of policy learning; the reward is designed with R_1 and R_2 as two hyper-parameters, P_block^{t-1} as the blocking probability at the previous moment and P_block^t as the blocking probability at the current moment; the reward function follows the real-time change of the system blocking rate, and when the system blocking rate decreases the first term is positive, encouraging the agent to explore in the direction of the current policy; the positive and negative changes of the reward value induced by the system blocking rate effectively avoid the situation where an always-positive reward makes the policy learning effect heavily dependent on sampling.
CN202310363998.0A 2023-04-06 2023-04-06 Multi-beam satellite communication resource distribution system and method based on deep reinforcement learning Pending CN116505998A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310363998.0A CN116505998A (en) 2023-04-06 2023-04-06 Multi-beam satellite communication resource distribution system and method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310363998.0A CN116505998A (en) 2023-04-06 2023-04-06 Multi-beam satellite communication resource distribution system and method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116505998A true CN116505998A (en) 2023-07-28

Family

ID=87325774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310363998.0A Pending CN116505998A (en) 2023-04-06 2023-04-06 Multi-beam satellite communication resource distribution system and method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116505998A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117978262B (en) * 2024-04-02 2024-05-31 华信正能集团有限公司 Internet of things data and communication transmission device for space-based satellite constellation


Similar Documents

Publication Publication Date Title
Hu et al. A deep reinforcement learning-based framework for dynamic resource allocation in multibeam satellite systems
CN109947545B (en) Task unloading and migration decision method based on user mobility
CN113162682B (en) PD-NOMA-based multi-beam LEO satellite system resource allocation method
CN114499629B (en) Dynamic allocation method for jumping beam satellite system resources based on deep reinforcement learning
CN114389678B (en) Multi-beam satellite resource allocation method based on decision performance evaluation
KR20190103681A (en) Resource allocating method for wireless backhaul network and apparatus based on machine learning
CN109743735A (en) A kind of dynamic channel assignment method based on depth enhancing study in satellite communication system
CN111526592B (en) Non-cooperative multi-agent power control method used in wireless interference channel
CN114698128B (en) Anti-interference channel selection method and system for cognitive satellite-ground network
CN112583453A (en) Downlink NOMA power distribution method of multi-beam LEO satellite communication system
US11871251B2 (en) Method of association of user equipment in a cellular network according to a transferable association policy
CN115441939B (en) MADDPG algorithm-based multi-beam satellite communication system resource allocation method
CN114071528A (en) Service demand prediction-based multi-beam satellite beam resource adaptation method
CN116456493A (en) D2D user resource allocation method and storage medium based on deep reinforcement learning algorithm
CN114900897B (en) Multi-beam satellite resource allocation method and system
CN115103409A (en) Resource allocation method for multi-beam unmanned aerial vehicle cooperative communication
CN117412391A (en) Enhanced dual-depth Q network-based Internet of vehicles wireless resource allocation method
CN116505998A (en) Multi-beam satellite communication resource distribution system and method based on deep reinforcement learning
CN115173926B (en) Communication method and communication system of star-ground fusion relay network based on auction mechanism
CN116567667A (en) Heterogeneous network resource energy efficiency optimization method based on deep reinforcement learning
CN113541768B (en) NOMA-based LEO satellite communication system frequency point distribution method
CN115942460A (en) Low-orbit satellite wireless resource scheduling method and device based on resource map and countermeasure learning
Chang et al. Resource Allocation Using Deep Reinforcement Learning in GEO Multibeam Satellite System
Wang et al. Label-free Deep Learning Driven Secure Access Selection in Space-Air-Ground Integrated Networks
CN115173922B (en) Multi-beam satellite communication system resource allocation method based on CMADDQN network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication