CN114599100A - Beam resource allocation method and device


Info

Publication number
CN114599100A
Authority
CN
China
Prior art keywords
neural network
initial
network model
resource allocation
model
Prior art date
Legal status
Granted
Application number
CN202210231703.XA
Other languages
Chinese (zh)
Other versions
CN114599100B (en)
Inventor
王磊
赵中天
徐静
李凡
樊思萌
Current Assignee
32039 Unit Of Chinese Pla
Xian Jiaotong University
Original Assignee
32039 Unit Of Chinese Pla
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by 32039 Unit Of Chinese Pla and Xian Jiaotong University
Priority to CN202210231703.XA
Publication of CN114599100A
Application granted
Publication of CN114599100B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/04 Wireless resource allocation
    • H04W72/044 Wireless resource allocation based on the type of the allocated resource
    • H04W72/046 Wireless resource allocation based on the type of the allocated resource, the resource being in the space domain, e.g. beams
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/06 Testing, supervising or monitoring using simulated traffic
    • H04W84/00 Network topologies
    • H04W84/02 Hierarchically pre-organised networks, e.g. paging networks, cellular networks, WLAN [Wireless Local Area Network] or WLL [Wireless Local Loop]
    • H04W84/04 Large scale networks; Deep hierarchical networks
    • H04W84/06 Airborne or Satellite Networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention provides a beam resource allocation method and device, relating to the technical field of satellite communications, comprising the following steps: first, acquiring beam capacity demand information of a user terminal; then, inputting the beam capacity demand information into a beam resource allocation model to obtain a beam resource allocation result of the user terminal; the beam resource allocation model is a neural network model obtained by performing model training based on different sample beam capacity requirement information and the corresponding beam resource allocation results, with the model parameters updated by a proximal policy optimization strategy during the training process. By updating the model parameters based on the proximal policy, the method ensures that the obtained beam resource allocation model fits actual conditions, thereby guaranteeing the timeliness of the beam resource allocation calculation while also taking the accuracy of the beam resource allocation result into account.

Description

Beam resource allocation method and device
Technical Field
The present invention relates to the field of satellite communications technologies, and in particular, to a method and an apparatus for allocating beam resources.
Background
At present, a satellite internet system needs to efficiently allocate power and frequency band resources to each user beam according to user requirements, so as to meet user capacity requirements as far as possible and adapt to scenarios in which user requirements change rapidly and dynamically over time. Existing meta-heuristic resource allocation algorithms can provide solutions to this problem, but they usually need many iterations to converge, consume considerable time, and thus hurt the timeliness of the resource allocation scheme. To solve the timeliness problem, researchers have applied reinforcement learning to the resource allocation problem in satellite communication systems and proposed reinforcement-learning-based communication satellite resource allocation methods. Many researchers use the DQN (Deep Q-Network) algorithm to handle such problems, but when facing continuous resource allocation problems, discretization is usually required, which affects the accuracy of the allocation result; the degree of discretization also increases the computational complexity of the algorithm and can even cause the curse of dimensionality.
Disclosure of Invention
The present invention aims to provide a beam resource allocation method and device, so as to solve the technical problem in the prior art that allocation results obtained in time-critical scenarios have poor accuracy.
In a first aspect, the beam resource allocation method provided by the present invention includes: acquiring beam capacity demand information of a user terminal; and inputting the beam capacity demand information into a beam resource allocation model to obtain a beam resource allocation result of the user terminal; the beam resource allocation model is a neural network model obtained by performing model training based on different sample beam capacity requirement information and the beam resource allocation results corresponding to the different sample beam capacity requirement information, with the model parameters updated by a proximal policy optimization strategy during the model training process.
Further, the beam resource allocation method further includes: constructing an initial beam resource allocation model, where the initial beam resource allocation model comprises an agent and an environment module, and the agent comprises an initial actor neural network model and an initial critic neural network model; acquiring different sample beam capacity requirement information of the user terminal, and performing interactive operation between the initial actor neural network model and the environment module based on the different sample beam capacity requirement information to obtain at least one piece of state data, where the number of pieces of state data equals the number of interactive operations, and each piece of state data comprises a state, an action, and a reward corresponding to the state; and training the initial actor neural network model with a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial actor neural network model converge, to obtain a target actor neural network model.
Further, the performing interactive operation between the initial actor neural network model and the environment module to obtain at least one piece of state data includes: inputting the different sample beam capacity requirement information, as an initial state, into the initial actor neural network model by controlling the environment module to execute an initial action; controlling the initial actor neural network model to output a next action and sending the next action to the environment module; based on the next action, controlling the environment module to determine a reward corresponding to the initial state and a next state; determining the initial state, the initial action, and the reward corresponding to the initial state as one piece of state data; and sending the next state to the initial actor neural network model through the environment module, so that the initial actor neural network model keeps interacting with the environment module to obtain multiple pieces of state data.
Further, the training the initial actor neural network model with a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial actor neural network model converge, to obtain a target actor neural network model, includes: determining an advantage estimation set based on the at least one piece of state data, where each piece of state data corresponds to one advantage estimate; calculating a first loss value based on the advantage estimation set and a preset first loss function; and updating the parameters of the initial actor neural network model based on the first loss value, and determining the actor neural network model with updated parameters as the target actor neural network model.
Further, before calculating the first loss value based on the advantage estimation set and the preset first loss function, the method further includes: modifying the initial loss function by using an objective function to limit the upper and lower bounds of the initial loss function, to obtain the first loss function.
Further, the beam resource allocation method further includes: training the initial critic neural network model with a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial critic neural network model converge, to obtain a target critic neural network model.
Further, the training the initial critic neural network model with a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial critic neural network model converge, to obtain a target critic neural network model, includes: calculating a second loss value based on the advantage estimation set and a preset second loss function; and updating the parameters of the initial critic neural network model based on the second loss value, and determining the critic neural network model with updated parameters as the target critic neural network model.
In a second aspect, the beam resource allocation apparatus provided by the present invention includes: an obtaining unit, configured to obtain beam capacity requirement information of a user terminal; and an input unit, configured to input the beam capacity requirement information into a beam resource allocation model to obtain a beam resource allocation result of the user terminal; the beam resource allocation model is a neural network model obtained by performing model training based on different sample beam capacity requirement information and the beam resource allocation results corresponding to the different sample beam capacity requirement information, with the model parameters updated by a proximal policy optimization strategy during the model training process.
In a third aspect, the present invention further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program operable on the processor, and the processor executes the computer program to implement the steps of the beam resource allocation method.
In a fourth aspect, the present invention also provides a computer readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to execute the beam resource allocation method.
The present invention provides a beam resource allocation method and device, comprising the following steps: first, acquiring beam capacity demand information of a user terminal; then, inputting the beam capacity demand information into a beam resource allocation model to obtain a beam resource allocation result of the user terminal; the beam resource allocation model is a neural network model obtained by performing model training based on different sample beam capacity requirement information and the corresponding beam resource allocation results, with the model parameters updated by a proximal policy optimization strategy during the training process. By updating the model parameters based on the proximal policy, the method ensures that the obtained beam resource allocation model fits actual conditions, thereby guaranteeing the timeliness of the beam resource allocation calculation while also taking the accuracy of the beam resource allocation result into account.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a beam resource allocation method according to an embodiment of the present invention;
fig. 2 is a flowchart of another beam resource allocation method according to an embodiment of the present invention;
fig. 3 is a flowchart of another beam resource allocation method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a beam resource allocation apparatus according to an embodiment of the present invention.
Reference numerals:
11 - obtaining unit; 12 - input unit.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In recent years, satellite internet systems represented by constellations such as Starlink, OneWeb, and Kuiper have become a focus of attention in the global aerospace information field. On the one hand, as satellite internet systems serve users, users place ever higher requirements on service quality: user communication capacity demand is growing rapidly worldwide, and users are geographically non-uniformly distributed and dynamically variable. On the other hand, with the rapid development of satellite payload technology in the space segment, the number of user beams keeps increasing, and a large number of flexible communication payloads (payloads whose parameters can be adjusted on orbit) are configured, which supports dynamic allocation of beam power and bandwidth resources according to user demand. These factors make the resource management of a satellite internet system more difficult and complex than that of a traditional communication satellite system. Therefore, in order to improve the service quality of the satellite internet system and fully exploit the efficiency of its limited resources, the accuracy and timeliness of user beam resource allocation need to be improved.
That is, the satellite internet system needs to efficiently allocate power and frequency band resources to each user beam according to user demand, so as to meet user capacity requirements as far as possible and adapt to the rapid, dynamic change of user demand over time. Existing meta-heuristic resource allocation algorithms can provide solutions, but they usually need many iterations to converge, are time-consuming, and can hurt the timeliness of the resource allocation scheme. Meanwhile, reinforcement learning, a technology that has developed rapidly in recent years, exhibits excellent performance on many decision-making problems: a trained reinforcement learning model can adapt to different scenarios and efficiently provide an optimized decision scheme. Researchers have therefore applied reinforcement learning techniques to the resource allocation problem in satellite communication systems and proposed reinforcement-learning-based communication satellite resource allocation methods; however, these methods usually assume that the problem's decision variables are discrete, or discretize continuous variables before subsequent processing, so the quality of the resource allocation scheme is affected by the degree of discretization of the decision variables. With the existing discretization approach, when the discretization of the user beam power and frequency band decision variables is too coarse, it is difficult to find the optimal solution of the original continuous solution space, i.e., the optimal resource allocation scheme cannot be obtained; when the discretization is too fine, the reinforcement learning action space expands significantly and the computational cost rises. Therefore, a more efficient and faster joint optimization method and device is needed for the user beam power and frequency band resource allocation problem of the new generation of satellite internet systems.
Based on this, the present invention provides a method and an apparatus for allocating beam resources, which can implement reasonable allocation of user beam power and frequency band resources in a satellite internet system, and can also consider the accuracy of the beam resource allocation result while ensuring timeliness. To facilitate understanding of the present embodiment, a beam resource allocation method disclosed in the embodiment of the present invention is first described in detail.
Example 1:
According to an embodiment of the present invention, an embodiment of a beam resource allocation method is provided. It is noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, such as with a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from the order here.
Fig. 1 is a flowchart of a beam resource allocation method according to an embodiment of the present invention; as shown in Fig. 1, the method includes the following steps:
step S101, obtaining the beam capacity requirement information of the user terminal.
Step S102, the beam capacity requirement information is input into the beam resource allocation model, and a beam resource allocation result of the user terminal is obtained.
In the embodiment of the present invention, the user terminal can be understood as a satellite internet user, and "beam" is short for "user beam". Beam resources in a satellite internet system comprise power and bandwidth, so the beam resource allocation result comprises a power allocation result and a bandwidth allocation result. On this basis, the embodiment of the present invention is essentially a method for fast joint optimization of satellite internet user beam power and frequency band resources, and belongs to the field of satellite internet system resource management technology. To solve the practical problems addressed by the embodiment of the present invention, the power and bandwidth allocated to each user beam of the satellite internet system are taken as the decision variables. It should be noted that both power and bandwidth are continuous variables.
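As a minimal sketch of this working part (the `actor` module, names, and the 2N output layout are illustrative assumptions; the patent does not fix an API):

```python
import torch

def allocate_beam_resources(actor: torch.nn.Module, demand: torch.Tensor):
    """Working phase: map the N beam capacity demands to power/bandwidth allocations."""
    with torch.no_grad():               # inference only, no training
        action = actor(demand)          # assumed: network outputs a 2N-dim vector
    n = demand.shape[0]
    power, bandwidth = action[:n], action[n:]  # the two continuous decision variables
    return power, bandwidth
```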
In order to facilitate understanding of the beam resource allocation method according to the embodiment of the present invention, the satellite internet system on which the embodiment is based is described as follows: the satellite internet system (hereinafter referred to as the system) is composed of a space segment, a ground segment, and a user segment. The space segment consists of a low-orbit constellation, which generally covers the whole globe in a multi-beam manner. The ground segment consists of gateway stations and an operation control center; it mainly provides feeder links for the system and carries the functions of user admission, access, and system operation and maintenance management and control. The user segment comprises various types of user terminals, using the Ka band with circular polarization, distributed within the system's beam coverage. The number of beams is N, and 𝒩 = {1, 2, ..., N} is the set of all beams. It is assumed that there is one user terminal at the center of each beam, whose capacity requirement is the sum of the capacity requirements of all user terminals in the coverage area of that beam. It is further assumed that the feeder link is noiseless and that the beam downlink channel is an additive white Gaussian noise channel. The system adopts four-color frequency reuse, composed of two frequency bands and two polarization modes (left-hand and right-hand circular polarization). The system adopts the second-generation digital satellite television broadcasting extension standard (DVB-S2X), which includes an adaptive modulation and coding strategy, i.e., the system selects the optimal MODulation and CODing (MODCOD) scheme according to the link signal-to-noise ratio. The UCD (Unmet Capacity Demand) of the satellite internet system is a key index for measuring system performance, as shown in the following formula (1):

UCD = Σ_{i∈𝒩} max(D_i − R_i, 0)    (1)

where D_i is the capacity requirement of the user terminals within the coverage of beam i, and R_i is the capacity provided by the satellite internet system for beam i after the power and bandwidth allocation of all beams is completed and the link budget is calculated. D_i can also be understood as the capacity requirement of the user terminal for beam i, and it is a scalar. As can be seen from formula (1): if the user terminal's capacity requirement D_i for beam i is greater than the capacity R_i that the beam can provide, the unmet capacity demand of beam i is UCD_i = D_i − R_i; conversely, the unmet capacity demand of beam i is UCD_i = 0, meaning that the beam can completely satisfy the capacity requirements of all user terminals in its coverage area.
The problem of jointly allocating the power and bandwidth of a satellite internet system in a continuous space is NP-hard, and conventional methods can hardly meet the timeliness requirements of practical application scenarios. The embodiment of the present invention adopts a deep reinforcement learning method to solve the joint optimization problem of the two continuous variables, power and bandwidth; meanwhile, in order to allocate power and bandwidth reasonably in a continuous space, the policy of the neural network is updated using an algorithm based on the Actor-Critic framework, namely Proximal Policy Optimization (PPO). Specifically, the embodiment of the present invention is divided into two parts, learning and working: the above steps S101 to S102 are the working part, and the beam capacity requirement information in step S101 is the actual beam capacity requirement information used by the working part. The learning part includes the operations performed before step S101; details are given in steps S103 to S107 below and are not repeated here. The multiple pieces of beam capacity requirement information used by the learning part are the different sample beam capacity requirement information. The beam resource allocation model is a neural network model obtained by performing model training based on different sample beam capacity requirement information and the corresponding beam resource allocation results, with the model parameters updated by a proximal policy optimization strategy during the training process.
In general, the embodiment of the present invention is directed to fast joint optimization of satellite internet user beam power and frequency band resources. The beam resource allocation model in the embodiment of the present invention is based on a deep reinforcement learning framework, which avoids the long computation time of intelligent optimization algorithms such as the classical genetic algorithm and can meet the timeliness requirement of system resource allocation. Meanwhile, the parameters of the neural network in the deep reinforcement learning framework are updated using the proximal policy optimization technique, which overcomes the limitation of existing deep reinforcement learning methods that discretize the resource allocation decision variables, and finally solves for a high-quality joint optimization scheme of user beam power and frequency band resources. In other words, by executing steps S101 to S102 and updating the model parameters based on the proximal policy, the embodiment of the present invention can ensure that the obtained beam resource allocation model fits actual conditions, thereby guaranteeing the timeliness of the beam resource allocation calculation while also taking the accuracy of the beam resource allocation result into account.
In an optional embodiment, as shown in fig. 2, the method further includes the following steps S103 to S105, wherein: step S103, constructing an initial beam resource allocation model, where the initial beam resource allocation model comprises an agent and an environment module, and the agent comprises an initial actor neural network model and an initial critic neural network model; step S104, acquiring different sample beam capacity requirement information of the user terminal, and performing interactive operation between the initial actor neural network model and the environment module based on the different sample beam capacity requirement information to obtain at least one piece of state data, where the number of pieces of state data equals the number of interactive operations, and each piece of state data comprises a state, an action, and a reward corresponding to the state; and step S105, training the initial actor neural network model with a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial actor neural network model converge, to obtain a target actor neural network model.
The core idea of deep reinforcement learning in the embodiment of the present invention is to model the satellite internet system as an agent, and to model related factors such as user demand and service conditions as the environment (i.e., the above environment module). The agent and the environment learn through continuous interaction so as to maximize the long-term return. As described above, the embodiment of the present invention is divided into a learning part and a working part: during learning, multiple groups of beam demand data (i.e., the above D_i) are provided for the agent to learn from, until it has learned the mapping from beam demand scenarios to power and frequency band resource allocation results; during working, based on the learned deep neural network (i.e., the target actor neural network model), a joint optimization scheme for user beam power and frequency band resources is given quickly, without repeated iterative optimization, so the method is time-efficient.
The beam resource allocation model comprises a target actor neural network model and a target critic neural network model. In the embodiment of the present invention, the target actor neural network model in the beam resource allocation model can be established by executing the above steps S103 to S105. The specific steps of the model establishment process are described in detail here: in step S104, performing interactive operation between the initial actor neural network model and the environment module to obtain at least one piece of state data comprises the following steps S201 to S205, wherein: step S201, inputting the different sample beam capacity requirement information, as an initial state, into the initial actor neural network model by controlling the environment module to execute an initial action; step S202, controlling the initial actor neural network model to output a next action and sending the next action to the environment module; step S203, based on the next action, controlling the environment module to determine a reward corresponding to the initial state and a next state; step S204, determining the initial state, the initial action, and the reward corresponding to the initial state as one piece of state data; and step S205, sending the next state to the initial actor neural network model through the environment module, so that the initial actor neural network model keeps interacting with the environment module to obtain multiple pieces of state data. The purpose of obtaining multiple pieces of state data in the embodiment of the present invention is to provide a data basis for proximal policy optimization, as sketched below.
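A minimal sketch of this interaction loop (steps S201 to S205); the environment object and its `reset`/`step` methods are illustrative assumptions, not an API defined in the patent:

```python
def collect_trajectory(actor, env, sample_demand, length):
    """Interact the actor with the environment to collect `length` pieces of state data."""
    pieces = []
    state = env.reset(sample_demand)            # S201: demand info becomes the initial state
    for _ in range(length):
        action = actor.act(state)               # S202: actor outputs the next action
        reward, next_state = env.step(action)   # S203: environment returns reward and next state
        pieces.append((state, action, reward))  # S204: one piece of state data
        state = next_state                      # S205: environment feeds the next state back
    return pieces
```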
In an optional embodiment, in step S105, training the initial actor neural network model with a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial actor neural network model converge, to obtain the target actor neural network model, comprises the following steps S301 to S303, wherein: step S301, determining an advantage estimation set based on the at least one piece of state data, where each piece of state data corresponds to one advantage estimate; step S302, calculating a first loss value based on the advantage estimation set and a preset first loss function; and step S303, updating the parameters of the initial actor neural network model based on the first loss value, and determining the actor neural network model with updated parameters as the target actor neural network model. It should be noted that the above steps S301 to S303 are a specific implementation of the proximal policy optimization strategy, intended to update the parameters of the initial actor neural network model so as to provide a truly effective target actor neural network model for beam resource allocation.
In an optional embodiment, before calculating the first loss value based on the advantage estimation set and the preset first loss function in step S302, the method further includes: step S304, modifying the initial loss function by using an objective function to limit its upper and lower bounds, to obtain the first loss function. The above objective function may be referred to as a clip function. In the embodiment of the present invention, this design of the first loss function avoids excessively large updates to the parameters of the initial actor neural network model, so that the whole learning and updating process of the model is more stable and fluctuates less; the concrete clipped form is given as formula (7) in Embodiment 2 below.
In an optional embodiment, as shown in fig. 2, the beam resource allocation method further includes: step S106, training the initial critic neural network model with a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial critic neural network model converge, to obtain a target critic neural network model. As shown in fig. 2, the embodiment of the present invention may further include step S107, forming the beam resource allocation model from the target actor neural network model and the target critic neural network model.
In step S106, training the initial critic neural network model with a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial critic neural network model converge, to obtain the target critic neural network model, comprises: step S401, calculating a second loss value based on the advantage estimation set and a preset second loss function; and step S402, updating the parameters of the initial critic neural network model based on the second loss value, and determining the critic neural network model with updated parameters as the target critic neural network model. The initial critic neural network model described above is used to determine the state value function corresponding to each piece of state data, and the state value function is used to calculate the corresponding advantage estimate. Accordingly, in step S301, determining the advantage estimation set based on the at least one piece of state data comprises the following steps S501 and S502: step S501, inputting the at least one piece of state data into the initial critic neural network model to obtain the state value function corresponding to each piece of state data; and step S502, calculating the advantage estimate corresponding to each piece of state data according to the state value functions, to obtain the advantage estimation set.
The embodiment of the present invention relates to a fast joint optimization method for user beam power and frequency band resources of a satellite internet system. Based on a deep reinforcement learning framework, it avoids the time-consuming resource allocation calculation of intelligent optimization algorithms such as the classical genetic algorithm; and by updating the parameters of the neural network with the proximal policy optimization technique, it overcomes the limitation of discretizing resource allocation decision variables in existing reinforcement learning methods. The deep reinforcement learning method provided by the embodiment of the present invention can greatly improve the timeliness of resource allocation calculation at the cost of an acceptable loss in the system's unmet capacity demand, while obtaining a high-quality joint optimization scheme for user beam power and frequency band resources.
Example 2:
fig. 3 shows a flowchart of another method for allocating beam resources according to an embodiment of the present invention. For convenience of understanding, the embodiments of the present invention are described in two parts:
the first part defines states, actions, and rewards for deep reinforcement learning as follows:
state st: the state is the input to the neural network (i.e., the operator network in FIG. 3) and is the basis for the agent to take action. In the modeling process of the embodiment of the present invention, a factor influencing the decision of the agent is the capacity requirement of each beam, so that it can be used as a state in a Deep Reinforcement Learning (DRL) model, as shown in the following formula (2):
s_t = D_t = [D_{1,t}, D_{2,t}, ..., D_{N,t}]    (2)
where D_t is a matrix containing a plurality of scalars, and each scalar D_{i,t} is the capacity requirement of the user terminal for beam i at time t.
Action a_t: in the satellite internet user beam power and bandwidth joint allocation problem to be solved by the embodiment of the present invention, the agent allocates resources to each beam by analyzing the user terminals' capacity demands for each beam at the current moment, and thereby gives a power and bandwidth allocation scheme for each beam; the action can therefore be defined by formula (3) as follows:
a_t = {P_t, B_t}    (3)
where P_t is a matrix composed of multiple scalars P_{i,t}, i.e., P_t = [P_{1,t}, P_{2,t}, ..., P_{N,t}], and P_{i,t} is the power allocated by the satellite internet system to beam i at time t; B_t is a matrix composed of multiple scalars B_{i,t}, i.e., B_t = [B_{1,t}, B_{2,t}, ..., B_{N,t}], and B_{i,t} is the bandwidth allocated by the satellite internet system to beam i at time t.
Because the user beams of the satellite internet system considered by the embodiment of the present invention adopt the four-color frequency reuse technique, adjacent beams with the same polarization are jointly limited by the total bandwidth. In order to reduce UCD as much as possible, reduce the dimensionality of the neural network output, and facilitate the learning convergence of the agent, the embodiment of the present invention satisfies the bandwidth limitation through the following formula (4):
BW_m + BW_n = BW_total,  for adjacent beams m, n ∈ AP_p    (4)
where BW_m denotes the bandwidth allocated to beam m in the adjacent-beam set; BW_n denotes the bandwidth allocated to beam n, the neighbor of beam m in the adjacent-beam set; BW_total denotes the total bandwidth available to system user beams; and AP_p denotes the set of adjacent beams with polarization mode p. BW_total and AP_p are known quantities, while BW_m and BW_n are unknown quantities. Therefore, within the beam set 𝒩, bandwidth resources only need to be decided for one beam of each pair of same-polarization adjacent beams, after which the bandwidth allocation results of all beams at the current moment can be obtained (see the sketch below). In addition, according to the characteristics of the technical problem to be solved by the embodiment of the present invention, the reinforcement learning reward function r_t is specially designed: in the embodiment of the present invention, r_t is not greater than 0; the closer r_t is to 0, the better the strategy, and the smaller r_t is, the worse the strategy.
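Under this reading of formula (4), the network decides the bandwidth of one beam per same-polarization adjacent pair and the partner's bandwidth follows by subtraction; a minimal sketch (the data layout is an assumption):

```python
def expand_bandwidths(decided, adjacent_pairs, bw_total):
    """Derive all beam bandwidths from the independently decided ones.

    decided: {beam m: BW_m} for the one decided beam of each same-polarization pair.
    adjacent_pairs: list of (m, n) adjacent same-polarization beam pairs (AP_p).
    bw_total: BW_total available to a pair of adjacent user beams.
    """
    bw = dict(decided)
    for m, n in adjacent_pairs:
        bw[n] = bw_total - bw[m]   # formula (4): BW_m + BW_n = BW_total
    return bw
```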
Reward r_t: the reward is the feedback that the environment gives the agent after the agent observes state s_t and outputs action a_t. After taking action a_t, the agent needs to know whether that action meets or approaches the optimization objective, so as to decide whether to strengthen or weaken action a_t in later action selection, better meet the requirements of the system design, and learn the optimal joint power and bandwidth allocation strategy that minimizes the UCD of the system. The reward r_t, as the environment's feedback on action a_t, is the index for evaluating how good action a_t is. The optimization objective in the embodiment of the present invention is to minimize the UCD of the system, so the reward r_t should be related to the optimization objective function. Moreover, since practical implementation showed that performing extra operations on the neural network output to satisfy the power and bandwidth limits of a single beam is unfavorable to the learning convergence of the neural network, the power and bandwidth limits are instead incorporated into the design of r_t. The calculation of r_t is designed as the following formula (5):
[Formula (5) was published as an image and is not recoverable here; per the surrounding text, r_t is a non-positive reward related to the system UCD, with penalty terms scaled by the coefficients ζ and η that encode the per-beam power and bandwidth limits.]
where P_{i,t} denotes the power allocated to beam i at time t, P_max denotes the maximum power that can be allocated to any beam, and B_{i,t} is the bandwidth allocated by the satellite internet system to beam i at time t; ζ and η are distinct coefficients that require repeated trial calculation and are selected empirically.
As shown in fig. 3, in the process of fast joint optimization of user beam power and frequency band resources in the satellite internet system, the agent first observes the current environment state s_t; the actor network then takes s_t as input and outputs action a_t, which acts on the environment, changing the state to s_{t+1} and yielding a reward r_t. The PPO algorithm module (PPO is a part of DRL) collects and stores the states, rewards, and other information generated in this process; after a certain interval, each of the L stored states is input into the critic network in turn to obtain the corresponding state value function, from which the corresponding advantage estimates, the first loss value, and the second loss value are calculated; finally, an Adam optimizer is used to update the parameters of the actor network, so that the actions output by the actor neural network better meet the optimization objective. It should be noted that the L states are not input simultaneously: the function of the critic network is to give the state value function corresponding to one state, and the L states are input one by one in order.
In the second part, the implementation process of fast joint optimization of user beam power and frequency band resources based on deep reinforcement learning follows the pseudo-code of the algorithm given in Table 1. Considering that the PPO algorithm is a policy gradient algorithm based on the Actor-Critic framework and performs well on continuous action space problems, the embodiment of the present invention uses the PPO algorithm to update the deep neural network parameters. The main implementation flow of the algorithm is shown in Table 1:
TABLE 1 code flow chart of satellite internet user beam power and frequency band resource fast joint optimization algorithm based on deep reinforcement learning
[Table 1 was published as an image; its pseudo-code content is not recoverable here.]
As can be seen from table 1, in step 1, relevant parameters of the satellite system are initialized, where the relevant parameters include, but are not limited to, the number of beams, the orbital altitude of a satellite, the total power of a satellite load, the maximum beam power, the center frequency of the beam, the total available bandwidth of a downlink beam, the aperture of a satellite transmitting antenna, the output backoff of a satellite power amplifier, the maximum gain of the satellite transmitting antenna, the aperture of a terminal receiving antenna, the maximum gain of the terminal receiving antenna, a roll-off coefficient, the system noise temperature, the power ratio of a useful signal to polarization interference, the power ratio of a useful signal to third-order intermodulation interference, and the power ratio of a useful signal to adjacent satellite interference in table 2 below.
TABLE 2 satellite System related parameter settings
[Table 2 was published as an image; the parameter values are not recoverable here.]
Step 2, initializing the relevant parameters of the DRL model, where the relevant parameters of the DRL model include at least one of the following: a discount factor, a learning rate, an update interval, the number of times one batch of data updates the networks, a trajectory length, a clip-function clipping range, and two reward design scale parameters (the coefficients ζ and η above).
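These DRL parameters can be grouped into one configuration object; a sketch in which the default values are common PPO choices, not values from the patent (which publishes its settings only in an unrecoverable table image):

```python
from dataclasses import dataclass

@dataclass
class DRLConfig:
    gamma: float = 0.99            # discount factor
    learning_rate: float = 3e-4    # optimizer learning rate
    update_interval: int = 2048    # steps collected between network updates
    updates_per_batch: int = 10    # times one batch of data updates the networks
    trajectory_length: int = 2048  # track length L
    clip_epsilon: float = 0.2      # clip function clipping range
    zeta: float = 1.0              # reward design scale parameter (assumed to be the above zeta)
    eta: float = 1.0               # reward design scale parameter (assumed to be the above eta)
```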
Step 3, constructing a 4-layer fully connected neural network, each layer comprising 10 × N neurons, with a ReLU activation function after each hidden layer to add a nonlinear relation to the data. Note that N is the total number of user beams. The purpose of this step is to build a fully connected neural network structure, independent of the DRL parameter initialization.
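A sketch of this step in PyTorch (assuming each of the 4 layers has width 10·N, the input is the N-dimensional demand state, and the actor's output dimension equals the action dimension; these layout details are assumptions):

```python
import torch.nn as nn

def build_network(n_beams: int, out_dim: int) -> nn.Sequential:
    """4 fully connected layers of 10*N neurons, ReLU after each hidden layer."""
    width = 10 * n_beams
    return nn.Sequential(
        nn.Linear(n_beams, width), nn.ReLU(),  # input: the N beam capacity demands
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, out_dim),             # actor: mean vector mu; critic: out_dim = 1
    )

actor = build_network(n_beams=23, out_dim=2 * 23)  # power + bandwidth means (assumed layout)
critic = build_network(n_beams=23, out_dim=1)      # state value V(s)
```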
Step 4, training the deep neural network: first, T different capacity demand scenarios D are generated as the material for network training, where in the capacity demand D_t generated at each time t, the capacity requirement of each beam in the system follows the uniform distribution U(m, n). The state s_t is input into the actor network, and the network outputs action a_t: a multivariate normal distribution is established from the output mean vector and a given covariance matrix and is sampled to obtain a_t = {P_t, B_t}; the resulting resource allocation scheme is substituted into the link model to calculate r_t. After L groups of data [s_t, a_t, r_t] are obtained, the advantage estimate A_l (l = 1, 2, ..., L) is calculated according to the following formula (6):
A_l = −V(s_l) + r_l + γ·r_{l+1} + … + γ^{L−l+1}·r_{L−1} + γ^{L−l}·V(s_L)    (6)
where V(s_l) is the output of the critic network upon receiving state s_l, and γ is the discount factor. It should be noted that since l is a variable, it can take different values, and the advantage estimation result differs across values; it can therefore be understood that one state corresponds to one advantage estimate (a computational sketch follows below).
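Formula (6) telescopes into a single backward pass over the trajectory; a minimal list-based sketch, with the critic's value of the final state supplied separately as the bootstrap term:

```python
def advantage_estimates(rewards, values, last_value, gamma):
    """Advantage estimates per formula (6): for each step l, the discounted sum
    of the remaining rewards, bootstrapped with the critic's value of the final
    state, minus the baseline V(s_l).

    rewards: the L stored rewards r_l; values: the L critic outputs V(s_l);
    last_value: the bootstrap value V(s_L) of the state after the last stored one.
    """
    advantages = []
    tail = last_value
    for r, v in zip(reversed(rewards), reversed(values)):
        tail = r + gamma * tail       # discounted return-to-go including the bootstrap
        advantages.append(tail - v)   # subtract the value baseline V(s_l)
    advantages.reverse()
    return advantages
```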
Then, the loss value of the actor neural network is calculated according to the following formulas (7) and (8), and the loss value of the critic neural network is calculated according to the following formula (9):

L_actor(θ) = E_l[ min( p_l(θ)·A_l , clip(p_l(θ), 1−ε, 1+ε)·A_l ) ]    (7)
p_l(θ) = π_θ(a_l | s_l) / π_{θ_old}(a_l | s_l)    (8)
L_critic = E_l[ (A_l)² ]    (9)
the above equation (7) can be understood as the first loss function in embodiment 1, the loss value of the operator network (i.e., the first loss value in embodiment 1) is calculated by using the first loss function, the upper and lower bounds of the first loss function are limited by using the clip function, and the parameter of the target operator network is prevented from being updated too much, so that the whole learning and updating process is more stable and has smaller fluctuation. The above equation (9) can be understood as the second loss function in embodiment 1, and the loss value of the criticc network (i.e., the second loss value in embodiment 1) is calculated using the second loss function. And after the first loss value and the second loss value are obtained, updating parameters of the operator network and the critic network by using an Adam optimizer or a gradient descent algorithm. Finally copying the parameter theta of the updated operator network and critic network to thetaoldAnd continuing to interact with the environment. The method enables data obtained by the old network and the environment to be updated for many times for the operator network and the critic network, and can accelerate the learning and training speed. P in formula (8)θ(al|sl) At input state s of the network with the expression parameter thetalTime output action alIs calculated using the following equation (10):
π_θ(a_k | s) = ∏_{x=1}^{y} (1 / √(2π·σ_x²)) · exp( −(a_{k,x} − μ_x)² / (2·σ_x²) )    (10)
where y is the dimension of the action space, a_{k,x} denotes the x-th dimension of action a_k, each action dimension is set to be independently distributed when building the multivariate normal distribution, and μ_x and σ_x are the mean and standard deviation of the x-th dimension of the action.
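A sketch of formulas (7) to (10) in PyTorch: sampling the action from per-dimension independent normals (formula (10)), forming the probability ratio (formula (8)), and computing the two losses (formulas (7) and (9), the latter in the assumed squared-error form); the tensor layouts and the `returns` target are assumptions of this sketch:

```python
import torch
from torch.distributions import Normal

def sample_action(mean, std):
    """Formula (10): one independent normal per action dimension, so the joint
    log-probability is the sum of the per-dimension log-probabilities."""
    dist = Normal(mean, std)          # mu_x and sigma_x for each dimension x
    action = dist.sample()            # a_t = {P_t, B_t}, sampled near the mean
    return action, dist.log_prob(action).sum(-1)

def ppo_losses(new_logp, old_logp, advantages, values, returns, eps):
    """Clipped actor loss (formula (7)) and critic loss (formula (9), assumed MSE).

    new_logp / old_logp: log-probs of the stored actions under theta and theta_old.
    returns: value-function targets (e.g. A_l + V(s_l)); an assumption of this sketch.
    """
    ratio = torch.exp(new_logp - old_logp)               # p_l(theta), formula (8)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # restrict to [1-eps, 1+eps]
    actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    critic_loss = (returns - values).pow(2).mean()
    return actor_loss, critic_loss
```

An Adam optimizer (or plain gradient descent, as the text allows) then steps both networks on these losses, after which θ is copied to θ_old for the next round of interaction.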
Step 5, inputting a beam capacity demand scenario into the trained actor network; the network output is the joint optimization scheme of user beam power and frequency band resource allocation. It should be noted that in practical applications the critic network no longer participates in the computation.
The following example analysis is carried out: power and frequency band resources are allocated for the user beams of a single satellite internet satellite, to simplify the evaluation scenario. Assuming a single internet satellite has 23 user beams, the method provided by the embodiment of the present invention is used to analyze the joint optimization problem of beam power and frequency band resources. To verify the beneficial effects of the embodiment of the present invention, the method is compared with a classical genetic algorithm and an improved genetic algorithm in the same scenario. Because a genetic algorithm can obtain a better solution by increasing the number of iterative optimizations at the cost of simulation time, the embodiment of the present invention simulates the GA and OSGA algorithms with 10 and 20 iterations; the simulation parameters of the algorithms are shown in Table 3:
TABLE 3 Algorithm simulation parameter set-up
[Table 3 was published as images; the parameter values are not recoverable here.]
Using the classical genetic algorithm GA_20, the improved genetic algorithm OSGA_20 (briefly, because a feasible-solution optimization strategy is added, the OSGA algorithm prevents solutions in the solution space from drifting toward worse directions during the crossover and mutation operations of each iteration; compared with the classical genetic algorithm, the probability of obtaining a better solution in each iteration is thus greatly improved, and a better solution can be obtained within a given number of iterations), and the method of the present application, joint optimization of user beam power and frequency band resources was carried out under 50 different beam demand scenarios. The mean UCD_mean of the system's unmet capacity demand, the mean RATIO_mean of UCD to the total demand of all beams, and the mean simulation time T_mean for obtaining an allocation scheme, each averaged over the beam demand scenarios, are listed in Table 4:
TABLE 4 Allocation results of the different algorithms
[Table 4 was published as an image; the numerical results are not recoverable here.]
As can be seen from Table 4, the three algorithms differ in the UCD_mean index. The UCD_mean value of the algorithm provided by the present application lies between those of the OSGA and GA algorithms and is close to the UCD_mean obtained by the OSGA allocation. The accuracy of the method provided by the present application depends on factors such as the sample data set used in the neural network learning process, the neural network structure, and the PPO algorithm parameters. Because the state space and the action space of the technical problem to be solved by the present application are continuous, the state space is huge and the training process can hardly traverse all states; theoretically, therefore, better results can be obtained by further enriching the training data set.
Furthermore, the capacity demand scenarios, i.e., the states, in the system are generated according to a uniform distribution and can therefore be considered continuous. In the PPO algorithm used during training, the direct output of the actor network is the mean vector of a multivariate normal distribution; the multivariate normal distribution is built from this mean vector and a preset covariance matrix, and sampling from the vicinity of the mean in this distribution yields the final output action a_t, the power and bandwidth allocation result required by the system (corresponding to the first step of training the neural network, and steps 8-9 in Table 1); therefore, the actions can also be considered continuous.
In terms of the simulation time index, the iteration time of the GA and OSGA algorithms is far longer than that of the algorithm provided by the present application: the running time of a genetic algorithm is approximately proportional to the number of iterations, and the link calculation model and interference analysis model of the satellite communication system must be called in every iteration. By contrast, each time the method provided by the present application receives the state s_t as input, the neural network directly gives the power and frequency band allocation result of each beam, without complex link calculation and interference analysis, thereby greatly reducing the computation time required and far outperforming the other algorithms in the timeliness of resource allocation calculation.
Simulation results show that the deep reinforcement learning method provided by the embodiment of the present invention can greatly improve the timeliness of resource allocation calculation at the cost of an acceptable loss in the system's unmet capacity demand: the calculation time required for the same beam demand scenario is less than 1/1000 that of the genetic algorithm, and a high-quality joint optimization scheme for user beam power and frequency band resources is obtained.
Example 3:
an embodiment of the present invention provides a beam resource allocation device, which is mainly used to execute the beam resource allocation method provided in the foregoing content of embodiment 1, and the following describes the beam resource allocation device provided in the embodiment of the present invention in detail.
Fig. 4 is a schematic structural diagram of a beam resource allocation apparatus according to an embodiment of the present invention. As shown in fig. 4, the beam resource allocation apparatus mainly includes: an acquisition unit 11 and an input unit 12, wherein:
an obtaining unit 11, configured to obtain beam capacity requirement information of a user terminal;
the input unit 12 is configured to input the beam capacity requirement information to the beam resource allocation model, so as to obtain a beam resource allocation result of the user terminal; the beam resource allocation model is a neural network model obtained by performing model training based on the beam resource allocation results corresponding to the different sample beam capacity requirement information and updating model parameters by adopting a near-end optimization strategy in the model training process.
Through the actions of the obtaining unit 11 and the input unit 12, the embodiment of the present invention can ensure that the obtained beam resource allocation model fits actual conditions by updating the model parameters based on the proximal policy, thereby ensuring the timeliness of the beam resource allocation calculation while also taking the accuracy of the beam resource allocation result into account.
Optionally, the beam resource allocation apparatus further includes: the device comprises a construction unit, an acquisition interaction unit and a first training unit, wherein:
the building unit is used for building an initial beam resource allocation model; the initial beam resource allocation model comprises an agent and an environment module, wherein the agent comprises an initial operator neural network model and an initial critic neural network model;
the interactive unit is used for acquiring different sample beam capacity requirement information of the user terminal, and carrying out interactive operation on the initial actor neural network model and the environment module based on the different sample beam capacity requirement information to obtain at least one strip of state data; the number of the state data is the same as the number of times of interactive operation, and the state data comprises states, actions and reward data corresponding to the states;
and the first training unit is used for training the initial actor neural network model by adopting a near-end optimization strategy based on at least one strip state data until the parameters of the initial actor neural network model are converged to obtain the target actor neural network model.
Optionally, the obtaining interaction unit includes: the device comprises a control execution module, a control output module, a first determination module, a second determination module and a sending module, wherein:
the control execution module is used for inputting the different sample beam capacity requirement information serving as an initial state into the initial actor neural network model in a mode of executing an initial action through the control environment module;
the control output module is used for controlling the initial actor neural network model to output the next action and sending the next action to the environment module;
the first determination module is used for controlling the environment module to determine, based on the next action, the reward and the next state corresponding to the initial state;
the second determination module is used for determining the initial state, the initial action and the reward corresponding to the initial state as a piece of state data;
and the sending module is used for sending the next state to the initial actor neural network model through the environment module, so that the initial actor neural network model continues to interact with the environment module to obtain a plurality of pieces of state data.
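This interaction loop can be sketched as follows; the environment object and its step interface (returning a reward and the next state, as described above) are assumptions for illustration, not an API defined by the patent:

```python
import torch

def collect_rollout(actor, env, initial_state, num_steps):
    """Interact for num_steps steps; one (state, action, reward) piece per step."""
    trajectory = []
    state = initial_state
    for _ in range(num_steps):
        logits = actor(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                        # stochastic policy during training
        reward, next_state = env.step(action.item())  # hypothetical env: reward + next state
        trajectory.append((state, action.item(), reward))
        state = next_state                            # next state drives the next decision
    return trajectory
```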
Optionally, the first training unit comprises: a third determination module, a first calculation module, and a first update determination module, wherein:
the third determining module is used for determining an advantage estimate set based on the at least one piece of state data; each piece of state data corresponds to one advantage estimate;
the first calculation module is used for calculating a first loss value based on the advantage estimate set and a preset first loss function;
and the first update determination module is used for updating the parameters of the initial actor neural network model based on the first loss value, and determining the actor neural network model after the parameter update as the target actor neural network model.
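As an illustration, the advantage estimates and the first loss could be computed as below; generalized advantage estimation (GAE) and the hyperparameter values are common PPO choices assumed here, not mandated by the patent:

```python
import torch

def advantage_estimates(rewards, values, gamma=0.99, lam=0.95):
    """One advantage estimate per piece of state data (GAE form)."""
    advantages, gae = [], 0.0
    for t in reversed(range(len(rewards))):
        # bootstrap the final step with a zero value
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae                      # exponentially weighted sum
        advantages.insert(0, gae)
    return torch.tensor(advantages)

def first_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate loss for the actor; minimized, hence the leading minus."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # probability ratio r_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```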
Optionally, the apparatus further comprises a modification module, wherein:
and the modification module is used for modifying the initial loss function by using an objective function to limit the upper and lower bounds of the initial loss function, to obtain the first loss function.
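This bounding of the initial loss corresponds to the clipped surrogate objective of proximal policy optimization; in the standard notation (the patent does not give the formula explicitly), with probability ratio $r_t(\theta)$, advantage estimate $\hat{A}_t$, and clip parameter $\epsilon$:

```latex
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;
    \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```
Clipping the ratio keeps each policy update close to the previous policy, which is what stabilizes the training described here.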
Optionally, the apparatus further comprises a second training unit, wherein:
and the second training unit is used for training the initial critic neural network model by adopting a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial critic neural network model converge, to obtain a target critic neural network model.
Optionally, the second training unit comprises a second calculation module and a second update determination module, wherein:
the second calculation module is used for calculating a second loss value based on the advantage estimate set and a preset second loss function;
and the second update determination module is used for updating the parameters of the initial critic neural network model based on the second loss value, and determining the critic neural network model after the parameter update as the target critic neural network model.
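A minimal sketch of such a second loss follows; the mean-squared-error form and the return target (advantage estimate plus the value recorded at rollout time) are standard PPO choices assumed here for illustration:

```python
import torch

def second_loss(critic, states, advantages, old_values):
    """Value-function loss for the critic."""
    returns = advantages + old_values          # empirical return targets
    values = critic(states)                    # fresh value predictions
    return torch.mean((values - returns) ** 2)
```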
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In an optional embodiment, an electronic device is further provided, which includes a memory and a processor; the memory stores a computer program operable on the processor, and the processor, when executing the computer program, implements the steps of the method in the foregoing method embodiment.
In an alternative embodiment, the present embodiment also provides a computer-readable medium having a non-volatile program code executable by a processor, wherein the program code causes the processor to perform the method of the above method embodiment.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted", "connected", and "coupled" are to be construed broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intervening medium, or internal between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division into units is only one kind of logical division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through communication interfaces, and may be electrical, mechanical, or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied as a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may, within the technical scope of the present disclosure, modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or substitute equivalents for some technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and shall all be covered by the protection scope of the present invention.

Claims (10)

1. A method for allocating beam resources, comprising:
acquiring beam capacity demand information of a user terminal;
inputting the beam capacity demand information into a beam resource allocation model to obtain a beam resource allocation result of the user terminal; the beam resource allocation model is a neural network model obtained by performing model training based on different sample beam capacity requirement information and beam resource allocation results corresponding to the different sample beam capacity requirement information, with model parameters updated by a proximal policy optimization strategy during the model training process.
2. The beam resource allocation method according to claim 1, further comprising:
constructing an initial beam resource allocation model; the initial beam resource allocation model comprises an agent and an environment module, wherein the agent comprises an initial actor neural network model and an initial critic neural network model;
acquiring different sample beam capacity requirement information of a user terminal, and performing interaction operations between the initial actor neural network model and the environment module based on the different sample beam capacity requirement information to obtain at least one piece of state data; the number of pieces of state data is the same as the number of interaction operations, and each piece of state data comprises a state, an action, and the reward corresponding to the state;
and training the initial actor neural network model by adopting a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial actor neural network model converge, to obtain a target actor neural network model.
3. The method of claim 2, wherein the performing interaction operations between the initial actor neural network model and the environment module to obtain at least one piece of state data comprises:
inputting the different sample beam capacity requirement information serving as an initial state into the initial actor neural network model in a mode of controlling the environment module to execute an initial action;
controlling the initial actor neural network model to output a next action and sending the next action to the environment module;
based on the next action, controlling the environment module to determine the reward and the next state corresponding to the initial state;
determining the initial state, the initial action and a reward corresponding to the initial state as a piece of state data;
and sending the next state to the initial actor neural network model through the environment module so as to enable the initial actor neural network model to continuously interact with the environment module to obtain a plurality of pieces of state data.
4. The beam resource allocation method according to claim 2, wherein the training of the initial actor neural network model by adopting a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial actor neural network model converge, to obtain a target actor neural network model, comprises:
determining an advantage estimate set based on the at least one piece of state data; each piece of state data corresponds to one advantage estimate;
calculating a first loss value based on the advantage estimate set and a preset first loss function;
updating the parameters of the initial actor neural network model based on the first loss value, and determining the actor neural network model after parameter updating as the target actor neural network model.
5. The beam resource allocation method according to claim 4, further comprising, before calculating the first loss value based on the advantage estimate set and the preset first loss function:
modifying the initial loss function by using an objective function to limit the upper and lower bounds of the initial loss function, to obtain the first loss function.
6. The beam resource allocation method according to claim 4, further comprising:
and training the initial critic neural network model by adopting a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial critic neural network model converge, to obtain a target critic neural network model.
7. The method according to claim 6, wherein the training of the initial critic neural network model by adopting a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial critic neural network model converge, to obtain a target critic neural network model, comprises:
calculating a second loss value based on the advantage estimate set and a preset second loss function;
and updating the parameters of the initial critic neural network model based on the second loss value, and determining the critic neural network model after parameter updating as the target critic neural network model.
8. A beam resource allocation apparatus, comprising:
an acquisition unit, configured to acquire beam capacity requirement information of a user terminal;
an input unit, configured to input the beam capacity requirement information into a beam resource allocation model to obtain a beam resource allocation result of the user terminal; the beam resource allocation model is a neural network model obtained by performing model training based on different sample beam capacity requirement information and beam resource allocation results corresponding to the different sample beam capacity requirement information, with model parameters updated by a proximal policy optimization strategy during the model training process.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the method of any of claims 1 to 7.
CN202210231703.XA 2022-03-10 2022-03-10 Beam resource allocation method and device Active CN114599100B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210231703.XA CN114599100B (en) 2022-03-10 2022-03-10 Beam resource allocation method and device

Publications (2)

Publication Number Publication Date
CN114599100A true CN114599100A (en) 2022-06-07
CN114599100B CN114599100B (en) 2024-01-19

Family

ID=81816755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210231703.XA Active CN114599100B (en) 2022-03-10 2022-03-10 Beam resource allocation method and device

Country Status (1)

Country Link
CN (1) CN114599100B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10666342B1 (en) * 2019-05-01 2020-05-26 Qualcomm Incorporated Beam management using adaptive learning
CN112134614A (en) * 2020-10-26 2020-12-25 中国人民解放军32039部队 Downlink carrier resource allocation method and system for multi-beam communication satellite
CN113644964A (en) * 2021-08-06 2021-11-12 北京邮电大学 Multi-dimensional resource joint allocation method of multi-beam satellite same-frequency networking system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ma Shijun: "Research on Intelligent Management Technology for High-Throughput Satellite Spectrum Resources", Master's Electronic Journals, vol. 2022

Also Published As

Publication number Publication date
CN114599100B (en) 2024-01-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant