CN114599100B - Beam resource allocation method and device - Google Patents

Beam resource allocation method and device

Info

Publication number
CN114599100B
CN114599100B (application CN202210231703.XA)
Authority
CN
China
Prior art keywords
neural network
network model
initial
resource allocation
model
Prior art date
Legal status
Active
Application number
CN202210231703.XA
Other languages
Chinese (zh)
Other versions
CN114599100A (en)
Inventor
Wang Lei
Zhao Zhongtian
Xu Jing
Li Fan
Fan Simeng
Current Assignee
32039 Unit Of Chinese Pla
Xian Jiaotong University
Original Assignee
32039 Unit Of Chinese Pla
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by 32039 Unit Of Chinese Pla, Xian Jiaotong University filed Critical 32039 Unit Of Chinese Pla
Priority to CN202210231703.XA priority Critical patent/CN114599100B/en
Publication of CN114599100A publication Critical patent/CN114599100A/en
Application granted granted Critical
Publication of CN114599100B publication Critical patent/CN114599100B/en

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W72/00Local resource management
    • H04W72/04Wireless resource allocation
    • H04W72/044Wireless resource allocation based on the type of the allocated resource
    • H04W72/046Wireless resource allocation based on the type of the allocated resource the resource being in the space domain, e.g. beams
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/06Testing, supervising or monitoring using simulated traffic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W84/00Network topologies
    • H04W84/02Hierarchically pre-organised networks, e.g. paging networks, cellular networks, WLAN [Wireless Local Area Network] or WLL [Wireless Local Loop]
    • H04W84/04Large scale networks; Deep hierarchical networks
    • H04W84/06Airborne or Satellite Networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention provides a beam resource allocation method and device in the technical field of satellite communications, comprising the following steps: first, beam capacity demand information of a user terminal is acquired; then, the beam capacity demand information is input into a beam resource allocation model to obtain a beam resource allocation result for the user terminal. The beam resource allocation model is a neural network model obtained by training on different samples of beam capacity demand information and their corresponding beam resource allocation results, with the model parameters updated during training by proximal policy optimization. By updating the model parameters with a proximal policy, the invention ensures that the resulting beam resource allocation model fits reality, guaranteeing the timeliness of the beam resource allocation computation while also preserving the accuracy of the allocation result.

Description

Beam resource allocation method and device
Technical Field
The present invention relates to the field of satellite communications technologies, and in particular, to a method and an apparatus for allocating beam resources.
Background
At present, a satellite internet system needs to allocate power and frequency band resources to each user beam rapidly and efficiently according to user demands, so as to meet user capacity demands as far as possible and adapt to scenarios in which those demands change dynamically over time. Existing meta-heuristic resource allocation algorithms can solve this problem, but they usually require many iterations to converge, which is time-consuming and impairs the timeliness of producing a resource allocation scheme. To address the timeliness problem, researchers have studied resource allocation in satellite communication systems using reinforcement learning and proposed reinforcement-learning-based communication satellite resource allocation methods, many of which use the DQN (Deep Q Network) algorithm. However, DQN requires discretization when facing a continuous resource allocation problem, which degrades the accuracy of the allocation result, while the degree of discretization also increases the computational complexity of the algorithm and may even cause the curse of dimensionality.
Disclosure of Invention
The invention aims to provide a beam resource allocation method and device that solve the technical problem in the prior art of poor accuracy of allocation results in scenarios with timeliness requirements.
In a first aspect, the present invention provides a beam resource allocation method, including: acquiring beam capacity demand information of a user terminal; and inputting the beam capacity demand information into a beam resource allocation model to obtain a beam resource allocation result for the user terminal. The beam resource allocation model is a neural network model obtained by training on different samples of beam capacity demand information and their corresponding beam resource allocation results, with the model parameters updated during training by a proximal policy optimization strategy.
Further, the beam resource allocation method further includes: constructing an initial beam resource allocation model, which comprises an agent and an environment module, the agent comprising an initial actor neural network model and an initial critic neural network model; acquiring different samples of beam capacity demand information of the user terminal, and performing interactive operations between the initial actor neural network model and the environment module based on those samples to obtain at least one piece of state data, where the number of pieces of state data equals the number of interactive operations, and each piece of state data contains a state, an action and the reward corresponding to the state; and, based on the at least one piece of state data, training the initial actor neural network model with a proximal policy optimization strategy until its parameters converge, obtaining a target actor neural network model.
Further, performing interactive operations between the initial actor neural network model and the environment module to obtain at least one piece of state data includes: inputting the different samples of beam capacity demand information into the initial actor neural network model as an initial state by controlling the environment module to execute an initial action; controlling the initial actor neural network model to output a next action and sending it to the environment module; controlling the environment module to determine, based on the next action, a reward corresponding to the initial state and a next state; determining the initial state, the initial action and the reward corresponding to the initial state as one piece of state data; and sending the next state to the initial actor neural network model through the environment module, so that the initial actor neural network model continues to interact with the environment module, obtaining multiple pieces of state data.
Further, training the initial actor neural network model with a proximal policy optimization strategy based on the at least one piece of state data until its parameters converge, obtaining a target actor neural network model, includes: determining an advantage estimate set based on the at least one piece of state data, where each piece of state data corresponds to one advantage estimate; calculating a first loss value based on the advantage estimate set and a preset first loss function; and updating the parameters of the initial actor neural network model based on the first loss value, and determining the actor neural network model with updated parameters as the target actor neural network model.
Further, before calculating the first loss value based on the advantage estimate set and the preset first loss function, the method further includes: modifying an initial loss function by limiting its upper and lower bounds with an objective function, thereby obtaining the first loss function.
Further, the beam resource allocation method further includes: training the initial critic neural network model with a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial critic neural network model converge, obtaining a target critic neural network model.
Further, training the initial critic neural network model with a proximal policy optimization strategy based on the at least one piece of state data until its parameters converge, obtaining a target critic neural network model, includes: calculating a second loss value based on the advantage estimate set and a preset second loss function; and updating the parameters of the initial critic neural network model based on the second loss value, and determining the critic neural network model with updated parameters as the target critic neural network model.
In a second aspect, the present invention provides a beam resource allocation apparatus, including: an acquisition unit for acquiring beam capacity demand information of a user terminal; and an input unit for inputting the beam capacity demand information into a beam resource allocation model to obtain a beam resource allocation result for the user terminal, where the beam resource allocation model is a neural network model obtained by training on different samples of beam capacity demand information and their corresponding beam resource allocation results, with the model parameters updated during training by a proximal policy optimization strategy.
In a third aspect, the present invention further provides an electronic device including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor, when executing the computer program, implements the steps of the beam resource allocation method.
In a fourth aspect, the present invention also provides a computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to perform the beam resource allocation method.
The invention provides a beam resource allocation method and device comprising the following steps: first, beam capacity demand information of a user terminal is acquired; then, the beam capacity demand information is input into a beam resource allocation model to obtain a beam resource allocation result for the user terminal. The beam resource allocation model is a neural network model obtained by training on different samples of beam capacity demand information and their corresponding beam resource allocation results, with the model parameters updated during training by a proximal policy optimization strategy. By updating the model parameters with a proximal policy, the invention ensures that the resulting beam resource allocation model fits reality, guaranteeing the timeliness of the beam resource allocation computation while also preserving the accuracy of the allocation result.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a beam resource allocation method according to an embodiment of the present invention;
fig. 2 is a flowchart of another beam resource allocation method according to an embodiment of the present invention;
fig. 3 is a flowchart of another beam resource allocation method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a beam resource allocation apparatus according to an embodiment of the present invention.
Icon:
11-an acquisition unit; 12-input unit.
Detailed Description
The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In recent years, satellite internet systems typified by constellations such as Starlink, OneWeb and Kuiper have become a focus of attention in the field of global space information. On the one hand, as satellite internet systems serve users, the requirements on the system's quality of service grow ever higher: user demand for communication capacity is increasing rapidly worldwide, and users are geographically non-uniformly distributed and dynamically variable. On the other hand, with the rapid development of satellite payload technology in the satellite internet space segment, the user beam scale keeps growing and flexible communication payloads (payloads supporting on-orbit adjustment of payload parameters) are widely deployed, making it possible to dynamically allocate beam power and bandwidth resources according to user demands. Together these factors make resource management in a satellite internet system far more difficult and complex than in a traditional communication satellite system. Therefore, to improve the quality of service of a satellite internet system and fully exploit the efficiency of its limited resources, the accuracy and timeliness of user beam resource allocation must be improved.
That is, the satellite internet system needs to allocate power and frequency band resources to each user beam rapidly and efficiently according to user demands, so as to meet user capacity demands as far as possible and adapt to their rapid dynamic changes over time. Existing meta-heuristic resource allocation algorithms can solve this problem, but they usually require many iterations to converge, which is time-consuming and impairs the timeliness of producing a resource allocation scheme. Reinforcement learning, a technique that has developed rapidly in recent years, has shown excellent performance on many decision-making problems: a trained reinforcement learning model can adapt to different scenarios and efficiently provide an optimized decision scheme. Researchers have therefore studied resource allocation in satellite communication systems with reinforcement learning and proposed reinforcement-learning-based communication satellite resource allocation methods. However, these methods generally assume that the decision variables of the problem are discrete, or discretize continuous variables before further processing, so the quality of the resource allocation scheme depends on the degree of discretization of the decision variables. With the existing discretization approach, when the discretization of the user beam power and frequency band decision variables is too coarse, it is difficult to reach the optimal solution of the original continuous solution space, i.e., the optimal resource allocation scheme cannot be obtained; when the discretization is too fine, the reinforcement learning action space expands markedly and the computational cost rises.
Therefore, it is necessary to design a more efficient and rapid joint optimization method and device for the problem of beam power and frequency band resource allocation of new generation satellite internet system users.
Based on the above, the invention aims to provide a beam resource allocation method and device that achieve reasonable allocation of the beam power and frequency band resources of satellite internet system users, guaranteeing timeliness while also preserving the accuracy of the beam resource allocation result. To facilitate understanding of this embodiment, the beam resource allocation method disclosed in the embodiment of the present invention is first described in detail.
Example 1:
according to an embodiment of the present invention, there is provided an embodiment of a beam resource allocation method, it being noted that the steps shown in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that herein.
Fig. 1 is a flowchart of a beam resource allocation method according to an embodiment of the present invention, as shown in fig. 1, where the method includes the following steps:
step S101, acquiring beam capacity requirement information of the user terminal.
Step S102, the beam capacity requirement information is input into a beam resource allocation model to obtain a beam resource allocation result of the user terminal.
In the embodiment of the invention, the user terminal can be understood as a satellite internet user, "beam" is short for user beam, and the beam resources in a satellite internet system comprise power and bandwidth; the beam resource allocation result therefore comprises a power allocation result and a bandwidth allocation result. Accordingly, the embodiment of the invention is essentially a method for fast joint optimization of the beam power and frequency band resources of satellite internet users, and belongs to the field of satellite internet system resource management technology. To model the practical problem addressed by the embodiment, the power and bandwidth allocated to each user beam of the satellite internet system are taken as the decision variables. It should be noted that both power and bandwidth are continuous variables.
To facilitate understanding of the beam resource allocation method according to the embodiment of the present invention, the embodiment is developed on the basis of a satellite internet system. The satellite internet system (hereinafter referred to as the system) consists of a space segment, a ground segment and a user segment. The space segment consists of a low-orbit constellation, which generally covers the whole globe in a multi-beam mode. The ground segment consists of gateway stations and an operation control center, mainly providing the system's feeder links and carrying user network access and system operation and maintenance control. The user segment comprises user terminals of various types, operating in the Ka band with circular polarization, distributed within the coverage areas of the system's beams. The number of beams is N, and the set of all beams is denoted N = {1, 2, ..., N}. Assuming one user terminal at each beam center, its capacity demand is the sum of the capacity demands of all user terminals within the beam coverage area. The beam downlink channel is an additive white Gaussian noise channel, and the feeder link is assumed noiseless. The system adopts a four-color frequency reuse scheme composed of two frequency bands and two polarization modes (left-hand and right-hand circular polarization). The system adopts the second-generation satellite television broadcast extension standard (DVB-S2X), which includes an adaptive modulation and coding strategy, i.e., the system selects the optimal modulation and coding scheme (MODCOD) according to the signal-to-noise ratio of the link.
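As an illustrative aside, the capacity a beam delivers for a given power and bandwidth allocation could be approximated with a Shannon-style link budget. Note this is a stand-in: the patent actually maps the link SNR to a DVB-S2X MODCOD, and the gain and noise figures below are invented for illustration only.

```python
import math

def beam_capacity(power_w, bandwidth_hz, gain=1e9, noise_density=4e-21):
    """Idealized link budget: C = B * log2(1 + SNR), SNR = P*G / (N0*B).

    Shannon capacity is used here only as a proxy for the patent's
    DVB-S2X MODCOD selection; gain and noise_density are illustrative.
    """
    snr = power_w * gain / (noise_density * bandwidth_hz)
    return bandwidth_hz * math.log2(1.0 + snr)

# e.g. 10 W over a 250 MHz sub-band of the four-color reuse plan
capacity_gbps = beam_capacity(10.0, 250e6) / 1e9
print(f"{capacity_gbps:.1f} Gbps")
```

More power or more bandwidth yields more capacity, which is exactly the trade-off the joint allocation must balance across beams.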
The UCD (Unmet Capacity Demand) of the satellite internet system is a key index of system performance, defined by the following formula (1):

UCD_i = max(D_i − R_i, 0)    (1)

where D_i is the capacity demand of the user terminals within the coverage area of beam i (a scalar), and R_i is the capacity that the satellite internet system provides to beam i after the power and bandwidth allocation of all beams is completed and the link calculation is performed. As can be seen from formula (1): if the capacity demand D_i for beam i exceeds the capacity R_i that the beam can provide, the unmet capacity demand of beam i is UCD_i = D_i − R_i; otherwise the unmet capacity demand of beam i is UCD_i = 0, indicating that the beam can fully meet all user terminal capacity demands within its coverage area.
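As a concrete illustration, the per-beam shortfall of formula (1) and its system-level sum can be computed as follows (the beam count and Gbps figures are invented examples, not values from the patent):

```python
def unmet_capacity_demand(demands, provided):
    """Per-beam unmet capacity demand: UCD_i = max(D_i - R_i, 0)."""
    return [max(d - r, 0.0) for d, r in zip(demands, provided)]

def total_ucd(demands, provided):
    """System-level UCD: sum of per-beam shortfalls over all beams."""
    return sum(unmet_capacity_demand(demands, provided))

# Example: three beams with demands D and provided capacities R (Gbps)
D = [1.2, 0.8, 2.0]
R = [1.0, 1.0, 1.5]
print(unmet_capacity_demand(D, R))  # beam 2 is fully served, so its UCD is 0
print(total_ucd(D, R))
```

Minimizing the total UCD over the continuous power and bandwidth variables is the optimization objective the reinforcement learning agent pursues.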
The joint allocation of power and bandwidth in a satellite internet system over a continuous space is NP-hard, and existing methods struggle to meet the timeliness requirements of practical application scenarios. The embodiment of the invention adopts deep reinforcement learning to jointly optimize the two continuous variables, power and bandwidth, and, to allocate them reasonably in the continuous space, updates the policy of the neural network with the Proximal Policy Optimization (PPO) algorithm based on an Actor-Critic framework. Specifically, the embodiment of the present invention is divided into a learning part and a working part; steps S101 to S102 above form the working part, and the beam capacity demand information in step S101 is the actual beam capacity demand information applied in the working part. The learning part comprises the operations performed before step S101, detailed in steps S103 to S107 below. The multiple pieces of beam capacity demand information used by the learning part are the different samples of beam capacity demand information. The beam resource allocation model is a neural network model obtained by training on different samples of beam capacity demand information and their corresponding beam resource allocation results, with the model parameters updated during training by a proximal policy optimization strategy.
Overall, the purpose of the embodiments of the present invention is fast joint optimization of satellite internet user beam power and frequency band resources. The beam resource allocation model in the embodiment of the invention is based on a deep reinforcement learning framework, which avoids the long computation times of intelligent optimization algorithms such as classical genetic algorithms and can meet the timeliness requirements of system resource allocation. At the same time, the parameters of the neural network in the deep reinforcement learning framework are updated with proximal policy optimization, overcoming the limitation of existing deep reinforcement learning methods that discretize the resource allocation decision variables, and finally yielding a high-quality joint optimization scheme for user beam power and frequency band resources. In other words, by executing steps S101 to S102 and updating the model parameters with a proximal policy, the embodiment of the invention ensures that the obtained beam resource allocation model fits reality, guaranteeing the timeliness of the beam resource allocation computation while also preserving the accuracy of the allocation result.
In an alternative embodiment, as shown in fig. 2, the method further comprises the following steps S103 to S105. Step S103: construct an initial beam resource allocation model, which comprises an agent and an environment module, the agent comprising an initial actor neural network model and an initial critic neural network model. Step S104: acquire different samples of beam capacity demand information of the user terminal, and perform interactive operations between the initial actor neural network model and the environment module based on those samples to obtain at least one piece of state data, where the number of pieces of state data equals the number of interactive operations, and each piece of state data contains a state, an action and the reward corresponding to the state. Step S105: based on the at least one piece of state data, train the initial actor neural network model with a proximal policy optimization strategy until its parameters converge, obtaining a target actor neural network model.
The core idea of deep reinforcement learning in the embodiment of the invention is to model the satellite internet system as an agent and to model factors such as user demands and service conditions as the environment (i.e., the environment module above). Continuous interactive learning between the agent and the environment pursues the goal of maximizing long-term return. As described above, the embodiment of the present invention is divided into a learning part and a working part: in the learning part, the agent is provided with data for multiple beam demands (i.e., the D_i above) and learns to map beam demand scenarios to power and frequency band resource allocation results; in the working part, the trained deep neural network (i.e., the target actor neural network model) rapidly produces a joint optimization scheme for user beam power and frequency band resources without repeated iterative optimization, which gives the method its timeliness.
The beam resource allocation model comprises a target actor neural network model and a target critic neural network model; the target actor neural network model is established by executing steps S103 to S105. The model building process is described in detail below. In step S104, performing interactive operations between the initial actor neural network model and the environment module to obtain at least one piece of state data includes the following steps S201 to S205. Step S201: input the different samples of beam capacity demand information into the initial actor neural network model as an initial state by controlling the environment module to execute an initial action. Step S202: control the initial actor neural network model to output a next action and send it to the environment module. Step S203: based on the next action, control the environment module to determine a reward corresponding to the initial state and a next state. Step S204: determine the initial state, the initial action and the reward corresponding to the initial state as one piece of state data. Step S205: send the next state to the initial actor neural network model through the environment module, so that the initial actor neural network model continues to interact with the environment module, obtaining multiple pieces of state data. The purpose of obtaining multiple pieces of state data is to provide the data basis for the proximal policy optimization strategy.
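Steps S201 to S205 amount to a rollout-collection loop between the actor and the environment module. The sketch below uses an invented toy environment and a placeholder uniform policy (the patent's actual environment derives rewards from the link budget and UCD, and the actor is a trained neural network):

```python
import random

class ToyBeamEnv:
    """Hypothetical stand-in for the patent's environment module:
    state = per-beam capacity demands, action = per-beam resource fractions."""
    def __init__(self, n_beams, seed=0):
        self.n = n_beams
        self.rng = random.Random(seed)

    def reset(self):
        # initial state: sampled beam capacity demands
        self.state = [self.rng.uniform(0.5, 2.0) for _ in range(self.n)]
        return self.state

    def step(self, action):
        # reward: negative total unmet capacity demand (toy link budget)
        provided = [a * 2.0 for a in action]
        reward = -sum(max(d - r, 0.0) for d, r in zip(self.state, provided))
        next_state = [self.rng.uniform(0.5, 2.0) for _ in range(self.n)]
        self.state = next_state
        return next_state, reward

def collect_rollout(env, policy, steps):
    """Record one (state, action, reward) tuple per interaction (S201-S205)."""
    data = []
    state = env.reset()
    for _ in range(steps):
        action = policy(state)
        next_state, reward = env.step(action)
        data.append((state, action, reward))
        state = next_state
    return data

uniform_policy = lambda s: [1.0 / len(s)] * len(s)  # placeholder actor
rollout = collect_rollout(ToyBeamEnv(4), uniform_policy, steps=8)
print(len(rollout))  # one piece of state data per interactive operation
```

The number of collected tuples equals the number of interactive operations, matching the claim's definition of state data.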
In an optional embodiment, step S105, training the initial actor neural network model with a proximal policy optimization strategy based on at least one piece of state data until its parameters converge to obtain a target actor neural network model, includes the following steps S301 to S303. Step S301: determine an advantage estimate set based on the at least one piece of state data, where each piece of state data corresponds to one advantage estimate. Step S302: calculate a first loss value based on the advantage estimate set and a preset first loss function. Step S303: update the parameters of the initial actor neural network model based on the first loss value, and determine the actor neural network model with updated parameters as the target actor neural network model. Steps S301 to S303 are the concrete implementation of the proximal policy optimization strategy: they update the parameters of the initial actor neural network model and thereby provide a sound and effective target actor neural network model for beam resource allocation.
In an alternative embodiment, before calculating the first loss value based on the advantage estimate set and the preset first loss function in step S302, the method further comprises: step S304, modifying the initial loss function by limiting its upper and lower bounds with an objective function, so as to obtain the first loss function. The objective function may be referred to as a clip function. In the embodiment of the invention, this design of the first loss function avoids excessively large updates to the parameters of the initial actor neural network model, so that the whole learning and update process of the model is more stable and fluctuates less.
In an alternative embodiment, as shown in fig. 2, the beam resource allocation method further includes: step S106, training the initial critic neural network model by adopting a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial critic neural network model converge, so as to obtain the target critic neural network model. As shown in fig. 2, the embodiment of the present invention may further include step S107, in which the target actor neural network model and the target critic neural network model form the beam resource allocation model.
Step S106, training the initial critic neural network model by adopting a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial critic neural network model converge, to obtain the target critic neural network model, includes: step S401, calculating a second loss value based on the advantage estimate set and a preset second loss function; and step S402, updating the parameters of the initial critic neural network model based on the second loss value, and determining the critic neural network model after the parameter update as the target critic neural network model. The initial critic neural network model is used for determining the state value function corresponding to each piece of state data, and the state value function is used for calculating the corresponding advantage estimate. Thus, step S301 above, determining an advantage estimate set based on the at least one piece of state data, includes the following steps S501 to S502: step S501, inputting the at least one piece of state data into the initial critic neural network model to obtain the state value function corresponding to the at least one piece of state data; step S502, calculating the advantage estimate corresponding to each piece of state data according to the state value function corresponding to the at least one piece of state data, to obtain the advantage estimate set.
The embodiment of the invention discloses a fast joint optimization method for the user beam power and frequency band resources of a satellite internet system. Based on a deep reinforcement learning framework, it solves the problem of the long time consumed by intelligent optimization algorithms, such as the classical genetic algorithm, in performing resource allocation calculations; it updates the parameters of the neural network using a proximal policy optimization technique, overcoming the limitation of existing reinforcement learning methods that must discretize the resource allocation decision variables. The deep reinforcement learning method provided by the embodiment of the invention can greatly improve the timeliness of resource allocation calculation, at the cost of an acceptable increase in the system's unmet capacity demand, and can obtain a high-quality joint optimization scheme for user beam power and frequency band resources.
Example 2:
A flow chart of another beam resource allocation method provided by the embodiment of the invention is shown in fig. 3. For ease of understanding, the embodiment of the invention is described in two parts:
The first part defines the state, action, and reward for deep reinforcement learning as follows:
state s t : the status is the input to the neural network (i.e., the actor network in fig. 3) and is the basis for the agent to take action. In the modeling process of the embodiment of the invention, the factors influencing the decision of the agent are the capacity requirements of each beam, so that the factors can be used as the states in a Deep Reinforcement Learning (DRL) model (Deep Reinforcement Learning, deep learning network) as shown in the following formula (2):
s_t = D_t = [D_{1,t}, D_{2,t}, …, D_{N,t}]    (2)
wherein D_t is a matrix composed of multiple scalars, each scalar D_{i,t} being the capacity requirement of the user terminals for beam i at time t.
Action a_t: the action is the input to the environment and the output produced by the agent after observing the state. In the joint allocation problem of satellite internet user beam power and bandwidth to be solved by the embodiment of the invention, the agent allocates resources to each beam by analyzing the user terminals' capacity requirement for each beam at the current moment, and then gives the power and bandwidth allocation scheme of each beam. The action can therefore be defined according to formula (3):
a_t = {P_t, B_t}    (3)
wherein P_t is a matrix composed of multiple scalars P_{i,t}, i.e., P_t = [P_{1,t}, P_{2,t}, …, P_{N,t}], where P_{i,t} is the power allocated by the satellite internet system to beam i at time t; and B_t is a matrix composed of multiple scalars B_{i,t}, i.e., B_t = [B_{1,t}, B_{2,t}, …, B_{N,t}], where B_{i,t} is the bandwidth allocated by the satellite internet system to beam i at time t.
Because the satellite internet system user beams considered by the embodiment of the invention adopt a four-color frequency reuse scheme, co-polarized adjacent beams are jointly limited by the total bandwidth. In order to reduce the UCD as much as possible and to reduce the output dimension of the neural network, thereby facilitating the learning convergence of the agent, the embodiment of the invention satisfies the bandwidth limitation through the following formula (4):
BW_m + BW_n = BW_total, ∀(m, n) ∈ AP_p    (4)

wherein BW_m represents the bandwidth allocated to beam m in the set of adjacent beams; BW_n represents the bandwidth allocated to beam n, adjacent to beam m, in the set of adjacent beams; BW_total represents the total bandwidth available to the system user beams; and AP_p represents the set of adjacent beams with polarization p. BW_total and AP_p are known quantities, while BW_m and BW_n are unknowns. Therefore, in the beam set N, only one of each pair of co-polarized adjacent beams needs to be explicitly allocated bandwidth resources in order to obtain the bandwidth allocation results of all beams at the current moment. Accordingly, the reinforcement learning reward function r_t is designed specifically for the technical problem solved by the embodiment of the invention: r_t takes values no greater than 0; the closer r_t is to 0, the better the policy, and the smaller r_t is, the worse the policy.
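The complement structure of formula (4) — decide the bandwidth of one beam in each co-polarized adjacent pair, and derive the partner beam's bandwidth from the total — can be sketched as follows. This is a minimal illustration; the variable names are assumptions, not identifiers from the original.

```python
def complete_bandwidth_allocation(decided_bw, adjacent_pairs, bw_total):
    """Given the bandwidth decided for one beam of each co-polarized
    adjacent pair (m, n), derive the partner beam's bandwidth as
    BW_n = BW_total - BW_m, so that formula (4) holds for every pair."""
    allocation = dict(decided_bw)          # beam index -> decided bandwidth
    for m, n in adjacent_pairs:
        allocation[n] = bw_total - allocation[m]
    return allocation
```

Halving the number of bandwidth decisions in this way is what reduces the output dimension of the neural network, as the text notes.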
Reward r_t: the reward is the feedback given by the environment to the agent after the agent observes state s_t and outputs action a_t. After the agent takes action a_t, it needs to know whether this action satisfies or approaches the optimization objective, so as to decide whether to reinforce or suppress action a_t in later action selection, thereby better meeting the system design requirements, learning the best joint power and bandwidth allocation policy, and minimizing the UCD of the system. The reward r_t, as the environment's feedback on action a_t, is an index for evaluating the quality of action a_t. The optimization objective in the embodiment of the invention is to minimize the UCD of the system, so the reward r_t should be related to the optimization objective function. Moreover, since it was found in practical implementation that performing additional operations on the neural network output to enforce the single-beam power and bandwidth limits hinders the learning convergence of the neural network, the power and bandwidth limits are instead incorporated into the design of r_t. The calculation of r_t is designed as the following formula (5):
wherein P_{i,t} represents the power allocated to beam i at time t, P_max represents the maximum power that can be allocated to any beam, B_{i,t} is the bandwidth allocated by the satellite internet system to beam i at time t, and ζ and η are two different coefficients, selected empirically through multiple trial calculations.
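Formula (5) itself did not survive reproduction here. The surrounding description implies a reward equal to the negative unmet capacity demand minus penalty terms, weighted by ζ and η, for violating the power and bandwidth limits. The following is a hypothetical penalty form consistent with that description — an assumption for illustration, not the patent's exact formula:

```python
def reward(ucd, powers, bandwidths, p_max, bw_total, zeta, eta):
    """Hypothetical reward: negative unmet capacity demand (UCD) minus
    weighted penalties for exceeding the per-beam power limit and the
    total-bandwidth limit. The exact formula (5) is not reproduced in
    the source text; this penalty structure is an assumption."""
    power_violation = sum(max(0.0, p - p_max) for p in powers)
    bw_violation = max(0.0, sum(bandwidths) - bw_total)
    return -(ucd + zeta * power_violation + eta * bw_violation)
```

Any such form satisfies the stated properties: r_t is never greater than 0, and values closer to 0 indicate a better policy.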
As can be seen from fig. 3, during the fast joint optimization of the user beam power and frequency band resources of the satellite internet system, the agent first observes the current environment state s_t; the actor network then takes s_t as input and outputs action a_t, which acts on the environment, changing the state to s_{t+1} and producing reward r_t. The PPO algorithm module (PPO being a component of the DRL framework) collects and stores the states, rewards, and other information generated in this process. After a certain interval, each of the L stored pieces of data is input one by one into the critic network to obtain the corresponding state value function; the corresponding advantage estimates, the first loss value, and the second loss value are then calculated, and finally the Adam optimizer is used to update the parameters of the actor network so that the actions output by the actor neural network better meet the optimization objective. It should be noted that the L states are not input simultaneously: the role of the critic network is to give a corresponding state value function for each of the L states, and the states are input sequentially one by one.
In the second part, the implementation flow of the deep-reinforcement-learning-based fast joint optimization of user beam power and frequency band resources is consistent with the pseudocode of the algorithm given in table 1. Considering that the PPO algorithm is a policy gradient algorithm based on the actor-critic framework and performs well on continuous action space problems, the embodiment of the invention designs a PPO algorithm to update the parameters of the deep neural network. The main implementation flow of the algorithm is shown in table 1:
Table 1. Pseudocode of the deep-reinforcement-learning-based fast joint optimization algorithm for satellite internet user beam power and frequency band resources
As can be seen from table 1, in step 1, relevant parameters of the satellite system are initialized, including, but not limited to, the number of beams, the satellite orbit height, the total satellite load power, the maximum beam power, the beam center frequency, the total available downlink bandwidth, the satellite transmitting antenna caliber, the satellite power amplifier output back-off, the satellite transmitting antenna maximum gain, the terminal receiving antenna caliber, the terminal receiving antenna maximum gain, the roll-off coefficient, the system noise temperature, the ratio of the useful signal to the polarized interference power, the ratio of the useful signal to the third-order intermodulation interference power, and the ratio of the useful signal to the adjacent satellite interference power in table 2 below.
Table 2. Relevant parameter settings of the satellite system
Step 2, initializing the relevant parameters of the DRL model, which include at least one of the following: discount factor, learning rate, update interval, number of network updates per set of data, trajectory length, clip function clipping range, and the reward design scaling coefficients ζ and η.
Step 3, constructing a 4-layer fully connected neural network, where each layer contains 10 × N neurons and each hidden layer uses the ReLU activation function to add a nonlinear relation to the data. Note that N is the total number of user beams. The purpose of this step is to build the fully connected neural network structure; it is independent of the DRL parameter initialization.
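The structure described in step 3 — four fully connected layers of 10·N neurons with ReLU on the hidden layers — can be sketched as follows. The patent does not name a framework, so this plain-Python sketch is illustrative only, and the small-random weight initialization is an assumed choice:

```python
import random

def build_mlp(n_in, n_out, seed=0):
    """Build 4 fully connected weight layers; each hidden layer has
    10 * n_in neurons (step 3: 10*N neurons, N = number of user beams)."""
    rnd = random.Random(seed)
    width = 10 * n_in
    sizes = [n_in, width, width, width, n_out]     # 4 weight layers
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        w = [[rnd.gauss(0.0, 0.1) for _ in range(fan_in)] for _ in range(fan_out)]
        b = [0.0] * fan_out
        layers.append((w, b))
    return layers

def forward(layers, x):
    """Forward pass with ReLU on hidden layers (no activation on the output)."""
    for k, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
        if k < len(layers) - 1:
            x = [max(v, 0.0) for v in x]           # ReLU nonlinearity
    return x
```

The same structure is used for both the actor and the critic; only the output dimension n_out differs.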
Step 4, training the deep neural network: first, T different capacity demand scenarios D are generated as the material of network training, where the capacity requirement of each beam in the capacity demand scenario D_t at each moment t follows the uniform distribution U(m, n). The state s_t is input to the actor network, which outputs the mean vector of a multivariate normal distribution for action a_t; the distribution is established according to a given covariance matrix and sampled to obtain a_t = {P_t, B_t}. The obtained resource allocation scheme is substituted into the link model to calculate r_t. After L groups of data [s_t, a_t, r_t] are obtained, the advantage estimate A_l (l = 1, 2, …, L) is calculated according to the following equation (6):
A_l = −V(s_l) + r_l + γ r_{l+1} + … + γ^{L−1−l} r_{L−1} + γ^{L−l} V(s_L)    (6)
wherein V(s_l) is the output of the critic network after receiving state s_l, and γ is the discount factor. It should be noted that since l is a variable, it can take different values, and the advantage estimate differs for different values of l; it can therefore be understood that one state corresponds to one advantage estimate. Then, the loss value of the actor neural network is calculated according to the following formulas (7) and (8), and the loss value of the critic neural network is calculated according to the following formula (9):
L_actor(θ) = E_l[ min( p_l(θ) A_l , clip(p_l(θ), 1−ϵ, 1+ϵ) A_l ) ]    (7)
p_l(θ) = exp( p_θ(a_l|s_l) − p_θold(a_l|s_l) )    (8)

The above formula (7) can be understood as the first loss function in embodiment 1: the loss value of the actor network (i.e., the first loss value in embodiment 1) is calculated using the first loss function, and the upper and lower bounds of the first loss function are limited using the clip function, avoiding excessively large updates to the parameters of the target actor network so that the whole learning and update process is more stable and fluctuates less. Formula (9) can be understood as the second loss function in embodiment 1; the loss value of the critic network (i.e., the second loss value in embodiment 1) is calculated using the second loss function. After the first loss value and the second loss value are obtained, the parameters of the actor network and the critic network are updated using an Adam optimizer or a gradient descent algorithm. Finally, the updated parameters θ of the actor and critic networks are copied to θ_old, and interaction with the environment continues. With this method, the data acquired through the old network's interaction with the environment can be reused for multiple updates, which accelerates learning and training. In formula (8), p_θ(a_l|s_l) represents the logarithmic probability that the network with parameters θ outputs action a_l given input state s_l; this logarithmic probability is calculated using the following equation (10):
p_θ(a_l|s_l) = Σ_{x=1}^{y} [ −(a_{l,x} − μ_x)² / (2σ_x²) − ln σ_x − (1/2) ln(2π) ]    (10)

wherein y is the dimension of the action space, a_{l,x} represents the x-th dimension of action a_l, the dimensions of the action are set to be independently distributed when the multivariate normal distribution is established, and μ_x and σ_x are the mean and standard deviation of the x-th dimension of the action.
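The calculations of equations (6), (7), (8), and (10) can be sketched compactly: advantage estimates via discounted rewards with a value bootstrap, diagonal-Gaussian log-probabilities, probability ratios, and the clipped surrogate loss. A minimal illustration under assumed variable names; the sign convention (maximize vs. minimize) is left to the optimizer setup:

```python
import math

def advantages(rewards, values, gamma):
    """Equation (6): A_l = -V(s_l) + r_l + gamma*r_{l+1} + ...
    + gamma^{L-1-l}*r_{L-1} + gamma^{L-l}*V(s_L); values has length L+1."""
    L = len(rewards)
    adv = []
    for l in range(L):
        ret = sum(gamma ** (k - l) * rewards[k] for k in range(l, L))
        ret += gamma ** (L - l) * values[L]      # bootstrap with V(s_L)
        adv.append(ret - values[l])
    return adv

def gauss_logprob(action, mu, sigma):
    """Equation (10): log-probability of an action under independent
    per-dimension normal distributions (means mu, std deviations sigma)."""
    return sum(-(a - m) ** 2 / (2 * s ** 2) - math.log(s) - 0.5 * math.log(2 * math.pi)
               for a, m, s in zip(action, mu, sigma))

def clipped_actor_loss(logp_new, logp_old, adv, eps):
    """Equations (7)-(8): ratio p_l = exp(logp_new - logp_old); the loss
    averages min(p_l * A_l, clip(p_l, 1-eps, 1+eps) * A_l) over the batch."""
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, adv):
        r = math.exp(ln - lo)
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        total += min(r * a, clipped * a)
    return total / len(adv)
```

In the patent's flow, the critic network supplies the `values` sequence, and the actor network supplies the means from which `logp_new` and `logp_old` are computed.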
Step 5, inputting the beam capacity demand scenario into the trained actor network; the network output is the joint optimization scheme for user beam power and frequency band resource allocation. It should be noted that in practical applications the critic network is no longer used.
The following analysis is performed by way of example: power and frequency band resources are allocated for the user beams of a single satellite internet satellite, in order to simplify the evaluation scenario. Assuming that a single internet satellite has 23 user beams, the method provided by the embodiment of the invention is used to analyze the joint beam power and frequency band resource optimization problem. In order to verify the beneficial effects brought by the embodiment of the invention, the method is compared with a classical genetic algorithm and an improved genetic algorithm in the same scenario. Because a genetic algorithm can obtain a better solution by increasing the number of iterative optimizations at the cost of simulation time, the embodiment of the invention simulates the GA algorithm and the OSGA algorithm with 10 and 20 iterations, and the simulation parameters of the algorithms are shown in table 3:
Table 3. Algorithm simulation parameter settings
Using the classical genetic algorithm GA_20, the improved genetic algorithm OSGA_20 (briefly: because a feasible-solution-preserving strategy is added, the OSGA algorithm prevents solutions in the solution space from moving in a worse direction during the crossover and mutation operations of each iteration; compared with the classical genetic algorithm, the probability of obtaining a better solution in each iteration is therefore greatly improved, and a better solution can be obtained within a given number of iterations), and the method of the present application, joint optimization of user beam power and frequency band resources is carried out under 50 different beam demand scenarios. The average unmet capacity demand of the system UCD_mean, the average RATIO_mean of the UCD over the total beam requirements, and the average simulation time T_mean for deriving an allocation scheme for each beam demand scenario are listed in table 4:
Table 4. Allocation results of the different algorithms
As can be seen from table 4, the three algorithms differ in the UCD_mean parameter index. The UCD_mean of the algorithm provided by the present application lies between those of the OSGA algorithm and the GA algorithm, and is close to the UCD_mean achieved by the OSGA algorithm. The accuracy of the method provided by the present application depends on factors such as the sample data set used in the neural network learning process, the neural network structure, and the PPO algorithm parameters. Because the state space and the action space are continuous in the technical problem to be solved, the state space is huge and the training process can hardly traverse all states; theoretically, a better result can therefore be obtained by further enriching the training data set.
Further, the capacity demand scenario in the system, i.e., the state, is generated according to a uniform distribution and can therefore be considered continuous. In the PPO algorithm used in the training process, the direct output of the actor network is the mean vector of a multivariate normal distribution; the multivariate normal distribution can be established using this mean vector and a covariance matrix set in advance, and then sampled from the vicinity of the mean to obtain the finally output action a_t, i.e., the power and bandwidth allocation results required by the system (corresponding to step 1 of training the neural network, and steps 8-9 in table 1), which can also be considered continuous.
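The sampling step described above — the actor outputs a mean vector, a preset covariance defines the distribution, and the action is drawn from the vicinity of the mean — can be sketched as follows. The patent does not state the covariance structure, so the diagonal (independent per-dimension) form used here is an assumption:

```python
import random

def sample_action(mu, sigma, seed=None):
    """Sample action a_t from a normal distribution centered on the actor
    network's mean output mu, with per-dimension standard deviations sigma
    (a preset diagonal covariance is assumed for illustration)."""
    rnd = random.Random(seed)
    return [rnd.gauss(m, s) for m, s in zip(mu, sigma)]
```

Because both the state distribution and this sampling procedure are continuous, no discretization of the power or bandwidth decision variables is needed.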
In terms of the simulation time index, the iteration time of the GA and OSGA algorithms is far higher than that of the algorithm provided by the present application. The running time of a genetic algorithm is approximately proportional to the number of iterations, and the link calculation model and interference analysis model of the satellite communication system must be called in each iteration. In contrast, with the method provided by the present application, each time the state s_t is received as input, the power and frequency band allocation result of each beam is given directly by the neural network, without a complex link calculation and interference analysis process; this greatly reduces the time required for calculation, and the time efficiency of resource allocation calculation is far superior to that of the other algorithms.
Simulation results show that the deep reinforcement learning method provided by the embodiment of the invention can greatly improve the timeliness of resource allocation calculation at the cost of an acceptable increase in the system's unmet capacity demand; the calculation time required for the same beam demand scenario is less than 1/1000 of that of the genetic algorithm, and a high-quality joint optimization scheme for user beam power and frequency band resources can be obtained.
Example 3:
The embodiment of the invention provides a beam resource allocation device, which is mainly used for executing the beam resource allocation method provided by embodiment 1 above. The beam resource allocation device provided by the embodiment of the invention is specifically described below.
Fig. 4 is a schematic structural diagram of a beam resource allocation apparatus according to an embodiment of the present invention. As shown in fig. 4, the beam resource allocation apparatus mainly includes: an acquisition unit 11 and an input unit 12, wherein:
an acquiring unit 11, configured to acquire beam capacity requirement information of a user terminal;
an input unit 12, configured to input the beam capacity requirement information into a beam resource allocation model, so as to obtain a beam resource allocation result of the user terminal; the beam resource allocation model is a neural network model obtained by performing model training based on different sample beam capacity requirement information and the beam resource allocation results corresponding to the different sample beam capacity requirement information, with the model parameters updated by adopting a proximal policy optimization strategy during the model training process.
In the embodiment of the invention, through the functions of the acquisition unit 11 and the input unit 12, updating the model parameters based on the proximal policy ensures that the obtained beam resource allocation model fits reality well, so that the timeliness of beam resource allocation calculation is guaranteed while the accuracy of the beam resource allocation result is also taken into account.
Optionally, the beam resource allocation apparatus further includes: the system comprises a construction unit, an interaction unit and a first training unit, wherein:
the construction unit is used for constructing an initial beam resource allocation model, wherein the initial beam resource allocation model comprises an agent and an environment module, and the agent comprises an initial actor neural network model and an initial critic neural network model;
the system comprises an interaction unit, an initial actor neural network model and an environment module, wherein the interaction unit is used for acquiring different sample beam capacity requirement information of a user terminal and carrying out interaction operation on the initial actor neural network model and the environment module based on the different sample beam capacity requirement information to obtain at least one strip of data; wherein the number of the state data is the same as the number of the interactive operations, and the state data is data containing states, actions and rewards corresponding to the states;
the first training unit is used for training the initial actor neural network model by adopting a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial actor neural network model converge, to obtain a target actor neural network model.
Optionally, the interaction unit includes: a control execution module, a control output module, a first determining module, a second determining module, and a sending module, wherein:
the control execution module is used for inputting the different sample beam capacity requirement information as an initial state into the initial actor neural network model by controlling the environment module to execute an initial action;
the control output module is used for controlling the initial actor neural network model to output the next action and sending the next action to the environment module;
a first determining module, used for controlling the environment module to determine, based on the next action, a reward corresponding to the initial state and a next state;
a second determining module, used for determining the initial state, the initial action, and the reward corresponding to the initial state as one piece of state data;
the sending module is used for sending the next state to the initial actor neural network model through the environment module, so that the initial actor neural network model and the environment module continue to interact to obtain multiple pieces of state data.
Optionally, the first training unit includes: a third determination module, a first calculation module, and a first update determination module, wherein:
a third determining module, used for determining an advantage estimate set based on the at least one piece of state data, wherein each piece of state data corresponds to one advantage estimate;
the first calculation module is used for calculating a first loss value based on the advantage estimate set and a preset first loss function;
the first updating determining module is used for updating parameters of the initial actor neural network model based on the first loss value, and determining the actor neural network model after parameter updating as a target actor neural network model.
Optionally, the apparatus further comprises a modification module, wherein:
the modification module is used for modifying the initial loss function by limiting the upper and lower bounds of the initial loss function with an objective function, to obtain the first loss function.
Optionally, the apparatus further comprises a second training unit, wherein:
the second training unit is used for training the initial critic neural network model by adopting a proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial critic neural network model converge, to obtain the target critic neural network model.
Optionally, the second training unit comprises a second calculation module and a second update determination module, wherein:
the second calculation module is used for calculating a second loss value based on the advantage estimate set and a preset second loss function;
and the second update determining module is used for updating the parameters of the initial critic neural network model based on the second loss value, and determining the critic neural network model after the parameter update as the target critic neural network model.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus and units described above may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
In an alternative embodiment, the present embodiment further provides an electronic device, including a memory and a processor, where the memory stores a computer program that can be executed on the processor, and the processor, when executing the computer program, implements the steps of the method of the foregoing method embodiments.
In an alternative embodiment, the present embodiment further provides a computer readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to perform the method of the foregoing method embodiments.
In addition, in the description of embodiments of the present invention, unless explicitly stated and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the several embodiments provided in this embodiment, it should be understood that the disclosed method and apparatus may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on such understanding, the technical solution of the present embodiment may be essentially or a part contributing to the prior art or a part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that: the above examples are only specific embodiments of the present invention, and are not intended to limit the scope of the present invention, but it should be understood by those skilled in the art that the present invention is not limited thereto, and that the present invention is described in detail with reference to the foregoing examples: any person skilled in the art may modify or easily conceive of the technical solution described in the foregoing embodiments, or perform equivalent substitution of some of the technical features, while remaining within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (7)

1. A method for beam resource allocation, comprising:
acquiring beam capacity demand information of a user terminal;
inputting the beam capacity demand information into a beam resource allocation model to obtain a beam resource allocation result for the user terminal; wherein the beam resource allocation model is a neural network model obtained by performing model training based on different sample beam capacity demand information and the beam resource allocation results corresponding to the different sample beam capacity demand information, with model parameters updated by a proximal policy optimization (PPO) strategy during model training;
wherein the method further comprises:
constructing an initial beam resource allocation model; wherein the initial beam resource allocation model comprises an agent and an environment module, and the agent comprises an initial actor neural network model and an initial critic neural network model;
acquiring different sample beam capacity demand information of the user terminal, and performing interactive operation between the initial actor neural network model and the environment module based on the different sample beam capacity demand information to obtain at least one piece of state data; wherein the number of pieces of state data is the same as the number of interactive operations, and each piece of state data contains a state, an action, and a reward corresponding to the state;
training the initial actor neural network model with the proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial actor neural network model converge, to obtain a target actor neural network model;
training the initial critic neural network model with the proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial critic neural network model converge, to obtain a target critic neural network model;
wherein training the initial actor neural network model with the proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial actor neural network model converge, to obtain the target actor neural network model, comprises:
determining an advantage estimation set based on the at least one piece of state data; wherein each piece of state data corresponds to one advantage estimate;
calculating a first loss value based on the advantage estimation set and a preset first loss function;
and updating the parameters of the initial actor neural network model based on the first loss value, and determining the actor neural network model with updated parameters as the target actor neural network model.
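The advantage estimation step recited above can be illustrated with a short sketch. The generalized-advantage-estimation (GAE) recursion below is a common way to produce one advantage estimate per piece of state data in proximal-policy-optimization training; the function name, signature, and default coefficients are illustrative assumptions, not details disclosed in the patent.

```python
def advantage_estimates(rewards, values, gamma=0.99, lam=0.95):
    """Compute one advantage estimate per piece of state data via the
    GAE recursion; `rewards` and `values` are aligned per time step."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # Bootstrap with the next value estimate, or 0 at the episode end.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With `gamma=lam=1`, the estimate for each step reduces to the sum of remaining TD errors, which matches the recursion above.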
2. The beam resource allocation method according to claim 1, wherein performing interactive operation between the initial actor neural network model and the environment module to obtain at least one piece of state data comprises:
inputting the different sample beam capacity demand information into the initial actor neural network model as an initial state by controlling the environment module to execute an initial action;
controlling the initial actor neural network model to output a next action, and sending the next action to the environment module;
controlling the environment module to determine, based on the next action, a reward corresponding to the initial state and a next state;
determining the initial state, the initial action, and the reward corresponding to the initial state as one piece of state data;
and sending the next state to the initial actor neural network model through the environment module, so that the initial actor neural network model continues to interact with the environment module, obtaining a plurality of pieces of state data.
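The interaction loop of claim 2 can be sketched as follows. The toy environment module, the stand-in actor, and the reward shaping are all illustrative assumptions (the patent does not disclose concrete state, action, or reward definitions); the sketch only shows how each interaction yields one piece of state data containing a state, an action, and a reward.

```python
import random

class ToyEnvironment:
    """Illustrative environment module: states are beam-capacity demand
    vectors, and the reward penalizes mismatch between total demand and
    total allocated capacity (an assumed reward, not the patent's)."""
    def reset(self, demand):
        self.state = demand  # initial state from sample demand information
        return self.state
    def step(self, action):
        reward = -abs(sum(self.state) - sum(action))
        next_state = [d * random.random() for d in self.state]
        self.state = next_state
        return next_state, reward

def toy_actor(state):
    """Stand-in for the initial actor network: allocates the beam budget
    proportionally to demand (a placeholder policy)."""
    total = sum(state) or 1.0
    return [s / total for s in state]

def collect_state_data(env, demands, steps=3):
    """One piece of state data per interaction: (state, action, reward)."""
    data = []
    for demand in demands:
        state = env.reset(demand)
        for _ in range(steps):
            action = toy_actor(state)
            next_state, reward = env.step(action)
            data.append((state, action, reward))
            state = next_state  # environment feeds the next state back
    return data
```

As in the claim, the number of collected pieces equals the number of interactive operations.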
3. The beam resource allocation method according to claim 1, further comprising, before calculating the first loss value based on the advantage estimation set and the preset first loss function:
modifying an initial loss function by limiting its upper and lower bounds with an objective function, so as to obtain the first loss function.
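Bounding the loss function above and below with an objective function, as recited in claim 3, corresponds to the clipped surrogate objective of proximal policy optimization. Below is a minimal per-sample sketch; the clipping coefficient `epsilon` and the sign convention (negated for gradient descent) are assumed, not stated in the patent.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO-style first loss for one sample: the probability ratio
    (new policy / old policy) is clipped to [1 - epsilon, 1 + epsilon],
    and the pessimistic minimum of the clipped and unclipped terms is
    taken, then negated so the loss can be minimized."""
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return -min(ratio * advantage, clipped * advantage)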
4. The beam resource allocation method according to claim 1, wherein training the initial critic neural network model with the proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial critic neural network model converge, to obtain the target critic neural network model, comprises:
calculating a second loss value based on the advantage estimation set and a preset second loss function;
and updating the parameters of the initial critic neural network model based on the second loss value, and determining the critic neural network model with updated parameters as the target critic neural network model.
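The second loss function of claim 4 is typically a squared error between the critic's value predictions and value targets formed from the advantage estimation set (advantage plus predicted value, i.e. the empirical return). A minimal sketch under that assumption; the patent does not specify the exact form:

```python
def critic_loss(values, returns):
    """Assumed second loss: mean squared error between the critic's value
    predictions and the return targets (advantage + predicted value)."""
    n = len(values)
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / n
```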
5. A beam resource allocation apparatus, comprising:
the acquisition unit is used for acquiring beam capacity demand information of a user terminal;
the input unit is used for inputting the beam capacity demand information into a beam resource allocation model to obtain a beam resource allocation result for the user terminal; wherein the beam resource allocation model is a neural network model obtained by performing model training based on different sample beam capacity demand information and the beam resource allocation results corresponding to the different sample beam capacity demand information, with model parameters updated by a proximal policy optimization strategy during model training;
wherein the beam resource allocation apparatus further comprises:
the construction unit is used for constructing an initial beam resource allocation model; wherein the initial beam resource allocation model comprises an agent and an environment module, and the agent comprises an initial actor neural network model and an initial critic neural network model;
the interaction unit is used for acquiring different sample beam capacity demand information of the user terminal and performing interactive operation between the initial actor neural network model and the environment module based on the different sample beam capacity demand information to obtain at least one piece of state data; wherein the number of pieces of state data is the same as the number of interactive operations, and each piece of state data contains a state, an action, and a reward corresponding to the state;
the first training unit is used for training the initial actor neural network model with the proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial actor neural network model converge, to obtain a target actor neural network model;
the second training unit is used for training the initial critic neural network model with the proximal policy optimization strategy based on the at least one piece of state data until the parameters of the initial critic neural network model converge, to obtain a target critic neural network model;
wherein the first training unit comprises:
a third determining module, configured to determine an advantage estimation set based on the at least one piece of state data; wherein each piece of state data corresponds to one advantage estimate;
a first calculation module, configured to calculate a first loss value based on the advantage estimation set and a preset first loss function;
and a first update determining module, configured to update the parameters of the initial actor neural network model based on the first loss value, and to determine the actor neural network model with updated parameters as the target actor neural network model.
6. An electronic device comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the method of any one of claims 1 to 4.
7. A computer readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to perform the method of any one of claims 1 to 4.
CN202210231703.XA 2022-03-10 2022-03-10 Beam resource allocation method and device Active CN114599100B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210231703.XA CN114599100B (en) 2022-03-10 2022-03-10 Beam resource allocation method and device


Publications (2)

Publication Number Publication Date
CN114599100A CN114599100A (en) 2022-06-07
CN114599100B true CN114599100B (en) 2024-01-19

Family

ID=81816755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210231703.XA Active CN114599100B (en) 2022-03-10 2022-03-10 Beam resource allocation method and device

Country Status (1)

Country Link
CN (1) CN114599100B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10666342B1 (en) * 2019-05-01 2020-05-26 Qualcomm Incorporated Beam management using adaptive learning
CN112134614A (en) * 2020-10-26 2020-12-25 中国人民解放军32039部队 Downlink carrier resource allocation method and system for multi-beam communication satellite
CN113644964A (en) * 2021-08-06 2021-11-12 北京邮电大学 Multi-dimensional resource joint allocation method of multi-beam satellite same-frequency networking system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Intelligent Management Technology for High-Throughput Satellite Spectrum Resources; Ma Shijun; China Master's Theses Electronic Journals; No. 01, 2022; full text *

Also Published As

Publication number Publication date
CN114599100A (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN114362810B (en) Low orbit satellite beam jump optimization method based on migration depth reinforcement learning
Hu et al. Dynamic beam hopping method based on multi-objective deep reinforcement learning for next generation satellite broadband systems
CN114389678B (en) Multi-beam satellite resource allocation method based on decision performance evaluation
CN111867104B (en) Power distribution method and power distribution device for low earth orbit satellite downlink
CN112383922B (en) Deep reinforcement learning frequency spectrum sharing method based on prior experience replay
CN114499629B (en) Dynamic allocation method for jumping beam satellite system resources based on deep reinforcement learning
CN113572517B (en) Beam hopping resource allocation method, system, storage medium and equipment based on deep reinforcement learning
CN108900237B (en) Resource allocation method for multi-beam satellite communication system
CN112134614B (en) Downlink carrier resource allocation method and system for multi-beam communication satellite
CN112583453A (en) Downlink NOMA power distribution method of multi-beam LEO satellite communication system
CN114025330B (en) Air-ground cooperative self-organizing network data transmission method
CN111526592B (en) Non-cooperative multi-agent power control method used in wireless interference channel
CN115021799B (en) Low-orbit satellite switching method based on multi-agent cooperation
Hu et al. A joint power and bandwidth allocation method based on deep reinforcement learning for V2V communications in 5G
CN114071528A (en) Service demand prediction-based multi-beam satellite beam resource adaptation method
CN115175220A (en) Communication resource allocation method and device based on unmanned aerial vehicle ad hoc network
Sun et al. Edge intelligence assisted resource management for satellite communication
CN104581918B (en) Satellite layer-span combined optimization power distribution method based on non-cooperative game
CN115334631A (en) D2D communication power control method based on hunter prey optimization
CN114599100B (en) Beam resource allocation method and device
CN116781141A (en) LEO satellite cooperative edge computing and unloading method based on deep Q network
Xu et al. A novel deep reinforcement learning architecture for dynamic power and bandwidth allocation in multibeam satellites
CN116347635A (en) NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning
CN115765826A (en) Unmanned aerial vehicle network topology reconstruction method for on-demand service
CN113316239B (en) Unmanned aerial vehicle network transmission power distribution method and device based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant